SIMD

Single instruction, multiple data (SIMD) architectures exploit data-level parallelism by processing multiple data points with a single instruction. For instance, on a 64-wide vector machine, a single SIMD add instruction sends 64 pairs of words to 64 different ALUs and gets back 64 sums within a single clock cycle.

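This sounds expensive, and in hardware terms it is: every lane needs its own ALU. The saving is that all the lanes share a single instruction fetch and decode.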
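To make the idea concrete, here is a minimal sketch in C using the 128-bit SSE intrinsics, so the "vector" is only 4 floats wide rather than 64. It assumes an x86-64 target with GCC or Clang; the function names add4_scalar and add4_simd are made up for illustration.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar version: as written, four separate additions
   (an optimising compiler may well vectorise this on its own). */
void add4_scalar(const float *a, const float *b, float *out) {
    for (int i = 0; i < 4; i++) {
        out[i] = a[i] + b[i];
    }
}

/* SIMD version: one add instruction covers all four lanes. */
void add4_simd(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);      /* load 4 floats, no alignment required */
    __m128 vb = _mm_loadu_ps(b);
    __m128 sum = _mm_add_ps(va, vb);  /* 4 additions in one instruction */
    _mm_storeu_ps(out, sum);          /* write the 4 results back */
}
```

Both functions produce the same output; the difference is how many instructions the additions take.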

SIMD instructions are really useful for multimedia and vector applications. One notable application in software is processing many bytes of (user) input at once instead of reading byte by byte. See this blog post by Mitchell Hashimoto.
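As a rough illustration of the byte-streaming idea (this is not taken from the blog post), the sketch below scans a buffer 16 bytes at a time for a given byte using SSE2. The function name find_byte is hypothetical, and __builtin_ctz is a GCC/Clang builtin, so this assumes one of those compilers on x86-64.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Returns the index of the first `needle` in buf[0..len), or len if absent. */
size_t find_byte(const unsigned char *buf, size_t len, unsigned char needle) {
    const __m128i pattern = _mm_set1_epi8((char)needle);  /* needle in all 16 lanes */
    size_t i = 0;

    /* Main loop: compare 16 bytes per iteration. */
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i eq = _mm_cmpeq_epi8(chunk, pattern);  /* 0xFF where bytes match */
        int mask = _mm_movemask_epi8(eq);             /* one bit per byte lane */
        if (mask != 0) {
            /* Lowest set bit = first matching byte in this chunk. */
            return i + (size_t)__builtin_ctz((unsigned)mask);
        }
    }

    /* Scalar tail for the last len % 16 bytes. */
    for (; i < len; i++) {
        if (buf[i] == needle) {
            return i;
        }
    }
    return len;
}
```

The scalar tail matters: without it, the last few bytes of a buffer whose length isn't a multiple of 16 would never be checked.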

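x86 is one of the main architectures that supports SIMD (others, such as ARM with NEON, have their own extensions), and it has done so since the late 90s. MMX (1997) came first and was integer-only. SSE (1999) extended SIMD to single-precision floating-point numbers. SSE2 extended this to double-precision floats, and to integers in the same 128-bit registers.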

Hardware

x86-64 has a few sets of SIMD registers:

xmm0 to xmm15 are 128-bit registers (added in SSE, 1999). We use __m128, __m128d and __m128i intrinsic types to specify f32, f64, and integer data (the latter two types arrived with SSE2).
ymm0 to ymm15 are 256-bit registers (added in AVX, 2011). We use __m256, __m256d, __m256i.
zmm0 to zmm31 are 512-bit registers (added in AVX-512, which reached mainstream Intel CPUs in 2017). We use __m512, __m512d, and __m512i. AVX-512 also extends the xmm and ymm sets to 32 registers.
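The register width determines how many elements fit in one register. A small sketch, sticking to the 128-bit types so that it compiles on any x86-64 target without extra flags (the helper names are made up):

```c
#include <emmintrin.h>  /* SSE2: __m128, __m128d, __m128i */
#include <stddef.h>

/* Number of lanes in a 128-bit register for each element type. */
size_t lanes_f32(void) { return sizeof(__m128)  / sizeof(float);  }  /* 4 floats  */
size_t lanes_f64(void) { return sizeof(__m128d) / sizeof(double); }  /* 2 doubles */
size_t lanes_i8(void)  { return sizeof(__m128i) / sizeof(char);   }  /* 16 bytes  */
```

The same arithmetic applies to the wider registers: a 256-bit ymm register holds 8 floats, and a 512-bit zmm register holds 16.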

Programming

For the most part, compilers still struggle with auto-vectorising code and converting sequential code to SIMD. There are two approaches we can use:

We either need to write our code in a very deliberate way, such that the compiler can recognise it’s vectorisable. See this blog post by Alex Kladov.
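Or we can specify SIMD code directly, and the compiler doesn’t need to do any magic. This is done via intrinsics. In C/C++, the intrinsic functions begin with one underscore (e.g. _mm_add_ps) and the intrinsic types with two (e.g. __m128); on x86, both come from the <immintrin.h> header.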
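The two approaches can be sketched side by side. This is an illustrative example, not taken from the linked post: scale_auto is a plain loop written to be easy to vectorise (simple indexing, restrict to rule out aliasing), while scale_intrin spells out the SSE instructions itself. Whether the first version actually gets vectorised depends on the compiler and optimisation level. Both function names are hypothetical.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Approach 1: a plain loop. `restrict` promises that dst and src do not
   overlap, which removes one common obstacle to auto-vectorisation. */
void scale_auto(float *restrict dst, const float *restrict src, float k, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] * k;
    }
}

/* Approach 2: explicit intrinsics, 4 floats per iteration. */
void scale_intrin(float *dst, const float *src, float k, size_t n) {
    const __m128 vk = _mm_set1_ps(k);  /* k broadcast to all 4 lanes */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);
        _mm_storeu_ps(dst + i, _mm_mul_ps(v, vk));
    }
    /* Scalar tail for the remaining n % 4 elements. */
    for (; i < n; i++) {
        dst[i] = src[i] * k;
    }
}
```

The intrinsic version is longer and tied to x86, which is the trade-off for not depending on the compiler's vectoriser.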

Intrinsics

We can declare variables that live in SIMD registers using the types listed above. Note that a variable doesn’t map to a particular register; the type just says which kind of SIMD register it needs and what data type it holds (this is for type-checking, since float and integer operations differ).

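Specific operations use a naming convention _mm<SIZE>_<ACTION>_<TYPE>. SIZE is empty for 128-bit operations and 256 or 512 for the wider ones. ACTION names the operation (add, mul, loadu and so on). TYPE gives the element type, such as ps (packed single-precision), pd (packed double-precision) or epi32 (packed signed 32-bit integers). For example, _mm256_add_ps adds eight floats at once.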
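A short sketch of the convention in use, with one wrapper per element type. Only 128-bit operations are used here (so SIZE is empty), which keeps the example compilable on any x86-64 target without extra flags. The wrapper names are made up for illustration.

```c
#include <emmintrin.h>  /* SSE2; also pulls in the SSE header */

/* _mm_sub_ps: 128-bit, subtract, packed single-precision (4 floats). */
void sub_f32x4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_sub_ps(va, vb));
}

/* _mm_mul_pd: 128-bit, multiply, packed double-precision (2 doubles). */
void mul_f64x2(const double *a, const double *b, double *out) {
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_mul_pd(va, vb));
}

/* _mm_add_epi32: 128-bit, add, packed 32-bit integers (4 ints). */
void add_i32x4(const int *a, const int *b, int *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```

Note how the variable type changes with the suffix: __m128 for ps, __m128d for pd, and __m128i for the integer operations. Passing an __m128i to _mm_sub_ps would be a type error, which is the type-checking mentioned above.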

Resources

Parsing Gigabytes of JSON per Second, by Geoff Langdale and Daniel Lemire

jszhn
