Single instruction, multiple data (SIMD) architectures support data-level parallelism by applying a single instruction to multiple data points at once. For instance, with a 64-wide vector, a single SIMD add sends 64 pairs of words to 64 different ALUs and gets back 64 sums within a single clock cycle.
This sounds expensive as shit.
SIMD instructions are really useful for multimedia and vector applications. Notably, one application in software is to read (user) input in multi-byte chunks instead of byte by byte. See this blog post by Mitchell Hashimoto.
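As a rough sketch of the byte-streaming idea (not taken from the linked post), the C function below counts how often a byte occurs in a buffer by comparing 16 bytes at a time with SSE2 intrinsics, which are covered under Intrinsics further down. The name `count_byte` is hypothetical, and an x86-64 target is assumed.

```c
#include <emmintrin.h>  // SSE2 intrinsics (baseline on x86-64)
#include <stddef.h>

// Hypothetical helper: count occurrences of `needle` in buf[0..len).
size_t count_byte(const unsigned char *buf, size_t len, unsigned char needle) {
    // Broadcast the needle into all 16 byte lanes.
    const __m128i pattern = _mm_set1_epi8((char)needle);
    size_t count = 0;
    size_t i = 0;

    // Main loop: 16 bytes per iteration.
    for (; i + 16 <= len; i += 16) {
        // Unaligned 128-bit load of the next chunk.
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        // Each lane becomes 0xFF where the byte matches, 0x00 otherwise.
        __m128i eq = _mm_cmpeq_epi8(chunk, pattern);
        // Collapse the 16 lanes into a 16-bit mask, one bit per lane.
        unsigned mask = (unsigned)_mm_movemask_epi8(eq);
        // Count the set bits by clearing the lowest one each round.
        while (mask) {
            mask &= mask - 1;
            count++;
        }
    }

    // Scalar tail for the last len % 16 bytes.
    for (; i < len; i++) {
        count += (buf[i] == needle);
    }
    return count;
}
```

The scalar tail keeps the function correct for lengths that are not a multiple of 16; a real parser would likely do something smarter with the mask than just counting bits.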
x86 is one of the main architectures that supports SIMD, and it has done so since the late 90s. MMX (1997) introduced integer SIMD, SSE (1999) extended it to single-precision floating-point numbers, and SSE2 extended this to double-precision floats.
Hardware
x86-64 has a few sets of SIMD registers:
- `xmm0` to `xmm15` are 128-bit registers (added in SSE, 1999). We use the `__m128`, `__m128d` and `__m128i` intrinsic types for `f32`, `f64`, and integer data.
- `ymm0` to `ymm15` are 256-bit registers (added in AVX, 2011). We use `__m256`, `__m256d`, `__m256i`.
- `zmm0` to `zmm31` are 512-bit registers (added in AVX512, 2017). We use `__m512`, `__m512d`, and `__m512i`.
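To make the 128-bit types concrete, here is a small C sketch that assumes an x86-64 compiler with SSE2 available. The helpers `sum_lanes_f32` and `sum_lanes_f64` are hypothetical names used only for illustration.

```c
#include <emmintrin.h>  // SSE2; also provides the SSE type __m128

// All three types are 16 bytes wide, matching one xmm register.
_Static_assert(sizeof(__m128)  == 16, "__m128 holds 4 x f32");
_Static_assert(sizeof(__m128d) == 16, "__m128d holds 2 x f64");
_Static_assert(sizeof(__m128i) == 16, "__m128i holds integer lanes");

// Hypothetical helper: add up the four f32 lanes by storing them to memory.
float sum_lanes_f32(__m128 v) {
    float lanes[4];
    _mm_storeu_ps(lanes, v);  // unaligned store of 4 floats
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

// Hypothetical helper: add up the two f64 lanes the same way.
double sum_lanes_f64(__m128d v) {
    double lanes[2];
    _mm_storeu_pd(lanes, v);  // unaligned store of 2 doubles
    return lanes[0] + lanes[1];
}
```

The same 128 bits can hold four floats, two doubles, or several integer layouts; the type only tells the compiler which interpretation to check against.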
Programming
For the most part, compilers still struggle with auto-vectorising code and converting sequential code to SIMD. There are two approaches we can use:
- We either need to write our code in a very deliberate way, such that the compiler can recognise it’s vectorisable. See this blog post by Alex Kladov.
- Or we can specify SIMD code directly, so the compiler doesn’t need to do any magic. This is done via intrinsics. In C/C++, these begin with either one or two underscores: one for functions (`_mm_add_ps`) and two for types (`__m128`).
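As an illustration of the first approach, the loop below is written so that a compiler has a good chance of vectorising it on its own. This is a sketch under assumptions: `add_arrays` is a hypothetical name, and whether the loop is actually vectorised depends on the compiler, version, and optimisation flags.

```c
#include <stddef.h>

// Element-wise add, written to be easy to auto-vectorise:
//  - `restrict` promises the three arrays do not overlap,
//  - the trip count is known before the loop starts,
//  - there are no early exits and no dependence between iterations.
void add_arrays(float *restrict dst,
                const float *restrict a,
                const float *restrict b,
                size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

To check what the compiler did, GCC can report vectorised loops with `-fopt-info-vec` and Clang with `-Rpass=loop-vectorize`; looking at the generated assembly is the more reliable test.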
Intrinsics
We can specify variables that use SIMD registers with the types listed above. Note that this doesn’t map a variable to a particular register. It just associates the variable name with a kind of SIMD register and a data type (the latter is for type-checking, since float and integer operations differ).
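Specific operations use the naming convention `_mm<SIZE>_<ACTION>_<TYPE>`. For example, `_mm256_add_ps` adds packed single-precision floats in 256-bit registers. For 128-bit registers `<SIZE>` is left empty, as in `_mm_add_ps`.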
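A minimal sketch of the convention in use, assuming an x86-64 target with SSE: the 128-bit intrinsics `_mm_loadu_ps`, `_mm_add_ps` and `_mm_storeu_ps` all have an empty size, an action (`loadu`, `add`, `storeu`) and the type suffix `ps` (packed single-precision). The function name `add_arrays_sse` is hypothetical.

```c
#include <xmmintrin.h>  // SSE intrinsics
#include <stddef.h>

// Hypothetical helper: dst[i] = a[i] + b[i], four floats per iteration.
void add_arrays_sse(float *dst, const float *a, const float *b, size_t n) {
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   // load 4 floats (unaligned)
        __m128 vb = _mm_loadu_ps(b + i);   // load 4 floats (unaligned)
        __m128 sum = _mm_add_ps(va, vb);   // 4 additions in one instruction
        _mm_storeu_ps(dst + i, sum);       // store 4 floats (unaligned)
    }

    // Scalar tail for the last n % 4 elements.
    for (; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

The unaligned `loadu`/`storeu` variants are used here so that the caller does not have to guarantee 16-byte alignment.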
Resources
- Parsing Gigabytes of JSON per Second, by Geoff Langdale and Daniel Lemire
See also
- Vector architecture, a pipelined sequential version of SIMD that other architectures use