Single instruction, multiple data (SIMD) architectures support data-level parallelism by applying a single instruction to multiple data elements at once. For instance, for a 64-wide vector, a SIMD add instruction will send 64 pairs of words to 64 different ALUs and get back 64 sums within a single clock cycle.
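As a rough sketch of what this looks like from software, here is an element-wise add in C using SSE intrinsics, which are only 4 lanes wide for 32-bit floats rather than the 64 in the example above. The function name `add_f32` is made up for illustration, and the code falls back to a plain scalar loop when the compiler is not targeting SSE.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h> /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n). */
void add_f32(const float *a, const float *b, float *out, size_t n) {
    assert(a && b && out);
    size_t i = 0;
#if defined(__SSE__)
    /* Each _mm_add_ps adds four float lanes with one instruction. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i); /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop on targets without SSE. */
    for (; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

The scalar loop at the end handles whatever is left over when `n` is not a multiple of the lane count, which is a common pattern in hand-written SIMD code.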

This sounds expensive as shit.

SIMD instructions are especially useful for multimedia and vector applications. One notable application in software is processing (user) input many bytes at a time instead of reading it byte by byte. See this blog post by Mitchell Hashimoto.
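To make the "many bytes at a time" idea concrete, here is a minimal sketch in C that scans a buffer for a delimiter byte 16 bytes per step using SSE2. It is not taken from the blog post; the name `find_byte` is hypothetical, and `__builtin_ctz` is a GCC/Clang builtin, so the vector path is only compiled when both SSE2 and a GNU-compatible compiler are available.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__) && defined(__GNUC__)
#include <emmintrin.h> /* SSE2 intrinsics */
#define USE_SSE2 1
#endif

/* Hypothetical helper: index of the first `needle` in buf[0..n), or n if absent. */
size_t find_byte(const unsigned char *buf, size_t n, unsigned char needle) {
    assert(buf != NULL || n == 0);
    size_t i = 0;
#ifdef USE_SSE2
    /* Broadcast the needle into all 16 byte lanes. */
    const __m128i pattern = _mm_set1_epi8((char)needle);
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        /* Compare 16 bytes at once; movemask packs the results into 16 bits. */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, pattern));
        if (mask != 0) {
            /* Lowest set bit is the first matching byte in this chunk. */
            return i + (size_t)__builtin_ctz((unsigned)mask);
        }
    }
#endif
    /* Scalar tail, and the whole scan when the SSE2 path is disabled. */
    for (; i < n; i++) {
        if (buf[i] == needle) {
            return i;
        }
    }
    return n;
}
```

The compare-then-movemask pattern turns 16 byte comparisons into a single integer test, so the common "no delimiter here" case costs a handful of instructions per 16 bytes of input.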

x86 is one of the main architectures that supports SIMD (ARM, for example, has NEON), and it has done so since the late 90s, starting with MMX for packed integers. SSE extended x86 SIMD to single-precision floating-point numbers. SSE2 extended this to double-precision floats.

https://arxiv.org/pdf/1902.08318

See also