Why This Matters
A 256-bit AVX2 register holds eight float32 values. One vfmadd231ps instruction can compute eight fused multiply-adds, so a core issuing two such instructions per cycle performs 32 single-precision floating-point operations per cycle. At 3 GHz, that is 96 GFLOP/s before memory limits and instruction scheduling are counted.
Dense linear algebra kernels in ML are built around this fact. Matrix multiply, convolution lowering, layer normalization, and attention projections all spend most cycles in loops with repeated multiply-adds over contiguous arrays. SIMD is the CPU mechanism that turns those scalar operations into lane-parallel work.
Core Definitions
SIMD
Single instruction, multiple data. A SIMD instruction names one operation and applies it to several fixed-width lanes packed into a vector register. For example, a 128-bit register can hold four 32-bit floats, a 256-bit register can hold eight, and a 512-bit register can hold sixteen.
Lane
A lane is one element position inside a vector register. For float32 in a 256-bit AVX register, lane 0 is bits 0 through 31, lane 1 is bits 32 through 63, and lane 7 is bits 224 through 255. Most arithmetic is vertical, meaning lane i of the result depends only on lane i of each input.
Fused Multiply-Add
FMA computes a*b + c as one instruction with one final rounding step. For SIMD floats, _mm256_fmadd_ps(a, b, c) computes eight independent single-precision FMAs in AVX2.
Horizontal Operation
A horizontal operation combines values across lanes, such as summing all eight lanes of an AVX2 register. Horizontal work is needed for reductions like dot products, but it requires shuffles, extracts, or dedicated horizontal instructions.
The SIMD Register Model
SIMD registers are bit strings interpreted by instructions. The register does not know whether its bytes are floats, integers, masks, or packed bytes. The instruction chooses the interpretation.
A 128-bit XMM register holding four float32 values [1.0, 2.0, -3.5, 0.5] has this little-endian byte layout in memory when stored with an unaligned store.
lane: 0 1 2 3
value: 1.0 2.0 -3.5 0.5
hex bytes: 00 00 80 3f 00 00 00 40 00 00 60 c0 00 00 00 3f
bit range: 31..0 63..32 95..64 127..96
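To check that layout directly, a short test like the sketch below (the structure and names are illustrative; it assumes an x86-64 compiler, where SSE2 is the baseline) stores the vector and prints the bytes.
#include <immintrin.h>
#include <stdio.h>
int main(void) {
    // _mm_setr_ps places its first argument in lane 0.
    __m128 v = _mm_setr_ps(1.0f, 2.0f, -3.5f, 0.5f);
    float tmp[4];
    _mm_storeu_ps(tmp, v);                               // unaligned 16-byte store
    const unsigned char *bytes = (const unsigned char *)tmp;
    for (int i = 0; i < 16; i++)
        printf("%02x ", bytes[i]);                       // 00 00 80 3f 00 00 00 40 ...
    printf("\n");
    return 0;
}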
A vertical add keeps lanes separate.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
a + b = [11.0, 22.0, 33.0, 44.0]
No carry or data movement crosses lane boundaries for this float32 add. Integer SIMD has a similar packed form, though integer instructions also come in saturating, widening, and narrowing variants.
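As a small illustration of those variants, the sketch below (function names are illustrative; SSE2 baseline) contrasts the wrapping and saturating forms of a packed 16-bit add.
#include <immintrin.h>
// With a lane holding 32767 and an addend of 1:
//   _mm_add_epi16  wraps around to -32768
//   _mm_adds_epi16 saturates and stays at 32767
__m128i add16_wrapping(__m128i a, __m128i b)   { return _mm_add_epi16(a, b); }
__m128i add16_saturating(__m128i a, __m128i b) { return _mm_adds_epi16(a, b); }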
The common lane counts are determined by vector width and element size.
width float32 lanes float64 lanes int8 lanes
128 bits 4 2 16
256 bits 8 4 32
512 bits 16 8 64
For ML kernels, float32, bfloat16, float16, int8, and int32 matter most. Accumulation often uses a wider type than inputs, such as int8 multiplication with int32 accumulation.
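One common AVX2 pattern for that combination (a sketch; the function name is illustrative, and it assumes the unsigned-by-signed operand order that vpmaddubsw requires, which quantized kernels arrange their data to satisfy) multiplies int8 lanes, pairs them into int16, then widens into int32 accumulators.
#include <immintrin.h>
// One step of an int8 dot product with int32 accumulation.
// maddubs: u8 * s8, adjacent pairs summed into (saturating) int16 lanes.
// madd with ones: int16 pairs widened and summed into int32 lanes.
__m256i dot_step_u8s8(__m256i acc_i32, __m256i a_u8, __m256i b_s8) {
    __m256i pairs16 = _mm256_maddubs_epi16(a_u8, b_s8);
    __m256i quads32 = _mm256_madd_epi16(pairs16, _mm256_set1_epi16(1));
    return _mm256_add_epi32(acc_i32, quads32);
}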
x86 and ARM SIMD Families
x86 SIMD grew in layers. MMX used 64-bit registers and integer multimedia operations. SSE introduced 128-bit XMM registers and packed single-precision floating-point operations. SSE2 added double-precision and integer operations in XMM registers, making it a baseline for 64-bit x86 software.
AVX widened floating-point vectors to 256-bit YMM registers and introduced a three-operand instruction encoding. Instead of overwriting one input, AVX can write a separate destination.
; SSE style, destination is also an input
addps xmm0, xmm1 ; xmm0 = xmm0 + xmm1
; AVX style, distinct destination
vaddps ymm0, ymm1, ymm2 ; ymm0 = ymm1 + ymm2
AVX2 extended 256-bit support to integer operations and added gathers. FMA3 added fused multiply-add instructions such as vfmadd231ps. AVX-512 added 512-bit ZMM registers, mask registers k0 through k7, more encodings, and many data movement variants. AMX adds tile registers for matrix-style operations, mainly for low-precision dense linear algebra.
AVX-512 masks make predication explicit. A masked add can update only selected lanes.
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10,10,10,10,10,10,10,10]
mask = 10011001b
result = [11,2,3,14,15,6,7,18] with merge masking
Bit i of the mask controls lane i, so lanes 0, 3, 4, and 7 are updated and the inactive lanes keep their values from a. Masking removes many scalar tail loops, since the last partial vector can be computed with inactive lanes suppressed. Some processors reduce clock frequency for wide AVX or AVX-512 regions; treat that as a machine-specific scheduling issue, not a change in the programming model.
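In intrinsics, merge masking appears as an extra merge-source operand plus the mask. A minimal sketch (the function name is illustrative; the 256-bit form needs AVX-512F with AVX-512VL):
#include <immintrin.h>
// Lanes whose mask bit is 1 receive a + b; lanes whose bit is 0 keep the
// value from the first argument, the merge source.
__m256i masked_add_epi32(__m256i a, __m256i b, __mmask8 k) {
    return _mm256_mask_add_epi32(a, k, a, b);
}
// k = 0x99 (10011001b) reproduces the example above: lanes 0, 3, 4, and 7
// are updated and the remaining lanes pass a through unchanged.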
ARM NEON has fixed 128-bit vector registers. It is widely used on phones, tablets, and ARM servers. NEON code resembles SSE in that vector length is fixed by the ISA. SVE and SVE2 choose a different contract. The program is vector-length agnostic, and the hardware vector length can vary across implementations.
A typical SVE loop uses predicates rather than hard-coding the lane count.
// SVE-like pseudocode, omitting headers and exact types.
for (uint64_t i = 0; i < n; i += svcntw()) {
pg = svwhilelt_b32(i, n); // active lanes for i..n-1
x = svld1(pg, &a[i]);
y = svld1(pg, &b[i]);
acc = svmla_m(pg, acc, x, y); // acc = acc + x*y on active lanes
}
The same binary can run with 128-bit, 256-bit, or wider SVE vectors. Correct SVE code asks the hardware for svcntw() and predicates the tail.
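Filled out with the ACLE intrinsics from arm_sve.h, the loop becomes the sketch below (the function name is illustrative, and it assumes a compiler with SVE enabled, for example -march=armv8-a+sve on GCC or Clang); the only new step is the final cross-lane svaddv reduction.
#include <arm_sve.h>
#include <stdint.h>
float dot_sve(const float *a, const float *b, uint64_t n) {
    svfloat32_t acc = svdup_n_f32(0.0f);
    for (uint64_t i = 0; i < n; i += svcntw()) {    // svcntw(): float32 lanes per vector
        svbool_t pg = svwhilelt_b32_u64(i, n);      // active lanes for indices i..n-1
        svfloat32_t x = svld1_f32(pg, &a[i]);       // inactive lanes load as zero
        svfloat32_t y = svld1_f32(pg, &b[i]);
        acc = svmla_f32_m(pg, acc, x, y);           // acc += x*y on active lanes only
    }
    return svaddv_f32(svptrue_b32(), acc);          // horizontal sum across all lanes
}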
Vertical Arithmetic, Horizontal Reductions, and Memory
SIMD arithmetic is cheapest when the loop is vertical and contiguous. Elementwise ReLU, vector addition, and the inner loop of a dot product match the hardware well.
for i in 0..7
c[i] = a[i] * b[i] + c[i]
That maps directly to one packed FMA for eight float32 lanes on AVX2. The dot product has an extra step. After accumulating partial sums in lanes, the lanes must be reduced to one scalar.
For arrays
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [0.5, -1, 2, 3, 0, 4, -2, 1]
one AVX2 FMA from zero produces
lane products = [0.5, -2, 6, 12, 0, 24, -14, 8]
sum = 34.5
The vector instruction computes the lane products, but the final 34.5 needs a horizontal sum. Implementations often keep several accumulators to reduce dependency chains, then reduce at the end.
Memory can dominate. A packed load of eight contiguous floats is one 32-byte load. A gather with eight scattered floats can issue one instruction, yet it may touch eight cache lines.
float indices: [0, 16, 32, 48, 64, 80, 96, 112]
byte offsets: [0, 64,128,192,256,320,384,448]
cache lines: [0, 1, 2, 3, 4, 5, 6, 7]
With 64-byte cache lines, that gather requests eight lines to obtain 32 useful bytes. Scatters have the same problem and also interact poorly with store buffers and write combining. Dense ML kernels therefore reorder data, pack matrices, or use layouts where the hot inner loop reads contiguous vectors.
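For reference, that gather corresponds to a single AVX2 intrinsic; a minimal sketch (the function name is illustrative, and the cache-line cost described above applies unchanged):
#include <immintrin.h>
// Gather eight floats at 32-bit indices, scaled by sizeof(float).
// One instruction, but with the strided indices above it still touches
// eight different cache lines to deliver 32 useful bytes.
__m256 gather8(const float *base, __m256i idx) {
    return _mm256_i32gather_ps(base, idx, 4);
}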
Worked Example: AVX2 Dot Product with FMA
The following C function computes a float dot product using AVX2 and FMA intrinsics. It assumes the compiler is invoked with flags such as -O3 -mavx2 -mfma on GCC or Clang.
#include <immintrin.h>
#include <stddef.h>
float dot_avx2_fma(const float *a, const float *b, size_t n) {
    size_t i = 0;
    __m256 acc = _mm256_setzero_ps();
    // Main loop: eight lanes per iteration, one packed FMA per iteration.
    for (; i + 8 <= n; i += 8) {
        __m256 x = _mm256_loadu_ps(a + i);
        __m256 y = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(x, y, acc);
    }
    // Horizontal reduction: fold the 256-bit accumulator to one scalar.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 sum128 = _mm_add_ps(lo, hi);
    sum128 = _mm_hadd_ps(sum128, sum128);
    sum128 = _mm_hadd_ps(sum128, sum128);
    float sum = _mm_cvtss_f32(sum128);
    // Scalar tail for the final n mod 8 elements.
    for (; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
For n = 8 with the arrays above, the loop runs once. The accumulator becomes [0.5, -2, 6, 12, 0, 24, -14, 8]. The extract and add combine the low and high 128-bit halves.
lo = [0.5, -2, 6, 12]
hi = [0, 24, -14, 8]
sum128 = [0.5, 22, -8, 20]
first hadd = [22.5, 12, 22.5, 12]
second hadd = [34.5, 34.5, 34.5, 34.5]
A compiler might produce a loop body resembling this assembly.
.Lloop:
vmovups ymm1, YMMWORD PTR [rdi+rax*4]            ; load 8 floats from a
vfmadd231ps ymm0, ymm1, YMMWORD PTR [rsi+rax*4]  ; acc += a * b, 8 lanes
add rax, 8
cmp rax, rdx                                     ; rdx holds n - 8
jbe .Lloop
The intrinsic version fixes the vector shape and FMA use. It still leaves register allocation and instruction scheduling to the compiler.
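The single accumulator also forms a loop-carried dependency chain: each FMA waits for the previous one, so the loop runs at FMA latency rather than FMA throughput. The multiple-accumulator idea mentioned earlier breaks that chain; a minimal sketch, assuming for brevity that n is a multiple of 32 so no tail handling is needed:
#include <immintrin.h>
#include <stddef.h>
// Four independent accumulators hide FMA latency; reduce them once at the end.
float dot_avx2_4acc(const float *a, const float *b, size_t n) {
    __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 32) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),      _mm256_loadu_ps(b + i),      acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),  _mm256_loadu_ps(b + i + 8),  acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16), _mm256_loadu_ps(b + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24), _mm256_loadu_ps(b + i + 24), acc3);
    }
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
    // Same horizontal reduction as in dot_avx2_fma.
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}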
Auto-Vectorization and Intrinsics
Auto-vectorization works best for counted loops with unit-stride memory accesses, no loop-carried dependencies, and pointers the compiler can prove do not alias. This loop is easy.
void saxpy(float *restrict y, const float *restrict x,
float a, size_t n) {
for (size_t i = 0; i < n; i++) {
y[i] = a * x[i] + y[i];
}
}
The restrict qualifiers say that x and y do not overlap. Without that, the compiler may need a runtime alias check or may choose scalar code. The following loop is harder because each iteration writes through an index array.
void scatter_add(float *y, const float *x,
const int *idx, size_t n) {
for (size_t i = 0; i < n; i++) {
y[idx[i]] += x[i];
}
}
If two indices are equal, iterations conflict. Even with AVX-512 scatter instructions, the data dependency and irregular memory pattern make this a poor SIMD fit.
Intrinsics are used when the compiler misses an intended vectorization, when the code needs a specific instruction such as _mm256_fmadd_ps, or when a library kernel needs fixed register blocking. The cost is portability. x86 AVX2, AVX-512, NEON, and SVE need separate code paths or an abstraction layer that still exposes layout, alignment, and tail handling.
Key Result
SIMD gives a peak compute ceiling, not an application speed guarantee. For one core, a useful upper bound is
peak FLOP/s = lanes per vector × flops per lane per FMA × FMA instructions per cycle × clock frequency
For AVX2 single-precision FMA with eight lanes, two flops per lane, two FMA instructions per cycle, and 3 GHz, the bound is
8 × 2 × 2 × 3 GHz = 96 GFLOP/s
That number assumes the loop has enough independent accumulators, the loads arrive on time, and the front end supplies instructions. A dot product reads two floats for each multiply-add, so it performs 2 flops while reading 8 bytes. Its arithmetic intensity is 2 / 8 = 0.25 flops per byte. If a core receives 32 GB/s from its memory path for that stream, the memory-side bound is 32 GB/s × 0.25 flops per byte = 8 GFLOP/s, far below the SIMD compute ceiling.
Matrix multiply has much higher reuse. A blocked kernel can load panels of A and B and use them for many FMAs. That is why GEMM kernels can approach SIMD peak while naive reductions often cannot. GPUs extend the same data-parallel idea through SIMT execution, and tensor cores or AMX-style matrix units add instructions that compute small matrix products rather than lane-wise scalar products.
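To make that reuse concrete, the toy register-blocked micro-kernel below (a sketch only; the name, the 4x8 blocking, and the row-major layout are illustrative, and production GEMMs also pack their operands first) reuses each vector loaded from B across four FMAs, so the same bytes feed many operations before leaving the registers.
#include <immintrin.h>
#include <stddef.h>
// Computes a 4x8 block of C = A * B. A has 4 contiguous rows of length k,
// B rows have stride ldb, C rows have stride ldc.
void microkernel_4x8(const float *A, const float *B, float *C,
                     size_t k, size_t ldb, size_t ldc) {
    __m256 c0 = _mm256_setzero_ps(), c1 = _mm256_setzero_ps();
    __m256 c2 = _mm256_setzero_ps(), c3 = _mm256_setzero_ps();
    for (size_t p = 0; p < k; p++) {
        __m256 b = _mm256_loadu_ps(B + p * ldb);             // 8 columns of row p of B
        c0 = _mm256_fmadd_ps(_mm256_set1_ps(A[0 * k + p]), b, c0);
        c1 = _mm256_fmadd_ps(_mm256_set1_ps(A[1 * k + p]), b, c1);
        c2 = _mm256_fmadd_ps(_mm256_set1_ps(A[2 * k + p]), b, c2);
        c3 = _mm256_fmadd_ps(_mm256_set1_ps(A[3 * k + p]), b, c3);
    }
    _mm256_storeu_ps(C + 0 * ldc, c0);
    _mm256_storeu_ps(C + 1 * ldc, c1);
    _mm256_storeu_ps(C + 2 * ldc, c2);
    _mm256_storeu_ps(C + 3 * ldc, c3);
}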
Common Confusions
A Vector Instruction Is Not Automatically Eight Times Faster
Eight lanes reduce the arithmetic instruction count, but they do not remove memory stalls, branch mispredictions, or reduction latency. A loop that gathers eight cache lines to fill one vector can run slower than a scalar loop over contiguous memory.
Horizontal Add Is Not the Same Cost as Vertical Add
_mm256_add_ps adds corresponding lanes. Summing all lanes needs cross-lane movement. On AVX2, crossing the 128-bit halves needs an extract or permute, so reductions are usually delayed until after many vertical FMAs.
SVE Code Should Not Assume 256 Bits
SVE is vector-length agnostic. Hard-coding eight float32 lanes defeats the model and can produce wrong tails or poor code on another SVE machine. Use predicates and svcntw() style loop increments.
Exercises
Problem
An AVX2 loop processes float32 arrays with one _mm256_fmadd_ps per iteration. For n = 1000, how many full vector iterations run, how many scalar tail elements remain, and how many floating-point operations are done by the vector iterations?
Problem
A 128-bit vector stores four float32 values [4.0, -1.0, 0.25, 8.0] and is written to little-endian memory. Give the 16 bytes in hexadecimal.
Problem
The loop below is compiled without vector instructions on your target. Name two specific reasons the compiler may refuse to vectorize it, then rewrite the function signature or loop to remove one reason.
void f(float *y, float *x, int *p, int n) {
for (int i = 0; i < n; i++) {
y[i] = y[i] + x[p[i]];
}
}
References
Canonical:
- Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 6th ed. (2017), §1.4 and §3.3, quantitative performance and data-level parallelism
- Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 6th ed. (2017), §2.2 and §2.6, memory hierarchy effects on vector throughput
- Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd ed. (2016), ch. 5, optimizing program performance and loop-level parallelism
- Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd ed. (2016), ch. 6, cache behavior and memory locality
- Intel, Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 1 (2024), ch. 14 and ch. 15, AVX, FMA, and AVX-512 programming model
- Arm, Arm Architecture Reference Manual Supplement: The Scalable Vector Extension for Armv8-A (2018), ch. 2 and ch. 3, SVE predicates and vector-length-agnostic execution
Accessible:
- Intel, Intrinsics Guide, instruction signatures, latency, and throughput tables
- Arm, Neon Programmer's Guide for Armv8-A, practical NEON data types and intrinsics
- GCC, Auto-vectorization in GCC, notes on loop forms, aliasing, and diagnostics
Next Topics
- /computationpath/cuda-mental-model
- /computationpath/tensor-cores-and-matrix-instructions
- /computationpath/cache-hierarchy-and-locality
- /computationpath/cpu-performance-roofline