
LLM Construction

GPU Compute Model

How GPUs execute ML workloads: streaming multiprocessors, warps, memory hierarchy (registers, SRAM, L2, HBM), arithmetic intensity, the roofline model, and why most ML operations are memory-bound.

Advanced · Tier 2 · Current · Supporting · ~50 min

Why This Matters

You cannot optimize what you do not understand. Most ML practitioners treat the GPU as a black box that runs matrix multiplies. This works until you need to understand why Flash Attention is fast, why small batch sizes underutilize hardware, or why fusing two operations together can yield a 2-3x speedup.

The key insight: modern GPUs have far more compute throughput than memory bandwidth. Most ML operations spend more time moving data than computing on it. Understanding this asymmetry is the foundation for every systems-level optimization in ML.

Mental Model

A GPU is a massively parallel processor optimized for throughput, not latency. It has thousands of small cores organized into groups, with a multi-level memory hierarchy. Fast memory is small; large memory is slow. The programmer's job is to keep data in fast memory as long as possible and minimize traffic to slow memory.

Execution Model

Definition

Streaming Multiprocessor

The basic compute unit on an NVIDIA GPU. Each SM contains multiple CUDA cores (execution units for arithmetic), a register file, shared memory (SRAM), and warp schedulers. An A100 has 108 SMs; an H100 has 132 SMs.

Definition

CUDA Core vs Tensor Core

CUDA Cores execute general-purpose scalar arithmetic (FP32, FP64, INT32), one operation per thread per cycle. Tensor Cores execute block matrix multiply-accumulate (BMMA / WMMA / WGMMA) on small fixed-shape tiles (for example 16x8x16 or 16x16x16 for MMA, larger tiles for Hopper WGMMA). Peak throughput differs by roughly 16x between the two: on A100 the FP32 CUDA-core peak is 19.5 TFLOP/s, while the FP16 Tensor Core peak is 312 TFLOP/s. Always specify which core type a throughput number refers to.

Definition

Warp

A group of 32 threads that execute the same instruction simultaneously (SIMT: single instruction, multiple threads). The warp is the fundamental scheduling unit. All 32 threads in a warp execute in lockstep. Divergent branches cause serialization within the warp.

Definition

Thread Block

A group of threads (up to 1024) assigned to a single SM. Threads within a block can communicate via shared memory and synchronize with barriers. Threads in different blocks cannot directly communicate.

Memory Hierarchy

From fastest to slowest:

| Level | Size (A100) | Bandwidth | Latency |
| --- | --- | --- | --- |
| Registers | 256 KB per SM | ~19 TB/s | 1 cycle |
| Shared Memory (SRAM) | 164 KB per SM | ~19 TB/s | ~20 cycles |
| L2 Cache | 40 MB | ~5 TB/s | ~200 cycles |
| HBM (Global Memory) | 80 GB | 2 TB/s | ~400 cycles |

The ratio of HBM capacity to SRAM capacity is roughly $80{,}000 / 18 \approx 4{,}400\times$ (total SRAM across all SMs is about 18 MB on A100). The bandwidth ratio between registers and HBM is roughly $10\times$. These ratios are the reason IO-aware algorithms exist.
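A quick sanity check of those ratios, using the approximate A100 figures from the table (exact totals depend on how much of each SM's SRAM is configured as shared memory):

```python
# Back-of-the-envelope check of the capacity and bandwidth ratios quoted above.
num_sms = 108                 # A100 SM count
sram_per_sm_kb = 164          # shared memory per SM (KB)
hbm_capacity_mb = 80_000      # 80 GB of HBM

sram_total_mb = num_sms * sram_per_sm_kb / 1000      # ~18 MB across all SMs
capacity_ratio = hbm_capacity_mb / sram_total_mb     # roughly 4,000-5,000x

register_bw_tb_s = 19.0       # aggregate on-chip bandwidth (order of magnitude)
hbm_bw_tb_s = 2.0
bandwidth_ratio = register_bw_tb_s / hbm_bw_tb_s     # ~10x

print(f"total SRAM ~ {sram_total_mb:.0f} MB, capacity ratio ~ {capacity_ratio:,.0f}x")
print(f"on-chip vs HBM bandwidth ratio ~ {bandwidth_ratio:.0f}x")
```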

Arithmetic Intensity and the Roofline Model

Definition

Arithmetic Intensity

The number of floating-point operations performed per byte of data transferred between HBM and the compute units. This is the single most important metric for predicting whether an operation is compute-bound or memory-bound.

Proposition

Roofline Performance Bound

Statement

The achievable throughput of any operation is bounded by:

$$\text{Throughput} \leq \min(P, \; I \cdot B) \ \text{FLOP/s}$$

where $P$ is the peak compute throughput (FLOP/s), $B$ is the memory bandwidth (bytes/s), and $I$ is the arithmetic intensity (FLOP/byte).

The crossover point $I^* = P / B$ is the ridge point. Operations with $I < I^*$ are memory-bound; operations with $I > I^*$ are compute-bound.

Intuition

If arithmetic intensity is low, the compute units finish their work before the next batch of data arrives from HBM. The GPU sits idle waiting for memory. If arithmetic intensity is high, the memory system delivers data fast enough to keep the compute units busy, and performance hits the compute ceiling.

Proof Sketch

In time $T$, the memory system transfers at most $B \cdot T$ bytes, which supports at most $I \cdot B \cdot T$ FLOPs. The compute units perform at most $P \cdot T$ FLOPs. Actual FLOPs $\leq \min(P \cdot T, \; I \cdot B \cdot T)$. Dividing by $T$ gives the throughput bound.

Why It Matters

For an A100 FP16 Tensor Core with $P = 312$ TFLOP/s and HBM bandwidth $B \approx 2$ TB/s (the canonical FlashAttention figure; the exact SKU spec is 1.555 TB/s on A100-40GB and 1.935 TB/s on A100-80GB), the ridge point is $I^* = 156$ FLOP/byte. Element-wise operations (ReLU, layer norm, softmax) have $I \approx 1$ to $4$ FLOP/byte: deeply memory-bound. Only large matrix multiplies with $\min(M, N, K) \gtrsim 156$ approach compute-bound territory.
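To make the bound concrete, here is a minimal Python sketch of the roofline using the A100 figures above as defaults; the intensity values for the example operations are rough illustrative estimates, not measurements:

```python
def roofline_tflops(intensity_flop_per_byte, peak_tflops=312.0, bandwidth_tb_s=2.0):
    """Roofline upper bound (TFLOP/s) for a kernel with the given arithmetic intensity.

    Defaults are the A100 FP16 Tensor Core figures used in the text:
    P = 312 TFLOP/s, B ~ 2 TB/s.
    """
    memory_ceiling = intensity_flop_per_byte * bandwidth_tb_s  # TB/s * FLOP/byte = TFLOP/s
    return min(peak_tflops, memory_ceiling)

ridge = 312.0 / 2.0  # I* = P / B = 156 FLOP/byte

# Intensity values below are rough estimates for illustration only.
for name, intensity in [("ReLU (element-wise)", 0.25),
                        ("softmax", 1.25),
                        ("GEMM, I ~ 64 FLOP/byte", 64.0),
                        ("GEMM, I ~ 500 FLOP/byte", 500.0)]:
    bound = roofline_tflops(intensity)
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{name:26s} bound = {bound:6.1f} TFLOP/s  ({regime})")
```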

Failure Mode

The roofline model assumes perfect overlap of compute and memory access, no cache effects, and no kernel launch overhead. Real performance is always below the roofline. The model identifies the bottleneck but does not predict exact throughput.

Why Batch Size Matters

A single matrix-vector multiply $Wx$ where $W \in \mathbb{R}^{m \times n}$ performs $2mn$ FLOPs and loads $mn + n$ elements (the matrix and vector). Arithmetic intensity: approximately $2$ FLOP/element, or about $1$ FLOP/byte in FP16. This is deeply memory-bound.

A batched matrix multiply $WX$ where $X \in \mathbb{R}^{n \times B}$ performs $2mnB$ FLOPs and loads $mn + nB$ elements. As $B$ grows, the matrix $W$ is loaded once and reused across all $B$ vectors. In FP16, arithmetic intensity approaches $B$ FLOP/byte (FLOPs $= 2mnB$, bytes $\approx 2mn$ for the weight load). Compute-bound requires $I \geq I^* = 156$ FLOP/byte on A100 FP16 Tensor Core (with the rounded 2 TB/s bandwidth figure), so the threshold is roughly $B \geq 156$. Below this, throughput scales linearly with $B$; above it, throughput saturates at the 312 TFLOP/s compute ceiling.
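A short sketch of this calculation, assuming FP16 (2 bytes/element) and the rounded A100 figures above; once input and output activation traffic is counted, the exact crossover lands slightly above $B = 156$:

```python
def matmul_intensity(m, n, batch, bytes_per_elem=2):
    """Arithmetic intensity (FLOP/byte) of W @ X with W (m x n), X (n x batch), FP16.

    FLOPs: 2*m*n*batch.  Bytes: weights plus input and output activations,
    each moved once (an idealized accounting).
    """
    flops = 2 * m * n * batch
    bytes_moved = bytes_per_elem * (m * n + n * batch + m * batch)
    return flops / bytes_moved

ridge = 312.0 / 2.0  # A100 FP16 ridge point, 156 FLOP/byte

m = n = 4096
for batch in [1, 8, 64, 156, 256, 1024]:
    i = matmul_intensity(m, n, batch)
    regime = "compute-bound" if i >= ridge else "memory-bound"
    print(f"batch={batch:5d}  intensity = {i:7.1f} FLOP/byte  ({regime})")
```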

This is why training (large batches) achieves much higher GPU utilization than inference (often batch size 1).

Ridge points across current accelerators

Ridge points assume dense FP16 (or BF16) Tensor Core peak divided by HBM peak bandwidth. All numbers are vendor-published specs for dense matmul without structured sparsity.

| Accelerator | Peak (dense) | HBM bandwidth | Ridge $I^*$ |
| --- | --- | --- | --- |
| A100 40GB SXM, FP16 | 312 TFLOP/s | 1.555 TB/s | 200 FLOP/byte |
| A100 80GB SXM, FP16 | 312 TFLOP/s | 1.935 TB/s | 161 FLOP/byte |
| A100, FP16 (2 TB/s rounded, FlashAttention paper) | 312 TFLOP/s | 2.0 TB/s | 156 FLOP/byte |
| H100 SXM, FP16/BF16 | 989 TFLOP/s | 3.35 TB/s | 295 FLOP/byte |
| H100 SXM, FP8 (E4M3, E5M2) | 1979 TFLOP/s | 3.35 TB/s | 590 FLOP/byte |
| AMD MI300X, FP16 Matrix Core | 1307 TFLOP/s | 5.3 TB/s | 247 FLOP/byte |

The H100 FP8 formats (E4M3 and E5M2) are defined in Micikevicius et al. 2022 (arXiv:2209.05433). Moving from FP16 to FP8 doubles compute but leaves HBM bandwidth unchanged, so the ridge moves right and more operations become memory-bound at the new, higher intensity threshold.
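The ridge column is just peak divided by bandwidth; a few lines of Python reproduce it from the vendor specs quoted in the table:

```python
# Ridge point I* = P / B from the specs in the table above.
# (peak TFLOP/s, HBM bandwidth TB/s) -> FLOP/byte.
accelerators = {
    "A100 40GB SXM, FP16":         (312.0, 1.555),
    "A100 80GB SXM, FP16":         (312.0, 1.935),
    "A100, FP16 (2 TB/s rounded)": (312.0, 2.0),
    "H100 SXM, FP16/BF16":         (989.0, 3.35),
    "H100 SXM, FP8":               (1979.0, 3.35),
    "AMD MI300X, FP16":            (1307.0, 5.3),
}

for name, (peak_tflops, bw_tb_s) in accelerators.items():
    ridge = peak_tflops / bw_tb_s  # TFLOP/s divided by TB/s = FLOP/byte
    print(f"{name:30s} I* ~ {ridge:5.0f} FLOP/byte")
```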

Attention arithmetic intensity

For a single attention head with sequence length $N$ and head dimension $d$, computing $S = QK^\top$ costs $2N^2 d$ FLOPs and writes an $N \times N$ score matrix ($2N^2$ bytes in FP16). If the score matrix spills to HBM and is then reread for the softmax and the $PV$ multiply, arithmetic intensity is on the order of $d$ FLOP/byte. For $d = 64$ or $d = 128$ on an A100 (ridge 156), naive attention is memory-bound regardless of how long $N$ gets. FlashAttention tiles $Q$, $K$, $V$ into SRAM-resident blocks and never materializes the full $N \times N$ matrix in HBM. The effective intensity grows with the tile size, pushing the operation past the ridge and onto the compute-bound side of the roofline.
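A rough model of that argument in Python. It is idealized: the naive case charges four HBM passes over the score/probability matrix, and the tiled case counts only the $Q$, $K$, $V$, $O$ traffic (real FlashAttention reloads $K$ and $V$ once per query block, so its traffic is somewhat higher, but effective intensity still grows with the SRAM tile size instead of being capped near $d$):

```python
def attention_intensity(seq_len, head_dim, bytes_per_elem=2, materialize_scores=True):
    """Rough arithmetic intensity (FLOP/byte) of one attention head.

    FLOPs: ~4*N^2*d for QK^T plus PV (softmax FLOPs ignored).
    Traffic: Q, K, V read and O written once; if the N x N score/probability
    matrix lives in HBM, add ~4 full passes over it (write scores, read for
    softmax, write probabilities, read for PV).  Idealized, not an exact
    accounting of any real kernel.
    """
    n, d = seq_len, head_dim
    flops = 4 * n * n * d
    io_bytes = bytes_per_elem * 4 * n * d        # Q, K, V in; O out
    if materialize_scores:
        io_bytes += bytes_per_elem * 4 * n * n   # score-matrix traffic
    return flops / io_bytes

for n in [1024, 4096, 16384]:
    naive = attention_intensity(n, 128, materialize_scores=True)
    ideal = attention_intensity(n, 128, materialize_scores=False)
    print(f"N = {n:6d}   naive ~ {naive:5.1f} FLOP/byte   fully tiled (ideal) ~ {ideal:7.1f}")
```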

Kernel Launch Overhead

Every GPU operation is submitted as a kernel: a function that runs on the GPU. Each kernel launch incurs overhead:

  • CPU-side dispatch: 5-20 microseconds
  • GPU scheduling: a few microseconds
  • Memory allocation and argument setup

For large matrix multiplies taking milliseconds, this overhead is negligible. For small element-wise operations taking microseconds, the launch overhead can dominate. This is why kernel fusion (combining multiple operations into one kernel) matters: you pay the launch cost once instead of multiple times.
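A toy cost model of why fusion pays off for small element-wise kernels. The 10 microsecond launch cost and 2 TB/s bandwidth are the rough figures from this section; real overheads depend on the driver, stream state, and whether launches are captured in a CUDA graph:

```python
def elementwise_chain_time_us(num_elems, num_kernels, launch_us=10.0,
                              bandwidth_bytes_per_s=2e12, bytes_per_elem=2):
    """Estimated wall time (microseconds) for a chain of memory-bound element-wise kernels.

    Each unfused kernel reads and writes the full tensor once and pays one
    launch.  A fused kernel pays one launch and one read+write in total.
    """
    traffic = 2 * num_elems * bytes_per_elem                  # read + write, bytes
    per_kernel_us = traffic / bandwidth_bytes_per_s * 1e6
    return num_kernels * (launch_us + per_kernel_us)

n = 4096 * 4096          # one 4096x4096 FP16 activation tensor
unfused = elementwise_chain_time_us(n, num_kernels=4)   # e.g. add, ReLU, scale, mask
fused   = elementwise_chain_time_us(n, num_kernels=1)
print(f"4 separate kernels: {unfused:6.1f} us")
print(f"1 fused kernel:     {fused:6.1f} us   (~{unfused / fused:.1f}x faster)")
```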

Common Confusions

Watch Out

More CUDA cores does not always mean faster

If an operation is memory-bound (most ML operations are), adding more compute cores does not help; the bottleneck is memory bandwidth. An A100 (~2 TB/s) and an H100 (3.35 TB/s) differ in HBM bandwidth by well under 2x, while the H100 has roughly 3x the FP16 compute, so memory-bound operations speed up far less than the compute specs suggest.

Watch Out

GPU memory size and memory bandwidth are different things

An 80 GB GPU does not read 80 GB per second. Memory size determines what fits; memory bandwidth determines how fast you can read/write. The A100 has 80 GB of HBM at 2 TB/s bandwidth. These are independent specifications that constrain different aspects of performance.

Watch Out

SRAM is not L2 cache

Shared memory (SRAM) is explicitly managed by the programmer and local to each SM. L2 cache is hardware-managed and shared across all SMs. They occupy different levels of the hierarchy with different trade-offs. Flash Attention uses shared memory, not L2 cache, because it needs explicit control over what data resides in fast memory.

Summary

  • GPUs have far more compute than memory bandwidth; most ML operations are memory-bound
  • The roofline model: throughput $\leq \min(P, I \cdot B)$ with ridge point $I^* = P/B$
  • Element-wise operations are deeply memory-bound ($I \approx 1$ to $4$); large matmuls approach compute-bound only when $\min(M, N, K) \gtrsim 156$ on A100 FP16 ($\gtrsim 295$ on H100 FP16, $\gtrsim 590$ on H100 FP8)
  • Batch size amortizes the cost of loading weight matrices, converting memory-bound operations to compute-bound (threshold $B \geq 156$ on A100 FP16)
  • Kernel launch overhead matters for small operations; fusion eliminates it

Exercises

Exercise (Core)

Problem

An A100 GPU has peak FP16 throughput of 312 TFLOP/s and HBM bandwidth of 2 TB/s. A softmax operation on a vector of length $N$ performs approximately $5N$ FLOPs (exponentiation, sum, division) and reads/writes $2N$ FP16 elements ($4N$ bytes). What is the arithmetic intensity, and is this operation compute-bound or memory-bound?

Exercise (Advanced)

Problem

You are performing inference with a model that has weight matrices of size $4096 \times 4096$ in FP16. At batch size 1, what fraction of peak A100 FP16 compute do you expect to achieve? At what batch size does the operation cross the ridge point?

References

Canonical:

  • NVIDIA CUDA Programming Guide, Chapters 2-5 (execution model and memory hierarchy)
  • Williams, Waterman, Patterson, Roofline: An Insightful Visual Performance Model (2009)

Current:

  • Dao et al., FlashAttention (2022), Section 2 (GPU memory hierarchy background; cites 156 FLOP/byte ridge for A100 FP16)
  • NVIDIA, A100 Tensor Core GPU Architecture Whitepaper (2020), Section 2 (312 TFLOP/s FP16, 1.555-1.935 TB/s HBM depending on SKU)
  • NVIDIA, H100 Tensor Core GPU Architecture Whitepaper (2022), Sections 2-3 (989 TFLOP/s FP16 dense, 1979 TFLOP/s FP8 dense, 3.35 TB/s HBM3)
  • Micikevicius et al., FP8 Formats for Deep Learning (2022), arXiv:2209.05433 (E4M3 and E5M2 definitions)
  • AMD, Instinct MI300X Product Brief (2023) (1307 TFLOP/s FP16 Matrix Core, 5.3 TB/s HBM3)

Next Topics

  • Flash Attention: the most important application of IO-aware algorithm design on GPUs
  • Fused kernels: combining operations to reduce kernel launch overhead and memory traffic

Last reviewed: April 18, 2026
