LLM Construction
GPU Compute Model
How GPUs execute ML workloads: streaming multiprocessors, warps, memory hierarchy (registers, SRAM, L2, HBM), arithmetic intensity, the roofline model, and why most ML operations are memory-bound.
Prerequisites
Why This Matters
You cannot optimize what you do not understand. Most ML practitioners treat the GPU as a black box that runs matrix multiplies. This works until you need to understand why Flash Attention is fast, why small batch sizes underutilize hardware, or why fusing two operations together can yield a 2-3x speedup.
The key insight: modern GPUs have far more compute throughput than memory bandwidth. Most ML operations spend more time moving data than computing on it. Understanding this asymmetry is the foundation for every systems-level optimization in ML.
Mental Model
A GPU is a massively parallel processor optimized for throughput, not latency. It has thousands of small cores organized into groups, with a multi-level memory hierarchy. Fast memory is small; large memory is slow. The programmer's job is to keep data in fast memory as long as possible and minimize traffic to slow memory.
Execution Model
Streaming Multiprocessor
The basic compute unit on an NVIDIA GPU. Each SM contains multiple CUDA cores (execution units for arithmetic), a register file, shared memory (SRAM), and warp schedulers. An A100 has 108 SMs; an H100 has 132 SMs.
CUDA Core vs Tensor Core
CUDA Cores execute general-purpose scalar arithmetic (FP32, FP64, INT32), one operation per thread per cycle. Tensor Cores execute block matrix multiply-accumulate (BMMA / WMMA / WGMMA) on small fixed-shape tiles (for example 16x8x16 or 16x16x16 for MMA, larger tiles for Hopper WGMMA). Roofline numbers differ by a large factor between the two: on A100 the FP32 CUDA-core peak is 19.5 TFLOP/s, while the FP16 Tensor Core peak is 312 TFLOP/s. Always specify which core a throughput number refers to.
Warp
A group of 32 threads that execute the same instruction simultaneously (SIMT: single instruction, multiple threads). The warp is the fundamental scheduling unit. All 32 threads in a warp execute in lockstep. Divergent branches cause serialization within the warp.
Thread Block
A group of threads (up to 1024) assigned to a single SM. Threads within a block can communicate via shared memory and synchronize with barriers. Threads in different blocks cannot directly communicate.
Memory Hierarchy
From fastest to slowest:
| Level | Size (A100) | Bandwidth | Latency |
|---|---|---|---|
| Registers | 256 KB per SM | ~19 TB/s | 1 cycle |
| Shared Memory (SRAM) | 164 KB per SM | ~19 TB/s | ~20 cycles |
| L2 Cache | 40 MB | ~5 TB/s | ~200 cycles |
| HBM (Global Memory) | 80 GB | 2 TB/s | ~400 cycles |
The ratio of HBM capacity to SRAM capacity is roughly 4,500:1 (total SRAM across all SMs is about 18 MB on an A100, against 80 GB of HBM). The bandwidth ratio between registers and HBM is roughly 10:1. These ratios are the reason IO-aware algorithms exist.
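As a back-of-the-envelope check, both ratios follow directly from the table above; a small Python sketch (using the approximate bandwidth figures quoted there):

```python
# Back-of-the-envelope ratios from the A100 numbers in the table above.
SMS = 108                    # streaming multiprocessors on an A100
SRAM_PER_SM_KB = 164         # shared memory per SM
HBM_GB = 80
SRAM_BW_TBPS = 19            # approximate on-chip bandwidth from the table
HBM_BW_TBPS = 2

total_sram_mb = SMS * SRAM_PER_SM_KB / 1000          # ~17.7 MB across the whole chip
capacity_ratio = HBM_GB * 1000 / total_sram_mb       # ~4,500x
bandwidth_ratio = SRAM_BW_TBPS / HBM_BW_TBPS         # ~10x

print(f"total SRAM ~{total_sram_mb:.1f} MB, "
      f"capacity ratio ~{capacity_ratio:,.0f}x, bandwidth ratio ~{bandwidth_ratio:.0f}x")
```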
Arithmetic Intensity and the Roofline Model
Arithmetic Intensity
The number of floating-point operations performed per byte of data transferred between HBM and the compute units. This is the single most important metric for predicting whether an operation is compute-bound or memory-bound.
Roofline Performance Bound
Statement
The achievable throughput of any operation is bounded by

$$\text{Throughput} \;\le\; \min\left(\pi_{\text{peak}},\; I \times \beta\right)$$

where $\pi_{\text{peak}}$ is peak compute throughput (FLOP/s), $\beta$ is HBM bandwidth (bytes/s), and $I$ is arithmetic intensity (FLOP/byte). The crossover point $I^{*} = \pi_{\text{peak}} / \beta$ is the ridge point. Operations with $I < I^{*}$ are memory-bound; operations with $I > I^{*}$ are compute-bound.
Intuition
If arithmetic intensity is low, the compute units finish their work before the next batch of data arrives from HBM. The GPU sits idle waiting for memory. If arithmetic intensity is high, the memory system delivers data fast enough to keep the compute units busy, and performance hits the compute ceiling.
Proof Sketch
In time $t$, the memory system transfers at most $\beta t$ bytes, which supports at most $I \beta t$ FLOPs. The compute units perform at most $\pi_{\text{peak}} t$ FLOPs. Actual FLOPs $\le \min(\pi_{\text{peak}} t,\; I \beta t)$. Dividing by $t$ gives the throughput bound.
Why It Matters
For an A100 FP16 Tensor Core with $\pi_{\text{peak}} = 312$ TFLOP/s and HBM bandwidth $\beta = 2$ TB/s (canonical FlashAttention figure; the exact SKU spec is 1.555 TB/s on A100-40GB and 1.935 TB/s on A100-80GB), the ridge point is $312 / 2 = 156$ FLOP/byte. Element-wise operations (ReLU, layer norm, softmax) have roughly $0.25$ to $2$ FLOP/byte: deeply memory-bound. Only large matrix multiplies with arithmetic intensity above the ridge approach compute-bound territory.
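A minimal sketch of the bound as code, using the 312 TFLOP/s and 2 TB/s figures above; the per-operation intensities are illustrative assumptions, not measured values:

```python
def roofline_tflops(intensity_flop_per_byte, peak_tflops, bw_tbps):
    """Attainable throughput (TFLOP/s) = min(peak, arithmetic intensity * bandwidth)."""
    return min(peak_tflops, intensity_flop_per_byte * bw_tbps)

PEAK, BW = 312.0, 2.0       # A100 FP16 Tensor Core peak (TFLOP/s), HBM bandwidth (TB/s)
ridge = PEAK / BW           # 156 FLOP/byte

# Illustrative intensities (FLOP/byte); rough, assumed values per the text above.
ops = {
    "relu": 0.25,                       # 1 FLOP per element, 4 bytes read+write in FP16
    "softmax": 0.75,                    # ~3 FLOPs per element, 4 bytes read+write
    "square matmul, n=4096": 4096 / 3,  # 2n^3 FLOPs over ~6n^2 bytes => n/3 FLOP/byte
}
for name, i in ops.items():
    bound = "compute" if i >= ridge else "memory"
    print(f"{name:>22}: {roofline_tflops(i, PEAK, BW):7.1f} TFLOP/s attainable ({bound}-bound)")
```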
Failure Mode
The roofline model assumes perfect overlap of compute and memory access, no cache effects, and no kernel launch overhead. Real performance is always below the roofline. The model identifies the bottleneck but does not predict exact throughput.
Why Batch Size Matters
A single matrix-vector multiply $y = Wx$ with $W \in \mathbb{R}^{n \times n}$ performs $2n^2$ FLOPs and loads $n^2 + n$ elements (the matrix and the vector). Arithmetic intensity: approximately $2$ FLOP/element, or about $1$ FLOP/byte in FP16. This is deeply memory-bound.
A batched matrix multiply $Y = WX$ with $X \in \mathbb{R}^{n \times b}$ performs $2n^2 b$ FLOPs and loads $n^2 + nb$ elements. As $b$ grows, the weight matrix is loaded once and reused across all $b$ columns. In FP16, arithmetic intensity approaches $b$ FLOP/byte (FLOPs $\approx 2n^2 b$, bytes $\approx 2n^2$ for the weight load). Compute-bound requires $\geq 200$ FLOP/byte on the A100-40GB FP16 Tensor Core, so the threshold is roughly $b \approx 200$. Below this, throughput scales linearly with $b$; above it, throughput saturates at the 312 TFLOP/s compute ceiling.
This is why training (large batches) achieves much higher GPU utilization than inference (often batch size 1).
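A small sketch of this crossover, assuming an $n \times n$ FP16 weight matrix and counting weight, activation, and output traffic (the matmul_intensity_fp16 helper below is hypothetical; it is just this arithmetic):

```python
def matmul_intensity_fp16(n, b):
    """Arithmetic intensity (FLOP/byte) of Y = W @ X with W: n x n, X: n x b, in FP16."""
    flops = 2 * n * n * b                     # one multiply-add per (i, j, k) triple
    bytes_moved = 2 * (n * n + 2 * n * b)     # load W and X, store Y; 2 bytes per element
    return flops / bytes_moved

RIDGE_A100_40GB = 200  # FLOP/byte: 312 TFLOP/s / 1.555 TB/s

n = 8192
for b in [1, 8, 64, 256, 1024]:
    i = matmul_intensity_fp16(n, b)
    regime = "compute-bound" if i >= RIDGE_A100_40GB else "memory-bound"
    # The crossover lands slightly above b ~ 200 because activation traffic also counts.
    print(f"batch {b:5d}: ~{i:6.1f} FLOP/byte -> {regime}")
```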
Ridge points across current accelerators
Ridge points assume dense FP16 (or BF16) Tensor Core peak divided by HBM peak bandwidth. All numbers are vendor-published specs for dense matmul without structured sparsity.
| Accelerator | Peak (dense) | HBM bandwidth | Ridge |
|---|---|---|---|
| A100 40GB SXM, FP16 | 312 TFLOP/s | 1.555 TB/s | 200 FLOP/byte |
| A100 80GB SXM, FP16 | 312 TFLOP/s | 1.935 TB/s | 161 FLOP/byte |
| A100, FP16 (2 TB/s rounded, FlashAttention paper) | 312 TFLOP/s | 2.0 TB/s | 156 FLOP/byte |
| H100 SXM, FP16/BF16 | 989 TFLOP/s | 3.35 TB/s | 295 FLOP/byte |
| H100 SXM, FP8 (E4M3, E5M2) | 1979 TFLOP/s | 3.35 TB/s | 590 FLOP/byte |
| AMD MI300X, FP16 Matrix Core | 1307 TFLOP/s | 5.3 TB/s | 247 FLOP/byte |
The H100 FP8 formats (E4M3 and E5M2) are defined in Micikevicius et al. 2022 (arXiv:2209.05433). Moving from FP16 to FP8 doubles compute but leaves HBM bandwidth unchanged, so the ridge moves right and more operations become memory-bound at the new, higher intensity threshold.
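A quick worked check of that shift, using the H100 row of the table above:

```python
# Same HBM bandwidth, doubled dense peak when moving FP16 -> FP8 (H100 SXM specs above).
bw_tbps = 3.35
peak_fp16_tflops, peak_fp8_tflops = 989.0, 1979.0
print(peak_fp16_tflops / bw_tbps)   # ~295 FLOP/byte ridge at FP16
print(peak_fp8_tflops / bw_tbps)    # ~590 FLOP/byte ridge at FP8
```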
Attention arithmetic intensity
For a single attention head with sequence length $N$ and head dimension $d$, computing $S = QK^{\top}$ costs $2N^2 d$ FLOPs and writes an $N \times N$ score matrix ($2N^2$ bytes in FP16). If the score matrix spills to HBM and is then reread for the softmax and the $PV$ multiply, arithmetic intensity is on the order of $d$ FLOP/byte. For $d = 64$ or $d = 128$ on an A100 (ridge 156), naive attention is memory-bound regardless of how long $N$ gets. FlashAttention tiles $Q$, $K$, $V$ into SRAM-resident blocks and never materializes the full $N \times N$ matrix in HBM. The effective intensity grows with the tile size, pushing the operation past the ridge and onto the compute-bound side of the roofline.
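A rough sketch of the two regimes, under simplified traffic assumptions (four passes over the score matrix in the naive case; only $Q$, $K$, $V$, and the output touching HBM in the idealized tiled case):

```python
def naive_attention_intensity(n, d):
    """Intensity if the N x N score matrix round-trips through HBM (FP16).
    FLOPs ~ 4*N^2*d (QK^T plus PV); traffic assumed to be ~4 passes over N^2 scores."""
    flops = 4 * n * n * d
    bytes_moved = 4 * n * n * 2       # assumption: write S, read S, write P, read P
    return flops / bytes_moved        # ~ d / 2 FLOP/byte, independent of N

def tiled_attention_intensity(n, d):
    """Idealized IO-aware case: only Q, K, V, and the output O move through HBM (FP16)."""
    flops = 4 * n * n * d
    bytes_moved = 4 * n * d * 2       # Q, K, V read once, O written once
    return flops / bytes_moved        # ~ N / 2 FLOP/byte, grows with sequence length

for n in [1024, 4096, 16384]:
    print(f"N={n:6d}: naive ~{naive_attention_intensity(n, 128):5.1f}, "
          f"tiled ~{tiled_attention_intensity(n, 128):8.1f} FLOP/byte")
```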
Kernel Launch Overhead
Every GPU operation is submitted as a kernel: a function that runs on the GPU. Each kernel launch incurs overhead:
- CPU-side dispatch: 5-20 microseconds
- GPU scheduling: a few microseconds
- Memory allocation and argument setup
For large matrix multiplies taking milliseconds, this overhead is negligible. For small element-wise operations taking microseconds, the launch overhead can dominate. This is why kernel fusion (combining multiple operations into one kernel) matters: you pay the launch cost once instead of multiple times.
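A minimal timing sketch, assuming PyTorch 2.x on a CUDA device: the eager chain below launches one small kernel per element-wise op, while torch.compile can fuse them into fewer launches. Exact numbers vary by GPU, driver, and tensor size.

```python
import time
import torch

def unfused(x):
    # Each element-wise op below launches (at least) one small kernel in eager mode.
    return (x.relu() * 1.5 + 0.5).sigmoid()

fused = torch.compile(unfused)   # PyTorch 2.x: fuses the element-wise chain into fewer kernels

x = torch.randn(1 << 20, device="cuda", dtype=torch.float16)
fused(x)                         # warm-up call, triggers compilation

def bench(fn, iters=1000):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e6   # microseconds per call

print(f"unfused: {bench(unfused):.1f} us/iter, fused: {bench(fused):.1f} us/iter")
```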
Common Confusions
More CUDA cores does not always mean faster
If an operation is memory-bound (most ML operations are), adding more compute cores does not help; the bottleneck is memory bandwidth. An H100 has roughly 3x the FP16 Tensor Core peak of an A100 (989 vs 312 TFLOP/s) but only about 1.7x the HBM bandwidth (3.35 vs 2 TB/s), so memory-bound operations speed up far less than the compute specs suggest.
GPU memory size and memory bandwidth are different things
An 80 GB GPU does not read 80 GB per second. Memory size determines what fits; memory bandwidth determines how fast you can read/write. The A100 has 80 GB of HBM at 2 TB/s bandwidth. These are independent specifications that constrain different aspects of performance.
SRAM is not L2 cache
Shared memory (SRAM) is explicitly managed by the programmer and local to each SM. L2 cache is hardware-managed and shared across all SMs. They occupy different levels of the hierarchy with different trade-offs. Flash Attention uses shared memory, not L2 cache, because it needs explicit control over what data resides in fast memory.
Summary
- GPUs have far more compute than memory bandwidth; most ML operations are memory-bound
- The roofline model: throughput $\le \min(\pi_{\text{peak}},\ I \times \beta)$, with ridge point $I^{*} = \pi_{\text{peak}} / \beta$
- Element-wise operations are deeply memory-bound (roughly $0.25$ to $2$ FLOP/byte); large matmuls approach compute-bound only when $I$ exceeds roughly 156-200 FLOP/byte on A100 FP16 (about 295 on H100 FP16, 590 on H100 FP8)
- Batch size amortizes the cost of loading weight matrices, converting memory-bound operations to compute-bound (threshold roughly $b \approx 200$ on A100-40GB FP16)
- Kernel launch overhead matters for small operations; fusion eliminates it
Exercises
Problem
An A100 GPU has peak FP16 throughput of 312 TFLOP/s and HBM bandwidth of 2 TB/s. A softmax operation on a vector of length $N$ performs approximately $3N$ FLOPs (exponentiation, sum, division) and reads/writes $2N$ FP16 elements ($4N$ bytes). What is the arithmetic intensity, and is this operation compute-bound or memory-bound?
Problem
You are performing inference with a model whose weight matrices are $n \times n$ (with $n$ in the thousands) in FP16. At batch size 1, what fraction of peak A100 FP16 compute do you expect to achieve? At what batch size does the operation cross the ridge point?
References
Canonical:
- NVIDIA CUDA Programming Guide, Chapters 2-5 (execution model and memory hierarchy)
- Williams, Waterman, Patterson, Roofline: An Insightful Visual Performance Model (2009)
Current:
- Dao et al., FlashAttention (2022), Section 2 (GPU memory hierarchy background; cites 156 FLOP/byte ridge for A100 FP16)
- NVIDIA, A100 Tensor Core GPU Architecture Whitepaper (2020), Section 2 (312 TFLOP/s FP16, 1.555-1.935 TB/s HBM depending on SKU)
- NVIDIA, H100 Tensor Core GPU Architecture Whitepaper (2022), Sections 2-3 (989 TFLOP/s FP16 dense, 1979 TFLOP/s FP8 dense, 3.35 TB/s HBM3)
- Micikevicius et al., FP8 Formats for Deep Learning (2022), arXiv:2209.05433 (E4M3 and E5M2 definitions)
- AMD, Instinct MI300X Product Brief (2023) (1307 TFLOP/s FP16 Matrix Core, 5.3 TB/s HBM3)
Next Topics
- Flash Attention: the most important application of IO-aware algorithm design on GPUs
- Fused kernels: combining operations to reduce kernel launch overhead and memory traffic
Last reviewed: April 18, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Docker and Containers for ML (layer 4 · tier 3)
- Kubernetes for ML Workloads (layer 4 · tier 3)
- Modal: Serverless GPU Platform (layer 4 · tier 3)
- ASML and Chip Manufacturing (layer 5 · tier 3)
Derived topics
- Flash Attention (layer 5 · tier 2)
- Fused Kernels (layer 5 · tier 2)
- CUDA Programming Fundamentals (layer 4 · tier 3)
- Running ML Workloads on GPUs (layer 4 · tier 3)
- AMD Competition Landscape (layer 5 · tier 3)
- +2 more on the derived-topics page