
LLM Construction

GPU Compute Model

How GPUs execute ML workloads: streaming multiprocessors, warps, memory hierarchy (registers, SRAM, L2, HBM), arithmetic intensity, the roofline model, and why most ML operations are memory-bound.

Advanced · Tier 2 · Current · Supporting · ~50 min

Why This Matters

You cannot optimize what you do not understand. Most ML practitioners treat the GPU as a black box that runs matrix multiplies. This works until you need to understand why Flash Attention is fast, why small batch sizes underutilize hardware, or why fusing two operations together can yield a 2-3x speedup.

The key insight: modern GPUs have far more compute throughput than memory bandwidth. Most ML operations spend more time moving data than computing on it. Understanding this asymmetry is the foundation for every systems-level optimization in ML.

Mental Model

A GPU is a massively parallel processor optimized for throughput, not latency. It has thousands of small cores organized into groups, with a multi-level memory hierarchy. Fast memory is small; large memory is slow. The programmer's job is to keep data in fast memory as long as possible and minimize traffic to slow memory.

Execution Model

Definition

Streaming Multiprocessor

The basic compute unit on an NVIDIA GPU. Each SM contains multiple CUDA cores (execution units for arithmetic), a register file, shared memory (SRAM), and warp schedulers. An A100 has 108 SMs; an H100 has 132 SMs.

Definition

CUDA Core vs Tensor Core

CUDA Cores execute general-purpose scalar arithmetic (FP32, FP64, INT32), one operation per thread per cycle. Tensor Cores execute block matrix multiply-accumulate (BMMA / WMMA / WGMMA) on small fixed-shape tiles (for example 16x8x16 or 16x16x16 for MMA, larger tiles for Hopper WGMMA). Peak throughput differs by roughly 16x between the two: on A100 the FP32 CUDA-core peak is 19.5 TFLOP/s, while the FP16 Tensor Core peak is 312 TFLOP/s. Always specify which core type a throughput number refers to.

Definition

Warp

A group of 32 threads that execute the same instruction simultaneously (SIMT: single instruction, multiple threads). The warp is the fundamental scheduling unit. All 32 threads in a warp execute in lockstep. Divergent branches cause serialization within the warp.

Definition

Thread Block

A group of threads (up to 1024) assigned to a single SM. Threads within a block can communicate via shared memory and synchronize with barriers. Threads in different blocks cannot directly communicate.

Memory Hierarchy

From fastest to slowest:

| Level | Size (A100) | Bandwidth | Latency |
| --- | --- | --- | --- |
| Registers | 256 KB per SM | ~19 TB/s | 1 cycle |
| Shared Memory (SRAM) | 164 KB per SM | ~19 TB/s | ~20 cycles |
| L2 Cache | 40 MB | ~5 TB/s | ~200 cycles |
| HBM (Global Memory) | 80 GB | 2 TB/s | ~400 cycles |

The ratio of HBM capacity to SRAM capacity is roughly $80{,}000 / 18 \approx 4{,}400\times$ (total SRAM across all SMs is about 18 MB on A100). The bandwidth ratio between registers and HBM is roughly $10\times$. These ratios are the reason IO-aware algorithms exist.
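A quick sanity check of those ratios, using the approximate A100 figures from the table (exact totals depend on how much of each SM's SRAM is configured as shared memory):

```python
# Back-of-the-envelope check of the capacity and bandwidth ratios quoted above.
num_sms = 108                 # A100 SM count
sram_per_sm_kb = 164          # shared memory per SM (KB)
hbm_capacity_mb = 80_000      # 80 GB of HBM

sram_total_mb = num_sms * sram_per_sm_kb / 1000      # ~18 MB across all SMs
capacity_ratio = hbm_capacity_mb / sram_total_mb     # roughly 4,000-5,000x

register_bw_tb_s = 19.0       # aggregate on-chip bandwidth (order of magnitude)
hbm_bw_tb_s = 2.0
bandwidth_ratio = register_bw_tb_s / hbm_bw_tb_s     # ~10x

print(f"total SRAM ~ {sram_total_mb:.0f} MB, capacity ratio ~ {capacity_ratio:,.0f}x")
print(f"on-chip vs HBM bandwidth ratio ~ {bandwidth_ratio:.0f}x")
```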

Arithmetic Intensity and the Roofline Model

Definition

Arithmetic Intensity

The number of floating-point operations performed per byte of data transferred between HBM and the compute units. This is the single most important metric for predicting whether an operation is compute-bound or memory-bound.

Proposition

Roofline Performance Bound

Statement

The achievable throughput of any operation is bounded by:

$$\text{Throughput} \leq \min(P, \; I \cdot B) \ \text{FLOP/s}$$

where $P$ is the peak compute throughput (FLOP/s), $B$ is the memory bandwidth (bytes/s), and $I$ is the arithmetic intensity (FLOP/byte).

The crossover point $I^* = P / B$ is the ridge point. Operations with $I < I^*$ are memory-bound; operations with $I > I^*$ are compute-bound.

Intuition

If arithmetic intensity is low, the compute units finish their work before the next batch of data arrives from HBM. The GPU sits idle waiting for memory. If arithmetic intensity is high, the memory system delivers data fast enough to keep the compute units busy, and performance hits the compute ceiling.

Proof Sketch

In time $T$, the memory system transfers at most $B \cdot T$ bytes, which supports at most $I \cdot B \cdot T$ FLOPs. The compute units perform at most $P \cdot T$ FLOPs. Actual FLOPs $\leq \min(P \cdot T, \; I \cdot B \cdot T)$. Dividing by $T$ gives the throughput bound.

Why It Matters

For an A100 FP16 Tensor Core with $P = 312$ TFLOP/s and HBM bandwidth $B \approx 2$ TB/s (the canonical FlashAttention figure; the exact SKU spec is 1.555 TB/s on A100-40GB and 1.935 TB/s on A100-80GB), the ridge point is $I^* = 156$ FLOP/byte. Element-wise operations (ReLU, layer norm, softmax) have $I \approx 1$ to $4$ FLOP/byte: deeply memory-bound. Only large matrix multiplies with $\min(M, N, K) \gtrsim 156$ approach compute-bound territory.
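To make the bound concrete, here is a minimal Python sketch of the roofline using the A100 figures above as defaults; the intensity values for the example operations are rough illustrative estimates, not measurements:

```python
def roofline_tflops(intensity_flop_per_byte, peak_tflops=312.0, bandwidth_tb_s=2.0):
    """Roofline upper bound (TFLOP/s) for a kernel with the given arithmetic intensity.

    Defaults are the A100 FP16 Tensor Core figures used in the text:
    P = 312 TFLOP/s, B ~ 2 TB/s.
    """
    memory_ceiling = intensity_flop_per_byte * bandwidth_tb_s  # TB/s * FLOP/byte = TFLOP/s
    return min(peak_tflops, memory_ceiling)

ridge = 312.0 / 2.0  # I* = P / B = 156 FLOP/byte

# Intensity values below are rough estimates for illustration only.
for name, intensity in [("ReLU (element-wise)", 0.25),
                        ("softmax", 1.25),
                        ("GEMM, I ~ 64 FLOP/byte", 64.0),
                        ("GEMM, I ~ 500 FLOP/byte", 500.0)]:
    bound = roofline_tflops(intensity)
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{name:26s} bound = {bound:6.1f} TFLOP/s  ({regime})")
```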

Failure Mode

The roofline model assumes perfect overlap of compute and memory access, no cache effects, and no kernel launch overhead. Real performance is always below the roofline. The model identifies the bottleneck but does not predict exact throughput.

Why Batch Size Matters

A single matrix-vector multiply $Wx$ where $W \in \mathbb{R}^{m \times n}$ performs $2mn$ FLOPs and loads $mn + n$ elements (the matrix and vector). Arithmetic intensity: approximately $2$ FLOP/element, or about $1$ FLOP/byte in FP16. This is deeply memory-bound.

A batched matrix multiply $WX$ where $X \in \mathbb{R}^{n \times B}$ performs $2mnB$ FLOPs and loads $mn + nB$ elements. As $B$ grows, the matrix $W$ is loaded once and reused across all $B$ vectors. In FP16, arithmetic intensity approaches $B$ FLOP/byte (FLOPs $= 2mnB$, bytes $\approx 2mn$ for the weight load). Compute-bound requires $I \geq I^* = 156$ FLOP/byte on A100 FP16 Tensor Core (with the rounded 2 TB/s bandwidth figure), so the threshold is roughly $B \geq 156$. Below this, throughput scales linearly with $B$; above it, throughput saturates at the 312 TFLOP/s compute ceiling.
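A short sketch of this calculation, assuming FP16 (2 bytes/element) and the rounded A100 figures above; once input and output activation traffic is counted, the exact crossover lands slightly above $B = 156$:

```python
def matmul_intensity(m, n, batch, bytes_per_elem=2):
    """Arithmetic intensity (FLOP/byte) of W @ X with W (m x n), X (n x batch), FP16.

    FLOPs: 2*m*n*batch.  Bytes: weights plus input and output activations,
    each moved once (an idealized accounting).
    """
    flops = 2 * m * n * batch
    bytes_moved = bytes_per_elem * (m * n + n * batch + m * batch)
    return flops / bytes_moved

ridge = 312.0 / 2.0  # A100 FP16 ridge point, 156 FLOP/byte

m = n = 4096
for batch in [1, 8, 64, 156, 256, 1024]:
    i = matmul_intensity(m, n, batch)
    regime = "compute-bound" if i >= ridge else "memory-bound"
    print(f"batch={batch:5d}  intensity = {i:7.1f} FLOP/byte  ({regime})")
```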

This is why training (large batches) achieves much higher GPU utilization than inference (often batch size 1).

Ridge points across current accelerators

Ridge points assume dense FP16 (or BF16) Tensor Core peak divided by HBM peak bandwidth. All numbers are vendor-published specs for dense matmul without structured sparsity.

| Accelerator | Peak (dense) | HBM bandwidth | Ridge $I^*$ |
| --- | --- | --- | --- |
| A100 40GB SXM, FP16 | 312 TFLOP/s | 1.555 TB/s | 200 FLOP/byte |
| A100 80GB SXM, FP16 | 312 TFLOP/s | 1.935 TB/s | 161 FLOP/byte |
| A100, FP16 (2 TB/s rounded, FlashAttention paper) | 312 TFLOP/s | 2.0 TB/s | 156 FLOP/byte |
| H100 SXM, FP16/BF16 | 989 TFLOP/s | 3.35 TB/s | 295 FLOP/byte |
| H100 SXM, FP8 (E4M3, E5M2) | 1979 TFLOP/s | 3.35 TB/s | 590 FLOP/byte |
| AMD MI300X, FP16 Matrix Core | 1307 TFLOP/s | 5.3 TB/s | 247 FLOP/byte |

The H100 FP8 formats (E4M3 and E5M2) are defined in Micikevicius et al. 2022 (arXiv:2209.05433). Moving from FP16 to FP8 doubles compute but leaves HBM bandwidth unchanged, so the ridge moves right and more operations become memory-bound at the new, higher intensity threshold.
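The ridge column is just peak divided by bandwidth; a few lines of Python reproduce it from the vendor specs quoted in the table:

```python
# Ridge point I* = P / B from the specs in the table above.
# (peak TFLOP/s, HBM bandwidth TB/s) -> FLOP/byte.
accelerators = {
    "A100 40GB SXM, FP16":         (312.0, 1.555),
    "A100 80GB SXM, FP16":         (312.0, 1.935),
    "A100, FP16 (2 TB/s rounded)": (312.0, 2.0),
    "H100 SXM, FP16/BF16":         (989.0, 3.35),
    "H100 SXM, FP8":               (1979.0, 3.35),
    "AMD MI300X, FP16":            (1307.0, 5.3),
}

for name, (peak_tflops, bw_tb_s) in accelerators.items():
    ridge = peak_tflops / bw_tb_s  # TFLOP/s divided by TB/s = FLOP/byte
    print(f"{name:30s} I* ~ {ridge:5.0f} FLOP/byte")
```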

Attention arithmetic intensity

For a single attention head with sequence length $N$ and head dimension $d$, computing $S = QK^\top$ costs $2N^2 d$ FLOPs and writes an $N \times N$ score matrix ($2N^2$ bytes in FP16). If the score matrix spills to HBM and is then reread for the softmax and the $PV$ multiply, arithmetic intensity is on the order of $d$ FLOP/byte. For $d = 64$ or $d = 128$ on an A100 (ridge 156), naive attention is memory-bound regardless of how long $N$ gets. FlashAttention tiles $Q$, $K$, $V$ into SRAM-resident blocks and never materializes the full $N \times N$ matrix in HBM. The effective intensity grows with the tile size, pushing the operation past the ridge and onto the compute-bound side of the roofline.
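A rough model of that argument in Python. It is idealized: the naive case charges four HBM passes over the score/probability matrix, and the tiled case counts only the $Q$, $K$, $V$, $O$ traffic (real FlashAttention reloads $K$ and $V$ once per query block, so its traffic is somewhat higher, but effective intensity still grows with the SRAM tile size instead of being capped near $d$):

```python
def attention_intensity(seq_len, head_dim, bytes_per_elem=2, materialize_scores=True):
    """Rough arithmetic intensity (FLOP/byte) of one attention head.

    FLOPs: ~4*N^2*d for QK^T plus PV (softmax FLOPs ignored).
    Traffic: Q, K, V read and O written once; if the N x N score/probability
    matrix lives in HBM, add ~4 full passes over it (write scores, read for
    softmax, write probabilities, read for PV).  Idealized, not an exact
    accounting of any real kernel.
    """
    n, d = seq_len, head_dim
    flops = 4 * n * n * d
    io_bytes = bytes_per_elem * 4 * n * d        # Q, K, V in; O out
    if materialize_scores:
        io_bytes += bytes_per_elem * 4 * n * n   # score-matrix traffic
    return flops / io_bytes

for n in [1024, 4096, 16384]:
    naive = attention_intensity(n, 128, materialize_scores=True)
    ideal = attention_intensity(n, 128, materialize_scores=False)
    print(f"N = {n:6d}   naive ~ {naive:5.1f} FLOP/byte   fully tiled (ideal) ~ {ideal:7.1f}")
```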

Kernel Launch Overhead

Every GPU operation is submitted as a kernel: a function that runs on the GPU. Each kernel launch incurs overhead:

  • CPU-side dispatch: 5-20 microseconds
  • GPU scheduling: a few microseconds
  • Memory allocation and argument setup

For large matrix multiplies taking milliseconds, this overhead is negligible. For small element-wise operations taking microseconds, the launch overhead can dominate. This is why kernel fusion (combining multiple operations into one kernel) matters: you pay the launch cost once instead of multiple times.
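A toy cost model of why fusion pays off for small element-wise kernels. The 10 microsecond launch cost and 2 TB/s bandwidth are the rough figures from this section; real overheads depend on the driver, stream state, and whether launches are captured in a CUDA graph:

```python
def elementwise_chain_time_us(num_elems, num_kernels, launch_us=10.0,
                              bandwidth_bytes_per_s=2e12, bytes_per_elem=2):
    """Estimated wall time (microseconds) for a chain of memory-bound element-wise kernels.

    Each unfused kernel reads and writes the full tensor once and pays one
    launch.  A fused kernel pays one launch and one read+write in total.
    """
    traffic = 2 * num_elems * bytes_per_elem                  # read + write, bytes
    per_kernel_us = traffic / bandwidth_bytes_per_s * 1e6
    return num_kernels * (launch_us + per_kernel_us)

n = 4096 * 4096          # one 4096x4096 FP16 activation tensor
unfused = elementwise_chain_time_us(n, num_kernels=4)   # e.g. add, ReLU, scale, mask
fused   = elementwise_chain_time_us(n, num_kernels=1)
print(f"4 separate kernels: {unfused:6.1f} us")
print(f"1 fused kernel:     {fused:6.1f} us   (~{unfused / fused:.1f}x faster)")
```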

Common Confusions

Watch Out

More CUDA cores does not always mean faster

If an operation is memory-bound (most ML operations are), adding more compute cores does not help; the bottleneck is memory bandwidth. An A100 (~2 TB/s) and an H100 (3.35 TB/s) differ in HBM bandwidth by well under 2x, while the H100 has roughly 3x the FP16 compute, so memory-bound operations speed up far less than the compute specs suggest.

Watch Out

GPU memory size and memory bandwidth are different things

An 80 GB GPU does not read 80 GB per second. Memory size determines what fits; memory bandwidth determines how fast you can read/write. The A100 has 80 GB of HBM at 2 TB/s bandwidth. These are independent specifications that constrain different aspects of performance.

Watch Out

SRAM is not L2 cache

Shared memory (SRAM) is explicitly managed by the programmer and local to each SM. L2 cache is hardware-managed and shared across all SMs. They occupy different levels of the hierarchy with different trade-offs. Flash Attention uses shared memory, not L2 cache, because it needs explicit control over what data resides in fast memory.

Summary

  • GPUs have far more compute than memory bandwidth; most ML operations are memory-bound
  • The roofline model: throughput $\leq \min(P, I \cdot B)$ with ridge point $I^* = P/B$
  • Element-wise operations are deeply memory-bound ($I \approx 1$ to $4$); large matmuls approach compute-bound only when $\min(M, N, K) \gtrsim 156$ on A100 FP16 ($\gtrsim 295$ on H100 FP16, $\gtrsim 590$ on H100 FP8)
  • Batch size amortizes the cost of loading weight matrices, converting memory-bound operations to compute-bound (threshold $B \geq 156$ on A100 FP16)
  • Kernel launch overhead matters for small operations; fusion eliminates it

Exercises

Exercise (Core)

Problem

An A100 GPU has peak FP16 throughput of 312 TFLOP/s and HBM bandwidth of 2 TB/s. A softmax operation on a vector of length $N$ performs approximately $5N$ FLOPs (exponentiation, sum, division) and reads/writes $2N$ FP16 elements ($4N$ bytes). What is the arithmetic intensity, and is this operation compute-bound or memory-bound?

Exercise (Advanced)

Problem

You are performing inference with a model that has weight matrices of size $4096 \times 4096$ in FP16. At batch size 1, what fraction of peak A100 FP16 compute do you expect to achieve? At what batch size does the operation cross the ridge point?

References

Canonical:

  • NVIDIA CUDA Programming Guide, Chapters 2-5 (execution model and memory hierarchy)
  • Williams, Waterman, Patterson, Roofline: An Insightful Visual Performance Model (2009)

Current:

  • Dao et al., FlashAttention (2022), Section 2 (GPU memory hierarchy background; cites 156 FLOP/byte ridge for A100 FP16)
  • NVIDIA, A100 Tensor Core GPU Architecture Whitepaper (2020), Section 2 (312 TFLOP/s FP16, 1.555-1.935 TB/s HBM depending on SKU)
  • NVIDIA, H100 Tensor Core GPU Architecture Whitepaper (2022), Sections 2-3 (989 TFLOP/s FP16 dense, 1979 TFLOP/s FP8 dense, 3.35 TB/s HBM3)
  • Micikevicius et al., FP8 Formats for Deep Learning (2022), arXiv:2209.05433 (E4M3 and E5M2 definitions)
  • AMD, Instinct MI300X Product Brief (2023) (1307 TFLOP/s FP16 Matrix Core, 5.3 TB/s HBM3)

Next Topics

  • Flash Attention: the most important application of IO-aware algorithm design on GPUs
  • Fused kernels: combining operations to reduce kernel launch overhead and memory traffic

Last reviewed: April 18, 2026
