Methodology

Hardware for ML Practitioners

Practical hardware guidance for ML work: GPU memory as the real bottleneck, when local GPUs make sense, cloud options compared, and why you should not spend $5000 before knowing what you need.

Core · Tier 2 · Current · Supporting · ~35 min

Why This Matters

Your hardware choices determine what experiments you can run, how fast you iterate, and how much you spend. Bad hardware decisions waste money. Good ones let you focus on the actual ML problem.

The single most important fact: GPU memory (VRAM), not raw compute, is the bottleneck for most ML workloads. A model that does not fit in VRAM cannot run, regardless of how fast the GPU is.

Mental Model

Think of ML hardware in three tiers:

  1. Development machine: where you write code, debug, run small tests
  2. Training compute: where you run real experiments on real data
  3. Inference hardware: where you serve a trained model

Most practitioners need a good laptop for tier 1 and cloud access for tier 2. Tier 3 depends on your deployment context.

The GPU Memory Bottleneck

Definition

VRAM Bottleneck

A model with $P$ parameters in 16-bit floating point requires approximately $2P$ bytes of VRAM just to store the weights. During training with Adam, you need roughly $16P$ bytes total (weights, gradients, optimizer states). A model can only train on a GPU if the total memory requirement fits in VRAM.

Concrete numbers for a 7B parameter model at fp16:

  • Weight storage: $2 \times 7 \times 10^9$ bytes $= 14$ GB
  • Training with Adam: $16 \times 7 \times 10^9$ bytes $\approx 112$ GB

A consumer RTX 3090 has 24 GB VRAM. You can run inference on a 7B model (with quantization), but full fine-tuning requires multi-GPU setups or parameter-efficient methods like LoRA.
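As a back-of-the-envelope sketch of these two rules of thumb (the function name and structure are illustrative, and the estimate deliberately ignores activations, CUDA context, and framework overhead):

```python
# Rough VRAM estimate from the 2P (inference) and 16P (Adam training) rules above.
# Illustrative sketch only; real usage depends on activations and framework overhead.

def vram_estimate_gb(params: float, training: bool = True) -> float:
    """Approximate GB for weights, plus gradients and optimizer state if training."""
    bytes_per_param = 16.0 if training else 2.0
    return params * bytes_per_param / 1e9

p = 7e9  # 7B parameters
print(f"7B inference (fp16):   {vram_estimate_gb(p, training=False):.0f} GB")  # ~14 GB
print(f"7B full Adam training: {vram_estimate_gb(p, training=True):.0f} GB")   # ~112 GB
```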

Proposition

Training Memory Lower Bound

Statement

Training a model with $P$ parameters using Adam in mixed precision requires at least $16P$ bytes of GPU memory: $2P$ for fp16 weights, $4P$ for fp32 master weights, $4P$ for first moment, $4P$ for second moment, and $2P$ for gradients.

Intuition

Adam maintains two running averages (momentum and squared gradients) per parameter, both in fp32. Add the fp32 master weights, the fp16 working copies, and the fp16 gradients, and memory usage comes to roughly 8x the raw fp16 model size.

Proof Sketch

Count the buffers: fp16 weights ($2P$), fp32 master weights ($4P$), fp32 first moment ($4P$), fp32 second moment ($4P$), fp16 gradients ($2P$). Sum: $16P$ bytes. Activation memory adds further cost that depends on batch size and sequence length.

Why It Matters

This bound determines which models fit on which GPUs. Dividing 80 GB by $16$ bytes/parameter gives 5B as a naive upper limit, but in practice you must also leave room for activations, the CUDA context ($\approx 1$-$2$ GB), and framework overhead. A realistic A100 80 GB ceiling for full fine-tuning with standard Adam is closer to 3B parameters. Larger models require ZeRO/FSDP sharding, gradient checkpointing, 8-bit optimizers, or parameter-efficient methods like LoRA.
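Running the same arithmetic in reverse gives a rough ceiling on trainable model size for a given card; the 20 GB allowance for activations, CUDA context, and framework overhead below is an illustrative assumption, not a measured value:

```python
# Invert the 16P bound: roughly how many parameters fit on a given card?
# reserved_gb is a rough allowance for activations, CUDA context, and overhead.

def max_trainable_params(vram_gb: float, reserved_gb: float = 20.0,
                         bytes_per_param: float = 16.0) -> float:
    usable_bytes = max(vram_gb - reserved_gb, 0.0) * 1e9
    return usable_bytes / bytes_per_param

print(f"{max_trainable_params(80):.1e}")  # ~3.8e9: consistent with the ~3B ceiling above
```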

Failure Mode

This estimate ignores activation memory, which can dominate for large batch sizes and long sequences. Activation checkpointing (Chen et al. 2016) trades compute for memory by recomputing activations on the backward pass, cutting activation memory from $O(L)$ to $O(\sqrt{L})$ for an $L$-layer network at the cost of one extra forward pass. 8-bit optimizers (Dettmers et al. 2022) reduce the optimizer-state contribution from $8P$ to roughly $2P$ bytes, replacing the $16P$ total with about $10P$.
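A minimal PyTorch sketch of activation checkpointing; the toy block and tensor sizes are arbitrary placeholders:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Placeholder block: two linear layers with a GELU in between.
block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

x = torch.randn(8, 1024, requires_grad=True)

# Without checkpointing: intermediate activations inside `block` are kept for backward.
y = block(x)

# With checkpointing: those activations are discarded and recomputed during backward,
# trading roughly one extra forward pass for lower activation memory.
y_ckpt = checkpoint(block, x, use_reentrant=False)
y_ckpt.sum().backward()
```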

Memory-Reduction Techniques

Before buying more GPU, reach for the software knobs. In rough order of impact on the $16P$ Adam bound:

  • bf16 mixed precision: modern GPUs (A100, H100, 40-series) support bfloat16, which has the same exponent range as fp32 and is more stable than fp16 for large-model training. Same memory as fp16, better numerics.
  • ZeRO / FSDP sharding (Rajbhandari et al. 2020): shards optimizer states (ZeRO-1), gradients (ZeRO-2), or parameters (ZeRO-3, equivalent to PyTorch FSDP) across GPUs, cutting per-GPU memory by the world size. ZeRO-Infinity extends this to CPU and NVMe offload.
  • Gradient checkpointing: recompute activations on the backward pass instead of storing them. Cuts activation memory at the cost of one extra forward pass ($\approx 33\%$ compute overhead).
  • 8-bit optimizers (bitsandbytes Adam8bit): optimizer states in 8-bit block-quantized form, reducing Adam's $8P$ state to roughly $2P$.
  • LoRA / QLoRA (Hu et al. 2021; Dettmers et al. 2023): freeze the base weights and train small low-rank update matrices. QLoRA further quantizes the frozen base to 4-bit NF4, letting a 65B model fine-tune on a single 48 GB GPU.
  • FlashAttention (Dao et al. 2022): fused, IO-aware attention kernel that avoids materializing the $O(L^2)$ attention matrix, enabling longer contexts and faster training with no loss of accuracy.

Stack these: bf16 + FSDP + gradient checkpointing + LoRA fits most 7B-70B fine-tuning workloads on a handful of A100s.
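One hedged illustration of stacking these knobs, using Hugging Face transformers and peft; the model name, LoRA hyperparameters, and trainer arguments are placeholder assumptions, not a tested recipe:

```python
# Sketch: bf16 + gradient checkpointing + LoRA fine-tuning.
# Model name and hyperparameters are placeholders; adapt to your setup.
import torch
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",        # assumption: any 7B causal LM
    torch_dtype=torch.bfloat16,         # bf16 weights
)
model.gradient_checkpointing_enable()   # recompute activations on backward

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)     # freeze base weights, train adapters
model.print_trainable_parameters()      # typically well under 1% of total

args = TrainingArguments(
    output_dir="out",
    bf16=True,                          # bf16 mixed precision for compute
    gradient_checkpointing=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,     # simulate a larger batch without more VRAM
)
```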

Consumer GPU Comparison

Prices below are rough snapshots from 2024-2025 and drift with supply.

GPU          | VRAM     | Approximate Cost | Good For
RTX 3090     | 24 GB    | 800 USD used     | Small experiments, inference
RTX 4090     | 24 GB    | 1600 USD new     | Same VRAM, FP8 via transformer-engine
A100 (cloud) | 40/80 GB | 1-2 USD/hr spot  | Serious training, bf16 + NVLink
H100 (cloud) | 80 GB    | 2-4 USD/hr spot  | Large-scale training, FP8

The 3090 and 4090 have identical VRAM, but the 4090 (Ada Lovelace) adds FP8 support usable via NVIDIA's transformer-engine, which can roughly double effective throughput on supported layers. VRAM is still the hard constraint on what models you can fit; compute determines how fast you iterate. Cloud prices are on-demand/spot rates and exclude egress, storage, and reserved-capacity discounts.

Throughput and Utilization

A GPU rated for 312 TFLOPS of bf16 throughput rarely sees anywhere near that in practice. Model FLOPs Utilization (MFU), introduced in the PaLM paper (Chowdhery et al. 2022), measures the ratio of observed training throughput to the hardware peak. Well-tuned large-model training runs typically hit 40-55% MFU; poorly configured runs sit below 20%. If your MFU is low, the bottleneck is software (kernels, communication, data loading), not hardware.
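A rough MFU calculation, using the common approximation of about 6 FLOPs per parameter per token for the combined forward and backward pass; the throughput and peak numbers below are illustrative assumptions:

```python
# Back-of-the-envelope MFU: achieved training FLOPs/s divided by hardware peak.
# Uses the ~6 * params FLOPs-per-token approximation for forward + backward;
# the throughput and peak values here are made-up examples.

def mfu(params: float, tokens_per_second: float, peak_flops: float) -> float:
    achieved_flops_per_s = 6.0 * params * tokens_per_second
    return achieved_flops_per_s / peak_flops

# Example: a 7B model at 1,500 tokens/s on an A100 (312 TFLOPS bf16 peak).
print(f"{mfu(7e9, 1_500, 312e12):.0%}")  # ~20% -> likely a software bottleneck
```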

When Local GPU Makes Sense

Local hardware is justified when:

  • You run many small experiments daily (the per-hour cloud cost adds up)
  • You have privacy requirements that prohibit cloud uploads
  • You primarily do inference on models that fit in 24 GB
  • You need low-latency iteration cycles (no cloud startup overhead)

Local hardware is not justified when:

  • You do not yet know what workloads you will run
  • You need more than 24 GB VRAM (multi-GPU consumer setups are fragile)
  • You train infrequently (a few times per month)
  • You need to scale up and down based on project demands

Cloud Options

Colab (Google): Free tier gives a T4 (16 GB VRAM) with time limits. Good for learning, bad for real work. Pro tier (10 USD/month) gives better GPUs but still has session limits.

RunPod: On-demand GPU instances. Good prices for A100 and H100. No commitment. Pay by the hour. Good for individual practitioners.

Lambda Cloud: Reserved instances at competitive rates. Better for teams that need consistent capacity.

AWS/GCP/Azure: Enterprise options with the full ecosystem (storage, networking, MLOps tools). Higher prices, more overhead, better for organizations with existing cloud infrastructure.

Operating System

Linux is the default for serious ML work. CUDA, PyTorch, and most ML tooling are Linux-first. Docker containers assume Linux. If you use cloud GPUs, you are using Linux.

For local development, macOS (especially Apple Silicon) works well. Apple's M-series chips have unified memory shared between CPU and GPU, which means a MacBook Pro with 36 GB RAM can run inference on models up to about 30B parameters (quantized). This is excellent for development, prototyping, and running local LLMs.
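A quick sanity check on that claim; the 4-bit quantization figure and the overhead allowance are rough assumptions:

```python
# Rough check: does a quantized 30B model fit in 36 GB of unified memory?
# 4-bit quantization ~= 0.5 bytes per parameter; overhead allowance is a guess.
params = 30e9
weights_gb = params * 0.5 / 1e9               # ~15 GB of quantized weights
kv_cache_and_overhead_gb = 6.0                # assumption: context cache + runtime
print(weights_gb + kv_cache_and_overhead_gb)  # ~21 GB, within a 36 GB machine
```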

Windows works but creates friction: path handling, environment management, and CUDA configuration are all harder than on Linux or macOS.

Common Confusions

Watch Out

More FLOPS does not mean you can train larger models

A GPU with higher compute throughput but less VRAM cannot train models that do not fit in memory. The RTX 4090 is about 2x faster than the 3090 in raw FLOPS, but both have 24 GB VRAM. They can train the same models; the 4090 just finishes faster.

Watch Out

Apple Silicon is not a replacement for NVIDIA GPUs

Apple M-series chips run PyTorch via the MPS backend, but training is significantly slower than CUDA on NVIDIA GPUs. Apple Silicon is excellent for inference and development, not for large-scale training.

Exercises

ExerciseCore

Problem

A model has 3B parameters. You want to fine-tune it with Adam in mixed precision. How much VRAM do you need? Can you do this on an RTX 3090?

ExerciseCore

Problem

You are a solo researcher who runs 2-3 training jobs per week, each taking 4-6 hours on an A100. Should you buy a local GPU or use cloud compute? Estimate monthly costs for each option.

References

Canonical:

  • NVIDIA CUDA Programming Guide, Memory Management sections
  • Rajbhandari et al., "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models" (2020), Section 2
  • Chen et al., "Training Deep Nets with Sublinear Memory Cost" (2016), arXiv:1604.06174 — gradient checkpointing

Current:

  • Hugging Face, "Model Memory Anatomy" documentation (2024)
  • Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (NeurIPS 2022)
  • Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (ICLR 2022)
  • Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs" (NeurIPS 2023)
  • Dettmers et al., "8-bit Optimizers via Block-wise Quantization" (ICLR 2022)
  • Chowdhery et al., "PaLM: Scaling Language Modeling with Pathways" (2022), Section 4 on MFU
  • Lambda Labs GPU Benchmarks (updated regularly)


Last reviewed: April 18, 2026
