AMD Competition Landscape
AMD's MI300X and MI325X GPUs compete with NVIDIA on memory bandwidth and capacity but lag on software ecosystem. Competition matters because pricing, supply diversity, and vendor lock-in determine who can train and serve models.
Why This Matters
NVIDIA's datacenter accelerator share is widely reported to exceed 80% in 2024 (Jon Peddie Research and Omdia quarterly trackers report figures in the 80-90% range depending on segment definition), a position built on the GPU compute model. This concentration gives NVIDIA pricing power and creates supply bottlenecks. AMD is the most visible direct competitor for data center AI compute, but it is not the only one. Whether any of these competitors can credibly erode NVIDIA's position affects GPU prices, supply availability, and the risk of vendor lock-in for anyone training or serving large models. The underlying chip supply chain is detailed in ASML and chip manufacturing.
This is not a question of which chip is "better" in isolation. It is a question of market structure and its consequences for AI development. AMD is the primary focus of this page because it is the closest general-purpose alternative, but the 2024 to 2026 landscape also includes Intel, Google, and several specialized accelerator startups. They are covered briefly below.
Mental Model
GPU competition in AI comes down to three factors, roughly in order of importance:
- Software ecosystem: Can existing code run on the hardware with minimal changes?
- Memory capacity and bandwidth: How large a model can you serve, and how fast?
- Compute throughput: Raw FLOPS for matrix multiplications.
NVIDIA leads decisively on (1), AMD is competitive on (2), and both are competitive on (3). The software gap is the binding constraint.
Hardware Comparison
MI300X Specifications
AMD Instinct MI300X (launched late 2023):
- 192GB HBM3 memory (vs. 80GB on H100 SXM)
- 5.3 TB/s memory bandwidth (vs. 3.35 TB/s on H100)
- 1307 TFLOPS BF16 peak (vs. 990 TFLOPS on H100)
- 750W TDP
- CDNA 3 architecture, chiplet design with 8 XCDs
MI325X Specifications
AMD Instinct MI325X (launched 2024):
- 256GB HBM3E memory
- 6.0 TB/s memory bandwidth
- Similar compute to MI300X with architectural refinements
- Targets inference workloads where memory capacity is the bottleneck
The MI300X has 2.4x the memory capacity and 1.6x the memory bandwidth of the H100. For inference of large models where the bottleneck is loading weights from HBM (memory-bandwidth-bound regime), more memory bandwidth directly translates to higher throughput.
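A back-of-envelope sketch makes the capacity comparison concrete. The snippet below is illustrative only: the capacities are the vendor-reported figures quoted on this page, and real deployments also need headroom for KV cache and activations beyond the weights.

```python
import math

# Vendor-reported HBM capacities (GB) quoted elsewhere on this page.
HBM_CAPACITY_GB = {"H100 SXM": 80, "H200": 141, "MI300X": 192, "MI325X": 256}

def weight_footprint_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Weight storage in GB; BF16/FP16 use 2 bytes per parameter."""
    return n_params * bytes_per_param / 1e9

def gpus_needed(n_params: float, gpu: str, bytes_per_param: int = 2) -> int:
    """Minimum GPU count to hold the weights alone (ignores KV cache and activations)."""
    return math.ceil(weight_footprint_gb(n_params, bytes_per_param) / HBM_CAPACITY_GB[gpu])

if __name__ == "__main__":
    for gpu in HBM_CAPACITY_GB:
        # 70B parameters in BF16 -> 140 GB of weights
        print(f"{gpu}: {gpus_needed(70e9, gpu)} GPU(s) for a 70B BF16 model")
```

Running this reproduces the datapoint below: a 70B BF16 model needs two H100s but fits on a single H200, MI300X, or MI325X.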
Benchmark datapoints (vendor-reported)
The clearest head-to-head numbers come from vendor marketing and must be read as such. A few concrete datapoints:
- Memory capacity per GPU: MI300X has 192 GB HBM3, H100 SXM has 80 GB HBM3, H200 has 141 GB HBM3e. A 70B model in BF16 (140 GB) fits on a single MI300X or H200 but requires two H100 GPUs with tensor parallelism.
- Llama 3 70B inference (AMD-reported): AMD's ROCm 6.2 release notes (late 2024) claim MI300X matches or exceeds H100 throughput on Llama 3 70B decoding at small batch sizes, narrowing the gap from roughly 30% behind (under ROCm 6.0) to rough parity. These are AMD's numbers; independent MLPerf Inference 2024 results showed a narrower gap but still favored NVIDIA's H100, especially at large batch sizes.
- FP8 support: Both MI300X and H100 support FP8, but NVIDIA's Transformer Engine has more mature mixed-precision training recipes. AMD's FP8 support shipped later and is less battle-tested.
Treat all vendor benchmark claims skeptically. MLPerf submissions and third-party measurements (SemiAnalysis, ArtificialAnalysis) are more trustworthy than press releases.
The Roofline Perspective
Memory-Bandwidth-Bound Regime
Statement
For a workload with arithmetic intensity $I$ (FLOPs per byte of memory accessed), the achievable throughput is:

$$P = \min\!\left(P_{\text{peak}},\; I \cdot B\right)$$

where $B$ is memory bandwidth and $P_{\text{peak}}$ is peak compute throughput. When $I < P_{\text{peak}}/B$ (the ridge point), the workload is memory-bandwidth-bound and throughput scales linearly with bandwidth.
Intuition
LLM inference at small batch sizes is memory-bandwidth-bound (a pattern that interacts with scaling laws as models grow). Each token generation requires loading the entire model's weights from HBM. For a 70B parameter model in FP16 (140GB), generating one token loads 140GB from memory. If batch size is 1, the arithmetic intensity is roughly 1 FLOP/byte (one multiply-add per weight loaded). This is far below the ridge point of modern GPUs (roughly 150-300 FLOPS/byte), so throughput is entirely determined by memory bandwidth.
In this regime, the MI300X's 5.3 TB/s bandwidth gives it a direct advantage over the H100's 3.35 TB/s for single-stream inference latency.
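The roofline statement reduces to a one-line formula. A minimal sketch, using the vendor-reported peak BF16 and bandwidth figures quoted earlier (the numbers and the script are illustrative, not measurements):

```python
# Roofline: attainable throughput = min(peak compute, arithmetic intensity * bandwidth).
GPUS = {
    # name: (peak BF16 TFLOPS, memory bandwidth in TB/s), vendor-reported
    "H100 SXM": (990, 3.35),
    "MI300X": (1307, 5.3),
}

def attainable_tflops(intensity_flops_per_byte: float, peak_tflops: float, bw_tbps: float) -> float:
    """Throughput is capped by compute or by memory traffic, whichever binds first."""
    return min(peak_tflops, intensity_flops_per_byte * bw_tbps)

for name, (peak, bw) in GPUS.items():
    ridge = peak / bw  # FLOPs per byte needed before compute becomes the limit
    low_batch = attainable_tflops(1.0, peak, bw)  # decode-like intensity of ~1 FLOP/byte
    print(f"{name}: ridge ~{ridge:.0f} FLOPs/byte, at intensity 1 -> {low_batch:.2f} TFLOPS")
```

At an intensity of 1 FLOP/byte, both GPUs deliver only a few TFLOPS of their multi-hundred-TFLOP peaks, which is why bandwidth, not compute, decides single-stream decode speed.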
Failure Mode
At large batch sizes, inference becomes compute-bound (arithmetic intensity increases because the same weights serve multiple sequences). In the compute-bound regime, raw FLOPS matter more than bandwidth, and NVIDIA's mature tensor core architecture and higher effective utilization (due to better software) often close or reverse the hardware advantage.
The Software Gap: CUDA vs. ROCm
The most important difference between AMD and NVIDIA for AI workloads is software.
CUDA ecosystem advantages:
- 15+ years of libraries, tooling, and community knowledge
- cuDNN, cuBLAS, TensorRT, NCCL are heavily optimized and battle-tested
- Nearly all ML frameworks (PyTorch, JAX, TensorFlow) were developed CUDA-first
- Third-party libraries (FlashAttention, vLLM, TensorRT-LLM) often launch CUDA-only
- Profiling tools (Nsight, nvprof) are mature
ROCm ecosystem status:
- PyTorch has official ROCm support; most standard training loops work
- HIP is AMD's CUDA-like programming API; the hipify tools mechanically translate most CUDA code
- Custom CUDA kernels (FlashAttention, fused operations) require porting effort
- Multi-GPU communication (RCCL vs. NCCL) is functional but less optimized
- Profiling and debugging tools are less mature
- Library coverage is narrower: not all CUDA libraries have ROCm equivalents
The practical consequence: running a standard PyTorch training loop on AMD GPUs works. Running a highly optimized inference stack (custom attention kernels, quantized operations, speculative decoding) requires significant engineering effort to port and tune.
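A concrete illustration of the "standard loops work" claim: ROCm builds of PyTorch reuse the torch.cuda device namespace, so unmodified code that targets "cuda" runs on an MI300X. This is a minimal sketch assuming a ROCm build of PyTorch; it says nothing about performance parity.

```python
import torch

# On ROCm builds of PyTorch, "cuda" is the device string for AMD GPUs too,
# and torch.version.hip is set; on NVIDIA builds it is None.
device = "cuda" if torch.cuda.is_available() else "cpu"
print("HIP runtime:", getattr(torch.version, "hip", None))

# A plain training step: standard autograd path, no custom kernels required.
model = torch.nn.Linear(4096, 4096, dtype=torch.bfloat16).to(device)
x = torch.randn(8, 4096, dtype=torch.bfloat16, device=device)
loss = model(x).float().pow(2).mean()
loss.backward()
```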
Who Uses AMD GPUs
Several large-scale deployments use AMD MI300X:
- Microsoft Azure offers MI300X instances and uses them internally
- Meta has deployed MI300X clusters for training and inference
- Oracle Cloud offers MI300X instances
These deployments validate that AMD hardware works at scale. But they also highlight that adoption requires dedicated software engineering teams to optimize the ROCm stack for specific workloads.
Beyond AMD: the wider 2024-2026 field
AMD is the most visible direct competitor, but it is not the only one. The accelerator landscape in 2024 to 2026 includes at least four other serious efforts, each with a different architectural bet.
Intel Gaudi 3 (2024)
Intel's third-generation Habana Gaudi accelerator. Key design choices:
- BF16 and FP8 focus, with matrix engines tuned for transformer workloads.
- Ethernet-first networking: 24x 200 Gb/s RoCE Ethernet ports per accelerator, in contrast to NVIDIA's NVLink and AMD's Infinity Fabric. Intel is betting that standard Ethernet fabrics scale better than proprietary interconnects.
- Price-performance positioning: Intel targets a lower price per GB of memory than H100, trading peak performance for cost.
- Software: SynapseAI stack plus PyTorch support. Less mature than CUDA, comparable in maturity to ROCm.
Google TPU v5p (2023) and v6e Trillium (2024)
Google's custom TPU line, used internally and rented via Google Cloud. Not sold as discrete hardware.
- TPU v5p (2023): trained Gemini 1.0 and Gemini 1.5. Pod scale reaches thousands of chips with custom ICI interconnect.
- TPU v6e "Trillium" (2024): roughly 4.7x the peak compute per chip of TPU v5e according to Google, with bfloat16 and int8 as the primary datatypes. HBM capacity and bandwidth roughly doubled over v5e.
- Cloud-only: you cannot buy a TPU. This is a strategic constraint, not a technical one. It locks adopters into GCP but frees Google from the retail GPU supply chain.
- Software: JAX and TensorFlow are first-class; PyTorch/XLA works but with more friction than CUDA PyTorch.
Specialized accelerators
Several startups target specific parts of the inference or training pipeline rather than competing as general-purpose GPUs:
- Groq LPUs: SRAM-heavy (no HBM), deterministic execution. Excellent decode latency (low tokens-per-user latency for small batches), poor for prefill (long context ingest) because SRAM capacity is limited. Marketed as a real-time inference API rather than a sold chip.
- Cerebras CS-3: wafer-scale engine with roughly 900,000 cores on a single 46,225 mm² die. Targets training and inference where all weights fit on-chip. Avoids the off-chip memory bandwidth wall entirely, at the cost of a narrow programming model.
- SambaNova SN40L: reconfigurable dataflow architecture. Compiles models to spatial dataflow graphs rather than issuing instructions. Targets enterprise inference with large memory tiers (DDR plus HBM).
None of these have displaced NVIDIA for general-purpose training. Each is viable in a specific niche: Gaudi 3 on price-sensitive training, TPU on Google-internal and GCP workloads, Groq on real-time decode, Cerebras on dense scientific workloads, SambaNova on enterprise inference. AMD remains the broadest alternative for teams that want a drop-in GPU replacement.
Triton and the erosion of CUDA lock-in
One reason CUDA lock-in may weaken over time is OpenAI Triton, a Python-embedded DSL for writing GPU kernels. Triton started NVIDIA-only but now targets multiple backends:
- NVIDIA CUDA (original target).
- AMD ROCm via the upstream Triton ROCm backend.
- Intel XPU via triton-shared and Intel's fork.
The practical effect: torch.compile, PyTorch's just-in-time compiler, uses Triton under the hood to generate fused kernels. A model that relies on torch.compile picks up Triton-generated code automatically, and that code can in principle target any backend with a Triton implementation. This is not yet a complete substitute for hand-tuned CUDA kernels (FlashAttention, cuDNN convolutions, NCCL collectives), but it shifts the lock-in surface from the kernel layer toward libraries and collectives. Every major framework that adopts Triton as its compiler backend incrementally reduces the cost of switching hardware.
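For a sense of what writing a Triton kernel looks like, here is the standard vector-add example from the Triton tutorials, lightly commented. The same Python source can compile for the CUDA or ROCm backends; tuning block sizes per backend is a separate matter.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # which block this program instance handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)    # launch one program per block of 1024 elements
    return out
```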
Why Competition Matters
The consequences of GPU market concentration:
- Pricing: With limited competition, NVIDIA can price H100/B200 systems at high margins. AMD's MI300X is priced lower per unit of memory and bandwidth.
- Supply: When NVIDIA allocates limited supply, organizations without large purchase commitments cannot access GPUs. AMD provides an alternative supply source.
- Vendor lock-in: Code written for CUDA does not trivially move to other platforms. Organizations that invest heavily in CUDA-specific optimizations face switching costs. This lock-in strengthens NVIDIA's position over time.
- Innovation pressure: Competition forces both vendors to improve. NVIDIA's rapid cadence (Hopper to Blackwell to Rubin) is partly a response to AMD's improving competitiveness.
Common Confusions
More HBM does not always mean faster inference
The MI300X has 192GB vs. H100's 80GB, but this matters only if your model needs more than 80GB. For models that fit on one H100 (e.g., 7B-13B models), the extra memory is unused. The bandwidth advantage is always relevant, but the capacity advantage is model-size-dependent.
Peak FLOPS comparisons are misleading
AMD and NVIDIA report peak FLOPS under different conditions (sparsity, data types, sustained vs. burst). The MI300X's 1307 TFLOPS BF16 and the H100's 990 TFLOPS BF16 are not directly comparable because sustained throughput depends on memory bandwidth, cache behavior, and software efficiency. Real-world kernel benchmarks are the only reliable comparison.
ROCm compatibility does not mean performance parity
A PyTorch model that runs on ROCm may achieve 60-80% of the performance of the same model on CUDA, even on hardware with comparable specs. The gap comes from less-optimized kernels, communication libraries, and memory management. The hardware may be competitive; the software is not yet at parity for all workloads.
Exercises
Problem
A 70B parameter model stored in BF16 requires 140GB of weight data. For single-batch autoregressive inference (one token at a time), estimate the maximum tokens per second on (a) H100 at 3.35 TB/s bandwidth and (b) MI300X at 5.3 TB/s bandwidth. Assume the workload is purely memory-bandwidth-bound.
Problem
At what batch size does inference for a 70B BF16 model transition from memory-bandwidth-bound to compute-bound on an H100? Assume each token requires $2 \times 70 \times 10^9 = 1.4 \times 10^{11}$ FLOPs (two FLOPs per parameter for a single forward pass) and the H100 sustains 500 TFLOPS BF16 (roughly half of peak).
References
Canonical:
- Williams, Waterman, Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures" (CACM 2009)
Current:
- AMD Instinct MI300X Whitepaper (2023)
- AMD Instinct MI325X Datasheet (2024)
- AMD ROCm 6.2 Release Notes, including Llama 3 inference benchmark claims (2024)
- NVIDIA H200 Tensor Core GPU Datasheet (2024)
- Intel Gaudi 3 AI Accelerator White Paper (2024)
- Google Cloud, "Introducing Trillium, the sixth generation of Google TPU" (2024)
- Tillet, Kung, Cox, "Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations" (MAPL 2019)
- MLCommons, MLPerf Inference v4.1 Results (2024)
- Jon Peddie Research, quarterly GPU market share reports (2024)
- Omdia, "AI Processor Forecast" quarterly updates (2024)
- Patel, Afzal, "GPU Benchmarking for LLM Inference" (SemiAnalysis, 2024)
Last reviewed: April 18, 2026