Model Comparison Table
Structured comparison of major LLM families as of April 2026: architecture, parameters, context length, open weights, and key strengths, with discussion of what comparison tables cannot tell you.
Why This Matters
Choosing a model for a specific application requires comparing across multiple dimensions: capability, cost, latency, context length, open weights availability, and task-specific performance. No single model dominates on every axis. This page provides a structured factual comparison and explains why naive model comparisons are often misleading.
Model comparison ledger
Compare deployment surfaces, not just names
Snapshot current to April 22, 2026. Use this as a decision frame, then verify exact model IDs and limits in provider docs.
| Provider | Current models | Parameters / architecture | Context length | Strengths | Notes |
|---|---|---|---|---|---|
| OpenAI | GPT-5.4, GPT-5.4 Pro, GPT-5.4 mini/nano | Undisclosed for the GPT-5 family | GPT-5.4 API docs list 1M context and 128K max output; mini/nano list 400K context | Complex reasoning, coding, computer use, tool search, and professional workflows | API IDs, ChatGPT labels, and Codex limits are separate product surfaces. GPT-Rosalind is a trusted-access life-sciences preview, not a general model. |
| Anthropic | Claude Opus 4.7, Sonnet 4.6, Haiku 4.5 | Undisclosed for Claude | Opus 4.7 and Sonnet 4.6 list 1M context; Haiku 4.5 lists 200K | Coding, long document work, vision, enterprise workflows, and controlled output style | Mythos Preview is invitation-only for defensive cybersecurity work through Project Glasswing. |
| Google | Gemini 3.1 previews, Gemini 2.5 stable, Gemma 4 | Partly disclosed across generations; do not infer one Gemini endpoint from another | Gemini 3.1 and Gemini 2.5 text models list million-token-class context; exact limits depend on endpoint | Long-context multimodal analysis, Google product integration, real-time media, and local Gemma deployment | Preview endpoints can be renamed or shut down. Production code should prefer documented stable strings unless the newest preview is required. |
| Open-weight systems | DeepSeek, Llama 4, Qwen3, Gemma, Mistral-family models | Often disclosed: MoE for DeepSeek, Llama 4 Scout/Maverick, and Qwen3-235B-A22B | Ranges from 32K native windows to Llama 4 Scout's 10M-token model-card claim | Local control, fine-tuning, inspectability, privacy boundaries, or high-volume inference economics | Open weights are not automatically cheap. Large MoE models can need enormous memory even when active-parameter compute is low. |
- If correctness matters: build a small evaluation set from your real task. Benchmark deltas of one or two points rarely justify switching providers by themselves.
- If context matters: test retrieval at the beginning, middle, and end of the prompt. Advertised context length is not the same as reliable evidence use.
- If cost matters: compare total cost, not token price alone; latency, retries, human review, caching, batch pricing, and error handling change the answer.
- If privacy matters: shortlist open-weight or private-cloud options first, then check whether their quality is sufficient for the task.
- Architecture is listed only when the provider or model card publishes it.
- Parameter count is not a capability ranking, especially for mixture-of-experts models.
- A preview endpoint can be stronger and less production-stable at the same time.
- Leaderboards are useful for shortlist generation, not final procurement decisions.
Provider-Backed Snapshot
This page is a dated decision frame, not a live leaderboard. The snapshot below is current to April 27, 2026 and intentionally separates published facts from unknowns.
Closed hosted frontier models. OpenAI's current GPT lineup is best read through the GPT-5.5 / GPT-5.5 Pro API and ChatGPT surfaces, with GPT-5.4 still available and GPT-5.3 Instant remaining the lighter everyday-chat surface. Anthropic's current generally available Claude line is Opus 4.7, Sonnet 4.6, and Haiku 4.5. Google's hosted line has stable Gemini 2.5 endpoints and faster-moving Gemini 3 / 3.1 preview endpoints. For all three labs, architecture and parameter counts for the newest closed frontier models are largely undisclosed.
Open-weight families. DeepSeek, Llama 4, Qwen3, Gemma, Mistral-family models, Kimi, and GLM are the families to check when local control, fine-tuning, model inspection, or private deployment matters. The safe rule is simple: use the current model card for exact parameter count, active-parameter count, license, context length, and tool/multimodal claims. Do not copy old numbers across minor releases.
Context-window claims. Treat advertised context length as an input limit, not a guarantee of reliable retrieval. A million-token model can still miss evidence placed in the middle of the prompt. Test with documents shaped like your real workload.
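As a concrete way to run that test, the sketch below sweeps a known fact across positions in a long prompt and checks whether the model still answers correctly. This is a minimal sketch: `call_model`, `build_prompt`, and the termination-fee question are placeholders invented for illustration, not any provider's API; wire `call_model` to whichever SDK or local server you actually use.

```python
# Position-sensitivity check for long-context retrieval (sketch).

def call_model(prompt: str) -> str:
    # Placeholder: replace with a call to your provider's SDK or local server.
    raise NotImplementedError("wire this to the client you actually use")

def build_prompt(filler_docs: list[str], fact: str, position: float) -> str:
    """Insert a known fact at a relative position (0.0 = start, 1.0 = end)."""
    docs = list(filler_docs)
    docs.insert(int(position * len(docs)), fact)
    context = "\n\n".join(docs)
    return f"{context}\n\nQuestion: What is the contract's termination fee?"

def position_sweep(filler_docs: list[str], fact: str, expected_answer: str) -> dict:
    """Return pass/fail at several insertion points, e.g. {0.0: True, 0.5: False, ...}."""
    results = {}
    for position in (0.0, 0.25, 0.5, 0.75, 1.0):
        prompt = build_prompt(filler_docs, fact, position)
        answer = call_model(prompt)
        results[position] = expected_answer.lower() in answer.lower()
    return results
```

A model that passes at the start and end of the prompt but fails in the middle is showing exactly the gap between advertised context length and reliable evidence use.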
Capability claims. Benchmark wins should be read as evidence for a specific harness on a specific date. They are not a proof that a model is better for your product, dataset, latency target, or safety boundary.
What This Table Does Not Tell You
Benchmarks are noisy
Benchmark Score Variance from Evaluation Design
Statement
If a benchmark has $n$ questions and a model's true accuracy is $p$, the observed accuracy $\hat{p}$ has standard error $\sqrt{p(1-p)/n}$. For $n = 500$ questions and $p = 0.85$, the standard error is $\sqrt{0.85 \cdot 0.15 / 500} \approx 0.016$. A 95% confidence interval for the true accuracy is approximately $\hat{p} \pm 1.96 \times 0.016 \approx \hat{p} \pm 0.031$. Two models scoring 0.85 and 0.83 on a 500-question benchmark are not meaningfully different.
Intuition
Benchmark scores are sample estimates of a model's capability on a distribution of tasks. With a finite number of questions, there is sampling noise. This connects directly to hypothesis testing: small differences (1-3 percentage points on a few-hundred-question benchmark) are often within the noise margin and should not be interpreted as genuine capability differences.
Proof Sketch
Each question is a Bernoulli trial with success probability $p$. The sample mean $\hat{p}$ of $n$ independent Bernoulli trials has variance $p(1-p)/n$. By the CLT, the distribution of $\hat{p}$ is approximately normal for large $n$. The 95% CI is $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$.
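A minimal numeric check of the statement above, using only the normal approximation (and treating the two scores as unpaired, which is the same simplification the statement makes):

```python
import math

def accuracy_ci(p_hat: float, n: int, z: float = 1.96):
    """Normal-approximation 95% CI for an observed benchmark accuracy."""
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return se, (p_hat - z * se, p_hat + z * se)

def difference_z(p1: float, p2: float, n: int) -> float:
    """z-statistic for two accuracies on the same n-question benchmark,
    treated as unpaired (a simplification; paired tests are tighter)."""
    se_diff = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    return (p1 - p2) / se_diff

se, ci = accuracy_ci(0.85, 500)
print(f"SE = {se:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")          # SE ≈ 0.016
print(f"z for 0.85 vs 0.83 on 500 questions: {difference_z(0.85, 0.83, 500):.2f}")
# z ≈ 0.86, well below 1.96: the two-point gap is within sampling noise.
```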
Why It Matters
Model comparison tables and leaderboards often rank models by benchmark scores with differences of 1-2 percentage points. These differences are frequently within sampling error. Claiming model A is "better" than model B based on a 0.5% difference on MMLU (approximately 14K questions across all subjects, but individual subject subsets have far fewer) is statistically unjustified.
Failure Mode
This bound assumes questions are independent and identically distributed, which they are not in practice. Benchmark questions cluster by topic, difficulty, and format. Models may systematically succeed or fail on certain clusters. The effective sample size is smaller than $n$ when questions are correlated, making the true confidence interval wider than the formula suggests.
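One way to see how correlation widens the interval is a cluster bootstrap: resample whole topics rather than individual questions. This is a rough sketch under that one assumption, not a substitute for a proper analysis of a specific benchmark.

```python
import random

def clustered_bootstrap_ci(results_by_cluster: dict, n_boot: int = 2000, seed: int = 0):
    """Bootstrap 95% CI for accuracy when questions cluster by topic.

    `results_by_cluster` maps a topic name to a list of 0/1 outcomes.
    Resampling whole clusters (not individual questions) preserves the
    correlation within topics and usually yields a wider interval than
    the i.i.d. formula.
    """
    rng = random.Random(seed)
    clusters = list(results_by_cluster.values())
    accs = []
    for _ in range(n_boot):
        sample = [clusters[rng.randrange(len(clusters))] for _ in clusters]
        flat = [x for cluster in sample for x in cluster]
        accs.append(sum(flat) / len(flat))
    accs.sort()
    return accs[int(0.025 * n_boot)], accs[int(0.975 * n_boot)]

# Tiny illustrative input: per-question 0/1 outcomes grouped by topic.
results = {"algebra": [1, 1, 0, 1], "law": [0, 0, 1, 0], "biology": [1, 1, 1, 1]}
print(clustered_bootstrap_ci(results))
```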
Benchmarks are not your task
A model that scores highest on MMLU may not be best for your specific application. Benchmarks measure performance on standardized question sets. Your task has specific characteristics: domain vocabulary, input format, output requirements, latency constraints, and cost budget. The only reliable way to choose a model is to evaluate candidates on your actual task with your actual data. See model evaluation best practices for a systematic approach.
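A minimal sketch of that kind of task-level comparison, assuming you have already scored two candidate models on the same items from your own data; the random inputs at the bottom are purely illustrative stand-ins for real per-item scores.

```python
import math
import random

def compare_models(model_a_correct: list, model_b_correct: list) -> dict:
    """Paired comparison of two models on the same eval items.

    Each argument is a list of booleans: did the model get item i right?
    Uses a normal approximation to McNemar's test on the discordant pairs.
    """
    assert len(model_a_correct) == len(model_b_correct)
    n = len(model_a_correct)
    acc_a = sum(model_a_correct) / n
    acc_b = sum(model_b_correct) / n

    # Only items where the models disagree carry information about the gap.
    a_only = sum(a and not b for a, b in zip(model_a_correct, model_b_correct))
    b_only = sum(b and not a for a, b in zip(model_a_correct, model_b_correct))
    discordant = a_only + b_only
    z = 0.0 if discordant == 0 else (a_only - b_only) / math.sqrt(discordant)

    return {"acc_a": acc_a, "acc_b": acc_b, "z": z,
            "significant_at_95": abs(z) > 1.96}

# Illustrative random stand-ins for 200 items scored on *your* task.
random.seed(0)
a = [random.random() < 0.87 for _ in range(200)]
b = [random.random() < 0.85 for _ in range(200)]
print(compare_models(a, b))
```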
Contamination
Many benchmarks are partially or fully present in training data. Models may have memorized specific benchmark questions, inflating their scores. Newer benchmarks (like GPQA, SWE-bench, or LiveCodeBench) are more resistant to contamination, but no benchmark is immune once it becomes widely used.
Pricing changes frequently
API pricing is a competitive lever. Providers cut prices regularly. Any specific pricing comparison is outdated within months. The structural insight is more durable: smaller and faster model tiers usually cost far less per token than maximum-capability reasoning tiers. For many extraction, classification, and routing tasks, the cheaper model is sufficient.
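A rough way to make that comparison concrete is to fold retries and human escalation into the monthly figure. All rates and prices below are placeholders; substitute your own measurements and the provider's current price sheet.

```python
def monthly_cost(requests_per_month, avg_input_tokens, avg_output_tokens,
                 usd_per_m_input, usd_per_m_output,
                 retry_rate=0.05, human_review_rate=0.02,
                 usd_per_human_review=1.50):
    """Rough monthly total cost: tokens, retries, and human escalation."""
    calls = requests_per_month * (1 + retry_rate)
    token_cost = calls * (avg_input_tokens * usd_per_m_input +
                          avg_output_tokens * usd_per_m_output) / 1_000_000
    review_cost = requests_per_month * human_review_rate * usd_per_human_review
    return token_cost + review_cost

# With these placeholder numbers, a cheaper tier that escalates to humans
# more often ends up costing more per month than the pricier tier.
print(monthly_cost(1_000_000, 500, 150, 0.50, 2.00, human_review_rate=0.02))
print(monthly_cost(1_000_000, 500, 150, 0.05, 0.20, human_review_rate=0.10))
```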
Axes of Comparison
Architecture: Dense vs. MoE
Dense models activate all parameters for every token; mixture-of-experts (MoE) models activate a subset of experts per token. Public model cards for families such as DeepSeek, Llama 4, Qwen3, Kimi, and GLM report MoE variants; for closed frontier models whose architecture is undisclosed, do not infer dense or MoE design from benchmark behavior alone. Both approaches build on the core transformer architecture. The trade-offs (a numeric sketch follows the list):
- MoE advantage: lower compute per token, so faster and cheaper inference for the same quality level.
- MoE disadvantage: higher memory footprint (all experts must be loaded), more complex training (load balancing, expert collapse).
- Dense advantage: simpler to train, deploy, and quantize. No routing overhead.
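The back-of-the-envelope sketch below shows the trade-off numerically, using the common rule of thumb that a forward pass costs roughly 2 FLOPs per active parameter per token. The parameter counts are illustrative, not any specific model's disclosed figures.

```python
def per_token_flops(active_params: float) -> float:
    """Forward-pass FLOPs per token ≈ 2 × active parameters (rough rule of thumb)."""
    return 2 * active_params

def weight_memory_gb(total_params: float, bytes_per_param: int = 2) -> float:
    """Memory to hold the weights; 2 bytes per parameter assumes fp16/bf16."""
    return total_params * bytes_per_param / 1e9

# Illustrative counts only.
moe_total, moe_active = 400e9, 30e9
dense_total = 70e9

print("MoE  : %.0f GFLOPs/token, %.0f GB weights" %
      (per_token_flops(moe_active) / 1e9, weight_memory_gb(moe_total)))
print("Dense: %.0f GFLOPs/token, %.0f GB weights" %
      (per_token_flops(dense_total) / 1e9, weight_memory_gb(dense_total)))
# MoE does less math per token (60 vs 140 GFLOPs) but needs ~800 GB vs ~140 GB
# to hold the weights: cheaper compute, much larger memory footprint.
```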
Context Length
Long-context claims should be grouped by use case, not memorized as a ranking:
- Million-token-class hosted models are useful for whole repositories, large legal records, long transcripts, and multi-document research. GPT-5.5, Claude Opus 4.7 / Sonnet 4.6, and Gemini 2.5 / 3.x models all have documented million-token-class surfaces in at least some product or API settings.
- Open-weight long-context models can be attractive when the privacy boundary matters. Llama 4, Kimi, GLM, Qwen, and DeepSeek-family releases should be checked against the current model card before deployment.
- 128K is already enough for many products. If the task is a short support ticket, a single paper, or a structured extraction job, a larger window may add latency and cost without improving the answer.
Context length matters for tasks that require processing large documents, codebases, or conversation histories. For shorter inputs, a larger context window provides little benefit.
Open Weights
Open weights enable: fine-tuning for specific domains, running inference on your own hardware (no API dependency), inspecting model internals for research, and quantization for deployment on smaller hardware.
The main open-weight shortlist to check in April 2026 is DeepSeek, Llama 4, Qwen3, Gemma, Mistral-family models, Kimi, and GLM. The main closed-hosted shortlist is GPT-5.x, Claude, and Gemini frontier models. That division is about access and deployment control, not absolute capability.
For production applications with data privacy requirements or high-volume inference, open-weight models can be cheaper than API access. At low volume, hosted APIs are often cheaper once hardware, serving, monitoring, and engineer time are counted.
Reasoning Models
A distinct category emerged in 2024-2025: models trained specifically for multi-step reasoning via reinforcement learning.
- OpenAI: GPT-5.5, GPT-5.4, and earlier o-series releases expose reasoning-time control through product and API settings.
- Anthropic: Claude 4.x models expose extended or adaptive thinking controls in supported API settings.
- Google: Gemini 2.5 and 3.x models make thinking modes part of the Gemini Pro / preview story.
- Open-weight systems: DeepSeek-R1, Qwen3, Kimi, and related releases expose variants of reasoning, tool use, or switchable thinking modes.
These models trade latency for accuracy. On math, coding, and long-horizon agent tasks, extra reasoning can help. On simple routing, extraction, or reformatting, it can waste tokens.
How to Choose a Model
There is no universal best model. The right choice depends on your constraints:
1. What is your task? Coding tasks, multilingual tasks, multimodal tasks, long-context tasks, and low-latency routing tasks favor different shortlists. Start with the task shape, then choose candidates.
2. What are your cost constraints? For high-volume applications, open-weight models with self-hosted inference are often cheapest. For low-volume or prototyping, API access is simpler.
3. Do you need open weights? If you need to fine-tune, run on-premise, or inspect model internals, start with open-weight families such as Llama, Qwen, DeepSeek, Kimi, GLM, Gemma, and Mistral.
4. What context length do you need? If you need to process documents longer than 128K tokens, shortlist million-token-class hosted models and the current open-weight long-context model cards. Then test retrieval accuracy at the positions that matter, not just the advertised limit.
5. What latency do you need? Smaller models are often much faster than flagship reasoning models. For real-time applications, latency often matters more than marginal quality.
The most common mistake is choosing based on benchmark rankings. Evaluate on your task. A model that scores 2% lower on MMLU but costs 5x less and is 3x faster may be the correct choice for your application.
Common Confusions
Higher benchmark score does not mean better for your task
MMLU measures broad knowledge. HumanEval measures Python coding. MATH measures competition math. A model that tops one benchmark may underperform on your specific domain. Always evaluate on data representative of your actual use case. Benchmark rankings are starting points for a shortlist, not final decisions.
Parameter count is not a capability measure
An MoE model can report a very large total parameter count while activating far fewer parameters per token. A dense model can report fewer total parameters while doing more per-token compute. Training data quality, post-training, evaluation harness, tool use, and serving setup all matter. Comparing models by parameter count alone is misleading.
Open weights does not mean free
Running a 405B parameter model requires multiple high-end GPUs. The hardware cost for self-hosted inference can exceed API costs at low volume. Open weights are cost-effective only at sufficient scale (typically thousands to millions of requests per day) or when fine-tuning and data privacy are requirements.
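A rough break-even sketch, with placeholder prices rather than any provider's actual rates, makes the scale argument concrete.

```python
def api_monthly_cost(requests_per_day, tokens_per_request, usd_per_m_tokens):
    """Pay-per-token API cost over a 30-day month."""
    return requests_per_day * 30 * tokens_per_request * usd_per_m_tokens / 1e6

def breakeven_requests_per_day(self_host_fixed_usd_month, tokens_per_request,
                               usd_per_m_tokens):
    """Daily volume above which a fixed self-hosting bill beats per-token pricing.

    Ignores engineering time, on-call, and GPU utilization, which usually push
    the real break-even point higher.
    """
    per_request = tokens_per_request * usd_per_m_tokens / 1e6
    return self_host_fixed_usd_month / (30 * per_request)

# Placeholder numbers: $15,000/month of hardware, 1,000 tokens/request, $1.00/1M tokens.
print(breakeven_requests_per_day(15_000, 1_000, 1.00))  # ~500,000 requests/day
```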
Frontier is a moving target
Any specific comparison in this table will be outdated within months. New model releases occur roughly quarterly from each major lab. The structural patterns (MoE vs. dense trade-offs, cost tiers, open vs. closed dynamics) are more durable than specific benchmark numbers.
Exercises
Problem
A benchmark has 1000 questions. Model A scores 87.2% and Model B scores 85.8%. Compute the standard error for each score and determine whether the difference is statistically significant at the 95% confidence level.
Problem
An MoE model has 600B total parameters and activates 40B per token. A dense model has 70B parameters (all active). Compare the per-token compute cost (in terms of FLOPs, proportional to active parameters) and the memory required to load each model in float16.
Problem
You are choosing between three models for a customer support chatbot processing 10 million messages per month. Model A (API): 0.50 USD per 1M input tokens, 95% task accuracy. Model B (open weights, self-hosted): 15,000 USD/month fixed infrastructure cost, unlimited tokens, 93% task accuracy. Model C (API, cheaper): 0.05 USD per 1M input tokens, 88% task accuracy. Average message length: 500 tokens. Which model is cheapest? At what accuracy threshold would you switch from the cheapest to a more expensive option?
References
Canonical:
- Liang et al., "Holistic Evaluation of Language Models" (HELM, Stanford, 2023)
- Srivastava et al., "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models" (BIG-bench, 2023)
Current provider and model-card references:
- LMArena, ongoing human-preference model comparison; use as a dated shortlist aid, not final evaluation.
- Artificial Analysis, ongoing latency, throughput, price, and capability measurements; use with the measurement date attached.
- OpenAI, "Models" API documentation (accessed Apr 27, 2026), https://developers.openai.com/api/docs/models
- OpenAI, "Introducing GPT-5.5" (Apr 23, 2026), https://openai.com/index/introducing-gpt-5-5/
- OpenAI, "GPT-5.5 System Card" (Apr 23, 2026), https://openai.com/index/gpt-5-5-system-card/
- OpenAI, "Introducing GPT-5.4" (Mar 5, 2026), https://openai.com/index/introducing-gpt-5-4/
- OpenAI, "Introducing GPT-Rosalind for life sciences research" (Apr 16, 2026), https://openai.com/index/introducing-gpt-rosalind/
- Anthropic, "Models overview" API documentation (accessed Apr 22, 2026), https://platform.claude.com/docs/en/about-claude/models/overview
- Anthropic, "Introducing Claude Opus 4.7" (Apr 16, 2026), https://www.anthropic.com/news/claude-opus-4-7
- Google AI for Developers, "Gemini models" (accessed Apr 22, 2026), https://ai.google.dev/gemini-api/docs/models
- Google AI for Developers, "Gemini API deprecations" (accessed Apr 22, 2026), https://ai.google.dev/gemini-api/docs/deprecations
- Google, "Gemini 3.1 Pro" (Feb 19, 2026), https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro
- Google, "Gemma 4" (Apr 2, 2026), https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/
- Meta, Llama 4 model cards (Apr 5, 2025), https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E
- Qwen Team, Qwen3 model card (2025), https://huggingface.co/Qwen/Qwen3-235B-A22B
- Moonshot AI, Kimi K2.6 model card (Apr 2026), https://huggingface.co/moonshotai/Kimi-K2.6
- Z.ai, GLM-5 model card (Feb 2026), https://huggingface.co/zai-org/GLM-5
Model technical reports:
- DeepSeek-AI, "DeepSeek-V3 Technical Report" (2024), arXiv:2412.19437
- Guo et al., "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025), arXiv:2501.12948
- DeepSeek API Docs, "DeepSeek-V3.1 Release" (Aug 21, 2025), "Introducing DeepSeek-V3.2-Exp" (Sep 29, 2025), and "DeepSeek-V3.2 Release" (Dec 1, 2025)
- Meta, "The Llama 3 Herd of Models" (2024), arXiv:2407.21783
Benchmark papers:
- Hendrycks et al., "Measuring Massive Multitask Language Understanding" (MMLU, 2021), arXiv:2009.03300
- Chen et al., "Evaluating Large Language Models Trained on Code" (HumanEval, 2021), arXiv:2107.03374
- Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" (2023), arXiv:2310.06770
- Rein et al., "GPQA: A Graduate-Level Google-Proof Q&A Benchmark" (2023), arXiv:2311.12022
- Jain et al., "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" (2024), arXiv:2403.07974
Next Topics
This is a living reference. Consult primary technical reports and evaluation platforms for the latest data.
Last reviewed: April 27, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- DeepSeek Models (layer 5 · tier 1)
- Mistral Models (layer 4 · tier 2)
- Transformer Architecture (layer 4 · tier 2)
- Claude Model Family (layer 5 · tier 2)
- Gemini and Google Models (layer 5 · tier 2)
Derived topics
No published topic currently declares this as a prerequisite.