
Model Timeline

LLaMA and Open Weight Models

The open weight movement in large language models: LLaMA 1/2/3, the ecosystem of fine-tuning and quantization tools, and why open weights changed the dynamics of AI research.


Why This Matters

Before LLaMA (February 2023), state-of-the-art language models were accessible only through APIs controlled by a handful of companies. Researchers could not inspect weights, reproduce results, or build on the models freely. LLaMA changed this by releasing competitive model weights to the research community. Within weeks, the weights leaked publicly. Within months, an ecosystem of fine-tuning, quantization, and inference tools emerged. Open weight models made it possible for individual researchers and small companies to run, study, and modify frontier-class language models.

Open-weight model family at a glance

The pages below trace each release in detail. This table is the index: which model fits which deployment regime, what trained it, and what the headline architectural change was.

| Family | Year | Sizes (B params, active / total for MoE) | Architecture | Context window | License | Notable |
|---|---|---|---|---|---|---|
| LLaMA 1 | 2023 | 7 / 13 / 33 / 65 | dense | 2K | research-only | first open competitive model; spawned Alpaca, Vicuna |
| Llama 2 | 2023 | 7 / 13 / 70 | dense, GQA | 4K | community | first commercial-use open-weight; native chat fine-tune |
| Llama 3 | 2024 | 8 / 70 | dense, GQA | 8K | community | Meta's tokenizer overhaul; data scale-up |
| Llama 3.1 | 2024 | 8 / 70 / 405 | dense, GQA | 128K | community | first open frontier model at 405B |
| Llama 3.2 | 2024 | 1 / 3 / 11 / 90 | dense + vision | 128K | community | small + multimodal track |
| Llama 3.3 | 2024 | 70 | dense, GQA | 128K | community | refresh of the 70B with stronger post-training |
| Llama 4 | 2025 | Scout 17/109, Maverick 17/400, Behemoth 288/2000 | MoE, native multimodal | up to 10M | community | first MoE family from Meta; longest open context |
| Mistral 7B / Mixtral | 2023-2024 | 7 / 8x7 / 8x22 | dense, MoE | 32K | Apache 2.0 / community | sliding-window attention; first open MoE |
| DeepSeek V2 / V3 | 2024-2025 | 21/236 (V2), 37/671 (V3) | MoE, MLA | 128K | DeepSeek License | aggressive MoE scaling; price-competitive API |
| DeepSeek-R1 | 2025 | 37 / 671 + distilled 1.5-70B | MoE | 128K | DeepSeek License | RL-on-verifiable-reward open frontier |
| Qwen2.5 / Qwen3 | 2024-2025 | 0.5 to 72 (dense) + MoE variants | dense + MoE | up to 128K | Apache / Qwen License | strong multilingual + code |
| Gemma 2 / 3 | 2024-2025 | 2 / 9 / 27 | dense | 8K-128K | Gemma License | Google's open track; on-device variants |

The deployment heuristic. Pick the smallest model that meets your accuracy bar on a representative eval set, not the largest you can afford. A 70B model loaded in 4-bit quantization on an 80GB GPU is a different latency / throughput regime than an 8B model in BF16; the 70B will not always be the better choice once cost per token enters the comparison. Open weights make this comparison empirical instead of vendor-pitch-driven.
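The napkin math behind this heuristic is easy to sketch (raw weight storage only; KV cache, activations, and runtime overhead come on top):

```python
# Back-of-envelope weight memory for the deployment heuristic.
# Illustrative only: real serving adds KV cache, activations, and
# framework overhead on top of raw weight storage.

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Raw weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

# 70B in 4-bit (0.5 bytes/param) vs 8B in BF16 (2 bytes/param)
mem_70b_4bit = weight_memory_gb(70, 0.5)   # 35.0 GB -> one 80GB GPU
mem_8b_bf16  = weight_memory_gb(8, 2.0)    # 16.0 GB -> one 24GB GPU

print(f"70B @ 4-bit: {mem_70b_4bit:.1f} GB")
print(f"8B  @ BF16 : {mem_8b_bf16:.1f} GB")
```

Both fit on a single GPU, but in very different memory classes; whether the 70B's quality gain is worth the hardware jump is exactly the empirical question the eval set answers.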

LLaMA 1 (February 2023)

Architecture. Standard decoder-only transformer with modifications from recent literature: RMSNorm (pre-normalization), SwiGLU activation function, and rotary positional embeddings (RoPE). Context length: 2048 tokens.

Sizes. 7B, 13B, 33B, and 65B parameters.

Training. Trained on 1.0T to 1.4T tokens of publicly available data (Common Crawl, C4, GitHub, Wikipedia, Books, ArXiv, Stack Exchange). No proprietary data. The 65B model used approximately 2048 A100 GPUs for 21 days.

Key result. LLaMA-13B outperformed GPT-3 (175B) on most benchmarks. LLaMA-65B was competitive with Chinchilla (70B) and PaLM (540B). This demonstrated that smaller models trained on more tokens (following Chinchilla-optimal scaling) can match much larger models trained on fewer tokens.

What it taught us. Two things. First, open data + Chinchilla-optimal training yields competitive models. Second, the research community can do significant work with model weights alone, even without access to training infrastructure. Within weeks, Alpaca (Stanford) fine-tuned LLaMA-7B on 52K instruction-following examples for under 600 USD, producing a model that qualitatively matched GPT-3.5 on simple tasks.

LLaMA 2 (July 2023)

Improvements. Trained on 2T tokens (40% more than LLaMA 1). Context length extended to 4096 tokens. Grouped-Query Attention (GQA) for the 34B and 70B models, reducing memory during inference.

Sizes. 7B, 13B, 34B (not released), and 70B parameters.

Chat models. LLaMA 2 included chat-optimized variants trained with RLHF, similar to the InstructGPT recipe. These were the first open weight models with competitive instruction-following behavior.

License. Released under a custom commercial license allowing use by organizations with fewer than 700M monthly active users. This was a significant shift from LLaMA 1's research-only license.

What it taught us. Open models can include post-training alignment (RLHF) and still be released openly. The commercial license enabled startups and enterprises to build products on open weight models.

LLaMA 3 (April 2024)

Improvements. Trained on over 15T tokens, approximately 7x more than LLaMA 2. Larger vocabulary: 128K tokens (up from 32K). Context length: 8192 tokens, with extended-context variants reaching 128K.

Sizes. 8B and 70B initially, with a 405B model released later.

Key result. LLaMA 3 70B was competitive with GPT-4 on many benchmarks. LLaMA 3 405B approached or matched GPT-4 on most tasks. The 8B model outperformed LLaMA 2 70B on several benchmarks, demonstrating the impact of training on significantly more data.

What it taught us. Training data quantity and quality continue to be the dominant factor. The 8B model trained on 15T tokens outperforms a 70B model trained on 2T tokens on many tasks. This is the Chinchilla insight taken further: keep scaling tokens, not just parameters.

LLaMA 3.1 (July 2024)

The paper arXiv:2407.21783 ("The Llama 3 Herd of Models") documents the full 3.1 release. The headline change is context: all three sizes (8B, 70B, 405B) support 128K tokens, extended through a continued pre-training stage after the initial 8K-context phase.

Post-training. Instruction tuning used synthetic data generation and iterative DPO. The 405B model was used to generate synthetic data for smaller variants — an early example of self-distillation at scale.

Multilingual. Support added for German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

Key result. Llama 3.1 405B matched or exceeded GPT-4o on several benchmarks. The 8B model surpassed many 13B-class models from earlier generations, again confirming that at these scales more training tokens beat more parameters.

LLaMA 3.2 (September 2024)

Meta released Llama 3.2 on September 25, 2024, adding two new capability categories.

Lightweight text models (1B, 3B). Designed for on-device and edge deployment. Both support 128K context and are optimized for Arm processors (Qualcomm, MediaTek hardware). Suitable for summarization and instruction following at low power budgets.

Vision models (11B, 90B). Meta's first open weight models with image understanding. The architecture adds adapter weights connecting a pre-trained image encoder to the language model backbone: the 11B and 90B vision models build on the 8B and 70B text backbones respectively, acting as drop-in replacements with vision added. These models handle chart understanding, image captioning, and visual grounding tasks.

LLaMA 3.3 (December 2024)

Released December 6, 2024. A single 70B text-only model with updated instruction tuning. It matches Llama 3.1 405B performance on text tasks at roughly one-sixth the serving cost — the result of improved fine-tuning data and training procedures applied to the smaller architecture. Llama 3.3 70B is the practical default for teams that need high capability without the hardware requirements of a 405B model.

Llama 4 (April 2025)

Meta released Llama 4 on April 5, 2025, introducing a mixture-of-experts architecture across the family for the first time.

Scout. 17B active parameters, 16 experts, 109B total parameters. 10 million token context window. Runs on a single GPU. The long context is the headline feature; it allows processing entire codebases or long documents in one pass.

Maverick. 17B active parameters, 128 experts, 400B total parameters. 1 million token context window. Requires multi-GPU deployment for self-hosted use; available via API providers.

Behemoth. 288B active parameters, 16 experts, ~2T total parameters. As of the April 2025 release announcement, Behemoth was still in training and public weights had not been released. Meta described it as a teacher model used for codistillation into Scout and Maverick.

All Llama 4 models are natively multimodal (text and image input). The shift to MoE allows Meta to scale parameter count without proportional increases in inference cost.

The Open Weight Ecosystem

The release of model weights enabled a tooling ecosystem that did not exist when models were API-only.

Fine-Tuning Tools

LoRA (Low-Rank Adaptation). Instead of updating all parameters during fine-tuning, decompose weight updates into low-rank matrices: $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ with $r \ll d$. This reduces trainable parameters per matrix from $d^2$ to $2dr$.
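A minimal sketch of the LoRA forward pass (illustrative numpy, not the peft implementation; the dimensions and scaling factor here are arbitrary choices):

```python
import numpy as np

# Minimal LoRA sketch. The frozen weight W stays fixed; only A and B
# train, giving Delta W = B @ A with rank r << d.
rng = np.random.default_rng(0)
d, r = 512, 8

W = rng.normal(size=(d, d))               # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d))   # trainable, r x d
B = np.zeros((d, r))                      # trainable, d x r; zero init => Delta W = 0 at start

def lora_forward(x: np.ndarray, alpha: float = 16.0) -> np.ndarray:
    # y = x W^T + (alpha / r) * x (BA)^T — low-rank path adds two thin matmuls
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(4, d))
y = lora_forward(x)                       # equals x @ W.T until B trains away from zero

full_params = d * d                       # 262,144 if all of Delta W were trained
lora_params = A.size + B.size             # 2 * d * r = 8,192
print(full_params, lora_params)
```

The zero initialization of $B$ is the standard LoRA trick: the adapted model starts out exactly equal to the base model, and fine-tuning moves it away gradually.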

QLoRA. Combine 4-bit quantization of the base model with LoRA fine-tuning. This allows fine-tuning a 65B model on a single 48GB GPU, which is impossible with full-precision fine-tuning.
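The arithmetic behind that claim can be sketched directly (illustrative assumptions, not the QLoRA paper's exact accounting: ~84M LoRA parameters, adapters plus gradients plus two Adam moments all in BF16, activation memory ignored):

```python
# Rough QLoRA memory budget for a 65B model on a single 48GB GPU.
# Assumed numbers: ~84M LoRA params (a typical rank-64 configuration),
# BF16 (2 bytes) for adapter weights, gradients, and two Adam moments.
GB = 1e9

base_4bit = 65e9 * 0.5 / GB             # frozen base weights at 4-bit: 32.5 GB
lora_params = 84e6                      # trainable adapter parameters (assumed)
adapter_mem = lora_params * 2 * 4 / GB  # weights + grad + 2 Adam states, 2 bytes each

total = base_4bit + adapter_mem
print(f"~{total:.1f} GB of 48 GB")      # weights fit; the remainder goes to activations
```

In full precision the base weights alone would be 130GB, which is why full fine-tuning needs a multi-GPU node while QLoRA does not.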

Quantization

Open weight releases have consistently shipped pre-quantized variants (GGUF for CPU inference, GPTQ/AWQ for GPU), making deployment accessible without quantization expertise. For the full treatment of quantization methods (including GPTQ, AWQ, BitNet, and KV cache compression), see quantization theory. What matters for open weight models specifically is that the ecosystem provides quantized variants for every major release: 4-bit GGUF files for llama.cpp CPU inference, and INT4/INT8 GPU variants on HuggingFace. The 405B model in 4-bit requires roughly 200GB, achievable on two or three high-memory nodes. The 8B model in 4-bit requires roughly 4GB, running on a single consumer GPU.

Main Theorems

Proposition

Quantization Error for Uniform Scalar Quantization

Statement

For uniform scalar quantization of a weight $w \in [a, b]$ to $2^k$ levels, the maximum quantization error per weight is:

$$|w - Q(w)| \leq \frac{b - a}{2^{k+1}}$$

The mean squared quantization error, assuming weights are uniformly distributed in $[a, b]$, is:

$$\mathbb{E}[(w - Q(w))^2] = \frac{(b - a)^2}{12 \cdot 2^{2k}}$$

For $k = 4$ (16 levels) and a typical weight range, this is small enough that model quality degrades only modestly.

Intuition

Dividing the weight range into $2^k$ equal intervals means each weight is rounded to the nearest grid point. The worst case is when the weight falls exactly between two grid points. More bits mean a finer grid and a smaller error. The practical question is whether the cumulative effect of small per-weight errors degrades model outputs.

Proof Sketch

The maximum distance from any point in $[a, b]$ to the nearest quantization level is half the interval width: $(b-a)/(2 \cdot 2^k) = (b-a)/2^{k+1}$. The MSE follows from computing the variance of a uniform distribution on an interval of width $(b-a)/2^k$.

Why It Matters

This bounds the per-weight error, but the critical question is how errors accumulate through the network. Empirically, 4-bit quantization (with calibration) preserves most model quality. 2-bit quantization causes noticeable degradation. The gap between worst-case bounds and empirical behavior suggests that weight distributions have structure (near-Gaussian, not uniform) that makes quantization more forgiving than worst-case analysis predicts.

Failure Mode

Uniform quantization assumes a uniform weight distribution. Real weight distributions are approximately Gaussian with outliers. Outlier weights far from the mean are poorly quantized because most quantization levels are wasted on the dense central region. Methods like GPTQ and AWQ handle this by using non-uniform quantization or isolating outlier channels.
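Both the bounds and the failure mode are easy to check empirically. A sketch (the mid-point quantizer and the injected outlier values are illustrative choices):

```python
import numpy as np

def quantize(w, a, b, k):
    """Uniform scalar quantization of w in [a, b] to 2^k mid-point levels."""
    step = (b - a) / 2 ** k
    idx = np.clip(np.floor((w - a) / step), 0, 2 ** k - 1)
    return a + (idx + 0.5) * step

rng = np.random.default_rng(0)
a, b, k = -1.0, 1.0, 4

# Uniform weights: worst-case error and MSE match the proposition.
w = rng.uniform(a, b, size=100_000)
err = np.abs(w - quantize(w, a, b, k))
bound = (b - a) / 2 ** (k + 1)                 # 0.0625
mse_pred = (b - a) ** 2 / (12 * 2 ** (2 * k))  # ~0.0013
emp_mse = np.mean(err ** 2)

print(err.max() <= bound)                # True: per-weight bound holds
print(abs(emp_mse - mse_pred) < 1e-4)    # True: empirical MSE matches formula

# Failure mode: a few outliers force a wide [a, b], so most levels sit
# in empty tails and the dense center sees a coarse step.
g = rng.normal(0, 0.05, size=100_000)
g[:10] = 3.0                             # injected outliers
print(np.mean((g - quantize(g, g.min(), g.max(), k)) ** 2))
```

With the outliers present, the MSE on the near-Gaussian bulk exceeds the variance of the weights themselves, which is why GPTQ/AWQ-style outlier handling matters in practice.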

Mistral and the European Open Weight Tier

Mistral 7B (September 2023). Outperformed LLaMA 2 13B on most benchmarks with only 7B parameters. Introduced sliding window attention for efficient long-context processing.

Mixtral 8x7B (December 2023, arXiv:2401.04088). Mixture-of-experts with 47B total parameters but only 13B active per token, using 8 experts with top-2 routing.
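A toy sketch of top-2 routing (illustrative; in the real model routing happens per token inside every transformer block, and the experts are SwiGLU MLPs rather than plain matrices):

```python
import numpy as np

# Toy top-2 mixture-of-experts routing in the style of Mixtral.
rng = np.random.default_rng(0)
n_experts, d = 8, 16

gate_W = rng.normal(size=(d, n_experts))           # router weights
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ gate_W                            # one score per expert
    top2 = np.argsort(logits)[-2:]                 # indices of the best two
    weights = np.exp(logits[top2] - logits[top2].max())
    weights /= weights.sum()                       # softmax over the selected two
    # Only 2 of 8 expert matmuls run per token: active compute ~ 2/8 of total.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top2))

x = rng.normal(size=(d,))
y = moe_layer(x)
print(y.shape)   # (16,)
```

This is the source of the active-vs-total parameter split in the table above: all 8 experts must live in memory, but each token pays compute for only 2.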

Mixtral 8x22B (April 2024). Released April 17, 2024. 141B total parameters, 39B active per token, 8 experts, 64K context window. Apache 2.0 license. Competitive with LLaMA 3 70B while activating fewer parameters per forward pass.

Mistral Nemo 12B (July 2024). Released July 18, 2024, in collaboration with NVIDIA. 12B dense model with 128K context. Uses a new tokenizer (Tekken, based on Tiktoken) trained on 100+ languages; roughly 30% more efficient than previous Mistral tokenizers at compressing source code and non-English text. Released under Apache 2.0.

Mistral Large 2 (July 2024). Released July 24, 2024. Dense 123B-parameter model, 128K context window. Designed for single-node inference. Strong code generation and math performance, competitive with Llama 3.1 405B on several benchmarks. Released under a research license.

DeepSeek Open Models

DeepSeek-V3 (December 2024, arXiv:2412.19437). 671B total parameters, 37B active per token (MoE). Pre-trained on 14.8T tokens. Introduces Multi-head Latent Attention (MLA) and an auxiliary-loss-free load-balancing strategy. Trained in 2.788M H800 GPU hours — unusually efficient for a model at this scale. Weights open on HuggingFace; competitive with leading closed models on coding and math benchmarks.

DeepSeek-R1 (January 2025, arXiv:2501.12948). Adds reasoning capability to the DeepSeek-V3 backbone via reinforcement learning, without requiring human-labeled reasoning chains. An intermediate model (DeepSeek-R1-Zero) trained on pure RL develops self-reflection and verification behaviors but suffers readability issues. DeepSeek-R1 fixes this by adding a cold-start supervised phase before RL. Performance is competitive with OpenAI-o1 on math, code, and reasoning. Meta and Qwen distillations were also released. This model demonstrated that reasoning capabilities can be injected post-training through RL rather than baked into pre-training data.

Other Significant Open Weight Models

Qwen (Alibaba). The Qwen2 series (2024) produced strong multilingual models at 7B, 14B, 32B, and 72B, competitive with LLaMA 3 at each size class. Qwen3 (announced April 28, 2025) extended this further: flagship MoE is Qwen3-235B-A22B (235B total, 22B active), with six open-weight dense variants from 0.6B to 32B under Apache 2.0. Trained on over 36 trillion tokens with 119-language coverage. The Qwen3-30B-A3B MoE matches QwQ-32B with 10x fewer active parameters.

Gemma (Google). 2B and 7B models (2024) trained on large proprietary datasets; competitive at smaller sizes. Gemma 2 extended to 9B and 27B. Useful for research because Google provides model cards with detailed training data descriptions.

GLM-5 (z.ai, February 2026). 744B total parameters, 40B active per token (MoE). Pre-trained on 28.5T tokens. Integrates DeepSeek Sparse Attention (DSA) to reduce deployment cost while preserving 200K-token context capacity. Open-weight release under MIT license on HuggingFace (zai-org/GLM-5). Includes an asynchronous RL infrastructure (Slime) for post-training. API access at chat.z.ai and OpenRouter.

Why Open Weights Matter

Reproducibility. ML research requires running experiments on models. API access does not allow modifying architectures, inspecting internal representations, or controlling inference exactly. Open weights enable mechanistic interpretability, ablation studies, and controlled experiments.

Fine-tuning. Organizations can adapt open models to specific domains (medicine, law, finance) without sending sensitive data to third-party APIs. This addresses privacy and compliance requirements.

Cost. Running inference on your own hardware can be 10-100x cheaper than API access for high-volume applications, especially with quantization.

Resilience. API providers can change pricing, terms, or model behavior at any time. Open weight models provide stability: the weights you download today will work identically forever.

Common Confusions

Watch Out

Open weights is not open source

Releasing model weights is not the same as open source. Open source means releasing code, data, training scripts, and weights under a permissive license. LLaMA releases weights and inference code but not training data or the full training pipeline. You can use the model but cannot fully reproduce the training.

Watch Out

Quantization is not always free

4-bit quantization typically costs 1-3% on benchmarks compared to float16. But this average hides variation: some tasks (especially reasoning-heavy or low-frequency knowledge) degrade more than others. Always evaluate quantized models on your specific task before deploying.

Watch Out

Smaller open models do not replace larger closed models

LLaMA 3.1 405B and Llama 4 Maverick are competitive with closed frontier models on many benchmarks, but for the hardest multi-step reasoning and long-context tasks, the gap narrows more slowly. Benchmark parity does not imply production parity: closed models are also continuously updated.

Watch Out

MoE models are not always cheaper to run

An MoE model's per-token compute is set by its active parameters: Llama 4 Maverick activates 17B of its 400B total per token, so its forward-pass FLOPs resemble those of a ~17B dense model. But memory requirements follow total parameter count: serving Maverick needs far more GPU memory than LLaMA 3.1 70B dense, despite the lower per-token compute, because all expert weights must be loaded into memory, not just the active ones.

Canonical Examples

Example

QLoRA fine-tuning cost estimate

Fine-tuning LLaMA 2 70B with QLoRA on 50,000 instruction examples. Base model quantized to 4-bit: 35GB, fits on one 48GB A100. LoRA rank $r = 64$: trainable parameters $\approx 2 \times 64 \times 8192 \times 80 \text{ layers} \approx 84\text{M}$ (0.12% of total). Training time: approximately 8 hours on one A100 ($16 on cloud). Compare to full fine-tuning: requires 8x A100 80GB GPUs, ~$500+. QLoRA achieves 95-99% of full fine-tuning quality at 3% of the cost.
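Checking the arithmetic above (assuming, as the example does, one $d \times d$ matrix per layer with $d = 8192$, $r = 64$, across 80 layers):

```python
# Verify the trainable-parameter count from the QLoRA cost example.
# Assumption: LoRA applied to one d x d matrix per layer.
d, r, layers = 8192, 64, 80

per_matrix = 2 * d * r            # B (d x r) + A (r x d) = 1,048,576
total_lora = per_matrix * layers  # 83,886,080 ~= 84M
fraction = total_lora / 70e9      # ~0.0012 -> ~0.12% of 70B

print(f"{total_lora:,} trainable params ({fraction:.2%} of 70B)")
```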

Exercises

ExerciseCore

Problem

A LLaMA 2 70B model has 70 billion parameters stored in float16 (2 bytes per parameter). How much GPU memory is needed to load the model for inference? If you quantize to 4-bit (0.5 bytes per parameter), how much memory is needed? How many consumer GPUs with 24GB VRAM would you need in each case?

ExerciseAdvanced

Problem

LoRA approximates a weight update $\Delta W \in \mathbb{R}^{d \times d}$ as $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$. For LLaMA 2 70B with hidden dimension $d = 8192$ and rank $r = 16$, compute the number of trainable parameters per weight matrix and the compression ratio. What rank $r$ would you need to represent an arbitrary $\Delta W$ exactly?

References

Canonical:

  • Touvron et al., "LLaMA: Open and Efficient Foundation Language Models" (2023)
  • Touvron et al., "LLaMA 2: Open Foundation and Fine-Tuned Chat Models" (2023)
  • Dubey et al., "The Llama 3 Herd of Models" (2024), arXiv:2407.21783
  • Liu et al., "DeepSeek-V3 Technical Report" (2024), arXiv:2412.19437
  • DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025), arXiv:2501.12948
  • Jiang et al., "Mixtral of Experts" (2024), arXiv:2401.04088

Current:

  • Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2022), ICLR
  • Dettmers et al., "QLoRA: Efficient Finetuning of Quantized Language Models" (2023), NeurIPS
  • Jiang et al., "Mistral 7B" (2023)
  • Meta AI, "Llama 4" announcement (April 5, 2025), llama.com/models/llama-4
  • Qwen Team, "Qwen3" (April 28, 2025), qwenlm.github.io/blog/qwen3
  • z.ai, "GLM-5" (February 2026), huggingface.co/zai-org/GLM-5


Last reviewed: April 29, 2026
