Paper breakdown
LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu et al. · 2021 · ICLR 2022
Freezes the pretrained weight matrix W and learns a low-rank update BA at fine-tune time, where rank(BA) is typically 4-32 vs. thousands. Cuts trainable parameters by three to four orders of magnitude with no inference-time penalty.
Overview
Hu et al. (2021) observed that the weight updates produced by full fine-tuning of a pretrained transformer are approximately low rank. They proposed: instead of learning the full update $\Delta W \in \mathbb{R}^{d \times k}$ for a pretrained weight $W_0$, fix $W_0$ and learn $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. For typical attention projections in a 7B-parameter model, $r$ between 4 and 16 captures most of the fine-tune signal at a small fraction of the parameter count.
The architectural implication is small. The training implication is large: optimizer state for $A$ and $B$ fits on a consumer GPU even when state for $W_0$ does not. The deployment implication is also small, in the right direction: at inference time the merged weight $W = W_0 + BA$ is a single matrix, so there is no extra forward-pass cost; the LoRA-tuned model is functionally equivalent to a full fine-tune under the rank-$r$ constraint.
LoRA became, alongside QLoRA (Dettmers et al., 2023), the standard recipe for adapting open-weight foundation models. By 2024 most public fine-tuning recipes are LoRA-flavored; HuggingFace's PEFT library is built around it. The paper is also a clean example of an empirical observation (low-rank fine-tune updates) translating directly into a method.
Mathematical Contributions
The decomposition
For a pretrained linear layer with weights $W_0 \in \mathbb{R}^{d \times k}$, full fine-tuning replaces $W_0$ with $W_0 + \Delta W$, where every entry of $\Delta W$ is trainable, requiring $dk$ parameters. LoRA factors $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, requiring $r(d + k)$ parameters. The forward pass is:

$$h = W_0 x + BAx,$$

with $W_0$ frozen and only $A$ and $B$ trained. For attention layers in a 7B-parameter LLaMA, taking $r = 8$ on the adapted projections gives on the order of 0.1% of the original parameters as trainable.
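To make the arithmetic concrete, a quick count for a single adapted projection matrix (dimensions chosen to match a LLaMA-7B-style attention projection; $r = 8$ is illustrative):

```python
# Parameter count for one adapted projection matrix (LLaMA-7B-style, d = k = 4096).
d, k, r = 4096, 4096, 8

full_ft = d * k        # every entry of delta W trainable
lora    = r * (d + k)  # B is d x r, A is r x k

print(f"full fine-tune: {full_ft:,} params per matrix")                     # 16,777,216
print(f"LoRA (r={r}):    {lora:,} params per matrix ({lora/full_ft:.2%})")  # 65,536 (0.39%)
```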
Initialization
The paper initializes $A$ from a small random Gaussian and sets $B = 0$. Then $\Delta W = BA = 0$ at the start, so the network is exactly equivalent to the pretrained model on step 0. This is the LoRA equivalent of "start from a clean fine-tune" and is critical: any nonzero initialization for both factors would perturb the model before training.
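A minimal sketch of the initialization and the step-0 equivalence it buys (shapes are illustrative):

```python
import torch

d, k, r = 64, 64, 8
W0 = torch.randn(d, k)          # frozen pretrained weight
A  = torch.randn(r, k) * 0.01   # small random Gaussian, as in the paper
B  = torch.zeros(d, r)          # zeros, so BA = 0 before any update

x = torch.randn(k)
# At step 0 the adapted layer is exactly the pretrained layer.
assert torch.allclose(W0 @ x, (W0 + B @ A) @ x)
```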
Scaling factor
The full update is written:

$$W = W_0 + \frac{\alpha}{r} BA,$$

where $\alpha$ is a fixed scalar (the update is scaled by $\alpha / r$ so that its magnitude stays roughly constant as $r$ varies). The factor decouples the effective learning rate from the rank, so changing $r$ does not require re-tuning the optimizer. In practice $\alpha = 16$ is a common default; setting $\alpha = 2r$ is also common.
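Putting the decomposition, initialization, and scaling together, a LoRA linear layer can be sketched as follows. This is a toy illustration, not the PEFT implementation; the defaults $r = 8$, $\alpha = 16$ are assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base.requires_grad_(False)            # freeze W0 (and bias)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # small Gaussian init
        self.B = nn.Parameter(torch.zeros(d, r))          # zero init: BA = 0 at step 0
        self.scaling = alpha / r                          # fixed scalar, not trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha / r) B A x; the low-rank path never materializes BA.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 65536
```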
Where to apply it
The paper applies LoRA to the query ($W_q$) and value ($W_v$) projections in attention layers and reports that adapting only those two is sufficient to match full fine-tune performance on the GLUE and instruction-following benchmarks they evaluate. Adding LoRA to the key and output projections ($W_k$, $W_o$) and to MLP layers gives marginal improvement. The core experimental claim is that the attention projections are the right adaptation surface; the MLP is closer to a feature library that fine-tuning rarely needs to overwrite.
This claim does not survive cleanly into the 2023+ era of large-scale instruction-tuning and tool-use fine-tuning, where LoRA on all linear layers (including MLP) is the more common configuration. The original ablation was at small scale and on academic benchmarks.
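In HuggingFace PEFT this choice surfaces as `target_modules`. A sketch of both configurations; the model id is illustrative, and the `"all-linear"` shorthand assumes a recent PEFT version:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative

# Paper-style: adapt only the query and value projections.
paper_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])

# Common post-2023 recipe: adapt every linear layer, MLP included.
modern_cfg = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")

model = get_peft_model(base, paper_cfg)
model.print_trainable_parameters()
```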
The intrinsic-rank hypothesis
Section 7 of the paper measures the singular values of $\Delta W$ from a full fine-tune and finds they decay rapidly: the top 8 singular values capture most of the Frobenius norm. This motivates the rank constraint as a structural prior, not just a parameter budget. The paper credits Aghajanyan, Zettlemoyer, and Gupta (2021) for the prior observation that fine-tuning operates in a low intrinsic dimension of the parameter space.
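The measurement is straightforward to reproduce given a $\Delta W$; here a synthetic low-rank-plus-noise matrix stands in for a real full fine-tune delta:

```python
import torch

# Synthetic stand-in for a full fine-tune's delta W: low-rank signal plus noise.
d, k, signal_rank = 1024, 1024, 8
dW = torch.randn(d, signal_rank) @ torch.randn(signal_rank, k) \
     + 0.01 * torch.randn(d, k)

s = torch.linalg.svdvals(dW)                   # singular values, descending
energy = (s[:8] ** 2).sum() / (s ** 2).sum()   # ||dW||_F^2 = sum of s_i^2
print(f"top-8 singular values carry {energy:.1%} of the squared Frobenius norm")
```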
The hypothesis is a useful prior but not a theorem. There are tasks (notably some math reasoning and out-of-distribution adaptation) where higher rank or full fine-tune is measurably better. LoRA is a strong default, not a universal substitute.
The merge at inference
After training, $W = W_0 + \frac{\alpha}{r} BA$ is computed once and this single matrix replaces $W_0$ in the deployed model. The forward pass at inference has no extra layer, no extra memory load, no extra FLOPs; it is bit-for-bit identical to a model with $W$ as its only weight. This is the property that distinguishes LoRA from prefix tuning, prompt tuning, adapters, and IA³, which all add inference-time cost.
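The merge is a one-time offline weight update (PEFT exposes the same operation as `merge_and_unload()`); a minimal sketch:

```python
import torch

@torch.no_grad()
def merge_lora(W0: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
               alpha: float, r: int) -> torch.Tensor:
    """Fold the low-rank update into the base weight, once, before deployment."""
    return W0 + (alpha / r) * (B @ A)

# The deployed layer then holds one d x k matrix: an ordinary linear layer
# with no adapter code path and no extra FLOPs.
```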
Memory savings during training
The dominant memory cost in fine-tuning a large model is optimizer state: Adam stores two fp32 floats (first and second moment) per trainable parameter, so 8 bytes per param. A 7B model fully fine-tuned needs ~56 GB of optimizer state alone. With LoRA at $r = 16$ on the attention projections, the trainable count drops to ~10M parameters and the optimizer state to ~80 MB. The forward-pass activations still go through the full $W_0$, but those are recomputable with activation checkpointing.
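The arithmetic, assuming the ~10M trainable-parameter count above:

```python
BYTES_PER_PARAM = 8        # two fp32 Adam moments per trainable parameter

full_params = 7e9          # full fine-tune: every weight trainable
lora_params = 10e6         # ~10M LoRA parameters (assumed, as above)

print(f"full fine-tune optimizer state: {full_params * BYTES_PER_PARAM / 1e9:.0f} GB")  # 56 GB
print(f"LoRA optimizer state:           {lora_params * BYTES_PER_PARAM / 1e6:.0f} MB")  # 80 MB
```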
QLoRA (Dettmers et al., 2023) extends this by quantizing $W_0$ to 4-bit while keeping $A$ and $B$ at fp16 or bf16. The combination fits a 65B-parameter fine-tune on a single 48 GB GPU.
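With HuggingFace tooling, the QLoRA recipe looks roughly like the sketch below; the model id is illustrative and argument defaults vary by library version:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4, from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,   # LoRA factors and matmuls in bf16
    bnb_4bit_use_double_quant=True,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # illustrative model id
    quantization_config=bnb,
)
model = get_peft_model(base, LoraConfig(r=16, lora_alpha=32))
```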
Connections to TheoremPath Topics
- Fine-tuning and adaptation — the broader landscape of full fine-tune vs PEFT vs prompt-tuning.
- Transformer architecture — the architecture LoRA modifies.
- Attention mechanism theory — the projections that LoRA targets.
- Knowledge distillation — the alternative approach to specialization (smaller separate model rather than smaller delta).
Why It Matters Now
LoRA changed who can fine-tune. Before LoRA, adapting a 13B-parameter model required a multi-GPU node with ~150 GB of GPU memory; after LoRA + QLoRA, the same task fits on one consumer 24 GB card. This shifted research and product development away from the largest labs and made the open-weight ecosystem (LLaMA, Mistral, Qwen, DeepSeek, etc.) usable for downstream applications.
The structural prior (fine-tune updates are low rank) has held up across a wide range of tasks but is not universal. For tasks that genuinely require new capabilities (long-horizon math, tool use, in-context retrieval over very long contexts), full fine-tuning or higher-rank LoRA outperforms small-$r$ LoRA. The 2024 picture is that LoRA at moderate rank on all linear layers (attention and MLP) is the practical default; the original low-rank adaptation of $W_q$ and $W_v$ only is no longer the recommended setting for instruction tuning.
The paper also planted the question of whether the pretrained model itself operates near a low-dimensional manifold of useful directions, which the fine-tune just tilts along. Some interpretability work (Anthropic's superposition line, sparse-autoencoder feature decomposition) is a related angle on the same observation: the active dimension of the model is much smaller than the parameter count.
References
Canonical:
- Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR. arXiv:2106.09685.
Direct precursors:
- Aghajanyan, A., Zettlemoyer, L., & Gupta, S. (2021). "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning." ACL. arXiv:2012.13255. The empirical observation that fine-tuning operates in low intrinsic dimension.
- Houlsby, N. et al. (2019). "Parameter-Efficient Transfer Learning for NLP." ICML. arXiv:1902.00751. Adapter modules — the prior PEFT method.
- Li, X. L., & Liang, P. (2021). "Prefix-Tuning: Optimizing Continuous Prompts for Generation." ACL. arXiv:2101.00190. Trainable virtual prompt tokens.
Direct descendants:
- Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." NeurIPS. arXiv:2305.14314. 4-bit quantized base + LoRA adapter; fits 65B fine-tune on one GPU.
- Liu, S.-Y. et al. (2024). "DoRA: Weight-Decomposed Low-Rank Adaptation." ICML. arXiv:2402.09353. Decomposes $W$ into magnitude and direction; applies the low-rank update only to the direction component.
- Hayou, S., Ghosh, N., & Yu, B. (2024). "LoRA+: Efficient Low Rank Adaptation of Large Models." ICML. arXiv:2402.12354. Sets a different learning rate for $A$ and $B$.
Empirical evaluations:
- Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., & Raffel, C. (2022). "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning." NeurIPS. arXiv:2205.05638. Compares PEFT methods including LoRA at scale.
- Biderman, D. et al. (2024). "LoRA Learns Less and Forgets Less." arXiv:2405.09673. LoRA preserves base capabilities better than full fine-tune.
Standard textbook:
- Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press. Chapter 9.5 — fine-tuning and adapters.
Last reviewed: May 5, 2026