LLM Construction
Model Merging and Weight Averaging
Combining trained models by averaging or interpolating their weights: SWA, SLERP, TIES-Merging, DARE. Why it works (loss landscape mode connectivity), when it fails, and applications to combining specialized models.
Prerequisites: transformer architecture.
Why This Matters
Training a large model is expensive. If you have two models trained for different tasks (one for code, one for math), can you combine them into a single model that does both, without retraining from scratch? Model merging says yes, sometimes, by directly combining their weights.
This works because of a surprising property of neural network loss landscapes: models trained from the same initialization (or fine-tuned from the same base model) often lie in the same loss basin, connected by a path of low loss. Averaging their weights produces a model that lands in this basin and retains capabilities from both.
Model merging is cheap, fast, and requires no additional training data. It has become a standard tool for creating capable open-weight models by combining specialized fine-tunes.
Weight Averaging Basics
Weight Averaging
Given two models with weights $\theta_1$ and $\theta_2$, the simplest merge is linear interpolation:

$$\theta_{\text{merged}} = (1 - \lambda)\,\theta_1 + \lambda\,\theta_2$$

where $\lambda \in [0, 1]$ controls the mixing ratio. When $\lambda = 0.5$, this is a uniform average. This can be extended to $N$ models:

$$\theta_{\text{merged}} = \sum_{i=1}^{N} w_i\,\theta_i, \qquad \sum_{i=1}^{N} w_i = 1$$
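The convex combination above is a one-liner on weight tensors. A minimal numpy sketch with toy three-parameter "models" (the arrays and function name are illustrative, not from any library):

```python
import numpy as np

def lerp_merge(thetas, weights=None):
    """Linearly combine a list of weight arrays (uniform average by default)."""
    if weights is None:
        weights = [1.0 / len(thetas)] * len(thetas)
    assert abs(sum(weights) - 1.0) < 1e-9, "mixing weights must sum to 1"
    return sum(w * t for w, t in zip(weights, thetas))

theta1 = np.array([1.0, 0.0, 2.0])
theta2 = np.array([0.0, 2.0, 2.0])
merged = lerp_merge([theta1, theta2])            # lambda = 0.5
assert np.allclose(merged, [0.5, 1.0, 2.0])
```

In a real merge the same operation is applied tensor-by-tensor over the full state dict.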
The obvious question: why would averaging weights produce a good model? Averaging predictions (ensembling) has theoretical justification through bias-variance decomposition. But averaging weights is different. Two models can have identical loss but very different weight configurations, and their average might have terrible loss.
Why Merging Works: Mode Connectivity
Linear Mode Connectivity for Fine-tuned Models
Statement
Let $\theta_0$ be a pretrained model, and let $\theta_1$ and $\theta_2$ be two models obtained by fine-tuning $\theta_0$ on different data or with different hyperparameters. Under typical fine-tuning conditions (moderate learning rate, limited epochs), the linear path between $\theta_1$ and $\theta_2$:

$$\theta(\lambda) = (1 - \lambda)\,\theta_1 + \lambda\,\theta_2, \qquad \lambda \in [0, 1]$$

has loss that remains close to the loss of the endpoints:

$$\mathcal{L}(\theta(\lambda)) \le \max\{\mathcal{L}(\theta_1), \mathcal{L}(\theta_2)\} + \epsilon$$

for small $\epsilon$. That is, the linear interpolation does not pass through a high-loss barrier.
Intuition
Fine-tuning from a shared initialization makes small adjustments to the weights. Both fine-tuned models stay in the same "valley" of the loss landscape. The straight line between them stays in the valley. If the models had been trained from scratch with different random initializations, they would likely be in different valleys separated by high-loss barriers, and averaging would produce a model on the barrier.
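The "stays in the valley" claim can be checked numerically by sweeping the interpolation coefficient and measuring the loss barrier. A minimal sketch, with toy analytic losses standing in for a real network's loss:

```python
import numpy as np

def loss_barrier(loss_fn, theta1, theta2, n=21):
    """Max loss along the linear path, minus the worse endpoint's loss.
    A near-zero barrier indicates linear mode connectivity."""
    lambdas = np.linspace(0.0, 1.0, n)
    path = [loss_fn((1 - l) * theta1 + l * theta2) for l in lambdas]
    return max(path) - max(loss_fn(theta1), loss_fn(theta2))

# Same basin (one convex bowl): no barrier along the path.
bowl = lambda th: float(np.sum(th ** 2))
assert loss_barrier(bowl, np.array([1.0, 0.0]), np.array([0.0, 1.0])) <= 1e-9

# Different basins (double well): the midpoint sits on a high-loss ridge.
well = lambda th: float(min(np.sum((th - 1) ** 2), np.sum((th + 1) ** 2)))
assert loss_barrier(well, np.ones(2), -np.ones(2)) > 1.0
```

For actual models, `loss_fn` would evaluate validation loss at the interpolated state dict.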
Why It Matters
Mode connectivity is the theoretical justification for naive model merging. Without it, weight averaging would be unprincipled. The simplest sufficient condition is shared initialization: models must start from the same pretrained weights. This is why linear merging works well for fine-tuned variants of the same base model (e.g., two LoRA fine-tunes of Llama 3) and why it fails by default for independently trained models. Ainsworth et al. (2023, Git Re-Basin) show that independently trained SGD runs land in basins related by hidden-unit permutation symmetries: after permutation alignment, linear interpolation between the aligned models also stays in a low-loss region. Permutation alignment is therefore a second sufficient condition for merging.
Failure Mode
Naive linear mode connectivity breaks when fine-tuning is too aggressive (high learning rate, many epochs), pushing models into different basins. It also breaks when the two models are trained from different initializations and have not been aligned: independently pretrained models (e.g., Llama 3 and Mistral 7B) occupy different regions of weight space with permuted internal representations, so direct averaging produces poor results. They may still be mergeable in principle after permutation alignment, but only when the architectures match; models with different tokenizers, widths, or depths fall outside even the Git Re-Basin setting.
Stochastic Weight Averaging (SWA)
Stochastic Weight Averaging
Statement
Stochastic Weight Averaging (Izmailov et al., 2018) averages the weights visited by SGD during training:

$$\theta_{\text{SWA}} = \frac{1}{K} \sum_{k=1}^{K} \theta_k$$

where $\theta_k$ are the weights at the end of each training epoch (or at regular intervals). With a cyclic or sufficiently high constant learning rate, SGD explores the periphery of a flat loss basin. SWA moves toward the center of this basin, producing a solution with:
- Lower loss on validation data
- Broader minima (flatter curvature)
- Better calibration
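SWA is usually implemented as a running average so that only one extra copy of the weights is stored. A minimal sketch (toy checkpoint arrays, illustrative function name):

```python
import numpy as np

def swa_average(checkpoints):
    """Running average of checkpoint weights, as SWA maintains it:
    theta_swa <- (k * theta_swa + theta_k) / (k + 1)."""
    avg = np.zeros_like(checkpoints[0])
    for k, theta in enumerate(checkpoints):
        avg = (k * avg + theta) / (k + 1)
    return avg

# Checkpoints scattered around the basin edge; their average is the center.
ckpts = [np.array([1.0, 3.0]), np.array([3.0, 1.0]), np.array([2.0, 2.0])]
assert np.allclose(swa_average(ckpts), [2.0, 2.0])
```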
Intuition
SGD with moderate learning rate bounces around a flat region of the loss landscape. Each checkpoint is near the edge of the basin. Averaging many checkpoints produces a point near the center. The center of a flat basin generalizes better because small perturbations to the weights (which correspond to changes in data distribution) cause smaller changes in loss.
Why It Matters
SWA is the simplest model merging method and requires no additional training or data. You just save checkpoints during a single training run and average them. It consistently improves generalization over the final SGD iterate and is nearly free in compute cost.
Failure Mode
SWA assumes the checkpoints are in the same basin. If the learning rate is too high and SGD jumps between basins, averaging can land on a high-loss barrier between them. SWA also requires the model to be in a low-loss region already; it does not help with convergence from a bad initialization.
SLERP: Spherical Linear Interpolation
SLERP for Model Weights
Spherical linear interpolation treats weight vectors as points on a hypersphere and interpolates along the great circle:

$$\text{slerp}(\theta_1, \theta_2; \lambda) = \frac{\sin((1-\lambda)\,\Omega)}{\sin \Omega}\,\theta_1 + \frac{\sin(\lambda\,\Omega)}{\sin \Omega}\,\theta_2$$

where $\Omega = \arccos\!\left(\frac{\theta_1 \cdot \theta_2}{\|\theta_1\|\,\|\theta_2\|}\right)$ is the angle between the weight vectors.

Domain restriction. The formula requires $\sin \Omega \neq 0$, i.e. $\theta_1$ and $\theta_2$ are neither parallel nor antiparallel. When $\Omega \approx 0$ (nearly parallel), $\sin \Omega \approx 0$ and the formula is numerically unstable. In that regime, fall back to linear interpolation (LERP), which is the limit of SLERP as $\Omega \to 0$.

Norm preservation. SLERP preserves the norm only when $\|\theta_1\| = \|\theta_2\|$ (typically both are normalized to unit vectors, or both layers have been rescaled to a common norm). For general vectors with different norms, SLERP interpolates direction on the sphere but does not automatically preserve magnitude.

Why SLERP over linear interpolation? For unit-norm weight vectors, linear interpolation shrinks the result's norm: $\|(1-\lambda)\,\theta_1 + \lambda\,\theta_2\| \le 1$, with equality only when the vectors are parallel. SLERP preserves unit norm along the sphere. For neural network weights where the scale of activations matters, this can produce better results than linear interpolation.
In practice, SLERP is applied layer-by-layer or parameter-group-by-group rather than to the full weight vector, and layers are often renormalized to matched norms before interpolation. It is the default merging method in community tools like mergekit.
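A minimal numpy sketch of SLERP with the LERP fallback for the near-parallel case (the function name and tolerance are illustrative, not mergekit's actual implementation):

```python
import numpy as np

def slerp(theta1, theta2, lam, eps=1e-7):
    """Spherical interpolation of two weight vectors; falls back to
    linear interpolation when the vectors are nearly parallel."""
    v1 = theta1 / np.linalg.norm(theta1)
    v2 = theta2 / np.linalg.norm(theta2)
    cos_omega = np.clip(np.dot(v1, v2), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    if np.sin(omega) < eps:                  # sin(omega) ~ 0: use LERP
        return (1 - lam) * theta1 + lam * theta2
    s1 = np.sin((1 - lam) * omega) / np.sin(omega)
    s2 = np.sin(lam * omega) / np.sin(omega)
    return s1 * theta1 + s2 * theta2

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mid = slerp(a, b, 0.5)
assert np.isclose(np.linalg.norm(mid), 1.0)  # unit norm preserved on the sphere
```

Compare the linear midpoint of the same vectors, whose norm shrinks to about 0.707.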
Task Arithmetic
Task Arithmetic
Task Arithmetic (Ilharco et al., 2023) treats the difference between a fine-tuned model and its base as a portable "task vector":

$$\tau_i = \theta_i^{\text{ft}} - \theta_0$$

Multiple task vectors can be composed linearly to add, subtract, or scale capabilities:

$$\theta_{\text{merged}} = \theta_0 + \sum_i \lambda_i\,\tau_i$$

Positive $\lambda_i$ adds a task capability, negative $\lambda_i$ subtracts (forgets) it, and the magnitude of $\lambda_i$ controls strength. This is the algebraic scaffolding that TIES and DARE refine by first sparsifying or sign-resolving the $\tau_i$.
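Task arithmetic is a few lines over state dicts. A minimal sketch with toy two-parameter "models" (the hypothetical `code`/`math` fine-tunes are illustrative):

```python
import numpy as np

def task_arithmetic(theta_base, finetuned, lambdas):
    """theta_merged = theta_base + sum_i lambda_i * (theta_i - theta_base)."""
    merged = theta_base.astype(float).copy()
    for theta_i, lam in zip(finetuned, lambdas):
        merged += lam * (theta_i - theta_base)
    return merged

base = np.array([0.0, 0.0])
code = np.array([1.0, 0.0])   # hypothetical code fine-tune
math = np.array([0.0, 1.0])   # hypothetical math fine-tune
merged = task_arithmetic(base, [code, math], [1.0, 1.0])
assert np.allclose(merged, [1.0, 1.0])
# A negative coefficient subtracts (forgets) that capability:
assert np.allclose(task_arithmetic(base, [code], [-1.0]), [-1.0, 0.0])
```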
Fisher-Weighted Merging and RegMean
Fisher-Weighted Merging
Matena and Raffel (2022) weight each parameter $j$ by the diagonal of the Fisher information matrix $F_i$ of model $i$, which estimates how much the model's output distribution changes when that parameter is perturbed:

$$\theta_{\text{merged}}^{(j)} = \frac{\sum_i F_i^{(j)}\,\theta_i^{(j)}}{\sum_i F_i^{(j)}}$$
Parameters that matter more for a model's task (higher Fisher diagonal) get more weight in the merge. This improves over uniform averaging when the models disagree sharply about which parameters are important.
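The merge itself is a per-parameter weighted average once the Fisher diagonals have been estimated. A minimal sketch (the Fisher values here are made up to show the effect; in practice they come from squared gradients of the log-likelihood):

```python
import numpy as np

def fisher_merge(thetas, fishers, eps=1e-8):
    """Per-parameter weighted average, weights = diagonal Fisher estimates."""
    num = sum(f * t for f, t in zip(fishers, thetas))
    den = sum(fishers) + eps              # eps guards positions with zero Fisher
    return num / den

t1, t2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
f1 = np.array([10.0, 1.0])   # model 1's task depends mostly on parameter 0
f2 = np.array([1.0, 10.0])   # model 2's task depends mostly on parameter 1
merged = fisher_merge([t1, t2], [f1, f2])
# Each parameter is pulled toward the model that depends on it most:
assert merged[0] > 0.9 and merged[1] > 0.9
```

With uniform Fisher values this reduces to the plain average, which is the right sanity check for an implementation.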
RegMean
Jin et al. (2023) cast merging as a closed-form least-squares problem per linear layer. For a linear layer with input activations $X_i$ having second-moment (Gram) matrix $G_i = X_i^\top X_i$ on model $i$'s task, the merged weight matrix solves:

$$W_{\text{merged}} = \arg\min_W \sum_i \|X_i W - X_i W_i\|_F^2 = \Big(\sum_i G_i\Big)^{-1} \sum_i G_i W_i$$

This is "dataless" in the sense that the $G_i$ are precomputed statistics, not a training loop. RegMean outperforms Fisher-weighted merging when the input distributions across models differ substantially.
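The closed form is one linear solve per layer. A minimal sketch with random stand-ins for the per-layer weights and activation statistics:

```python
import numpy as np

def regmean_merge(weights, grams):
    """Closed-form least-squares merge for one linear layer:
    W = (sum_i G_i)^{-1} sum_i G_i W_i, with G_i = X_i^T X_i."""
    G_sum = sum(grams)
    rhs = sum(G @ W for G, W in zip(grams, weights))
    return np.linalg.solve(G_sum, rhs)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))   # layer weights (in x out)
X1, X2 = rng.normal(size=(16, 4)), rng.normal(size=(16, 4)) # activations per task
G1, G2 = X1.T @ X1, X2.T @ X2
W = regmean_merge([W1, W2], [G1, G2])
# Sanity check: identical Gram matrices reduce RegMean to the plain average.
assert np.allclose(regmean_merge([W1, W2], [G1, G1]), (W1 + W2) / 2)
```

In practice one would regularize $G_i$ toward its diagonal (as the paper does) to keep the solve well-conditioned; that detail is omitted here.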
TIES-Merging
TIES-Merging
TIES-Merging (Yadav et al., 2023) addresses a problem with naive averaging: when merging multiple models, task-specific weight changes from different models can cancel each other out if they point in opposite directions.
The TIES algorithm has three steps:

1. Trim: For each model, compute the task vector $\tau_i = \theta_i - \theta_0$ (the change from the base model). Keep the top-$k$% of entries by absolute magnitude (Yadav et al. use $k = 20$ as the default) and zero out the rest. This is a per-model top-$k$ selection, not a fixed threshold.

2. Elect sign: For each weight position, take a magnitude-weighted majority vote across models on whether the change should be positive or negative. This resolves sign conflicts.

3. Disjoint merge: For each position, average only the entries whose sign matches the elected sign. Positions with no matching entries are set to zero. The final model is

$$\theta_{\text{merged}} = \theta_0 + \lambda\,\tau_{\text{TIES}}$$

where $\lambda$ is a scaling factor.
The key insight: naive averaging treats all weight changes equally. TIES recognizes that some changes are noise (small magnitude) and some conflict (opposite signs across models). By keeping only the top-k% most important entries per model and resolving sign conflicts before merging, TIES produces cleaner combinations.
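The three steps above can be sketched in a few lines of numpy. This is a simplified illustration (flat weight vectors, quantile-based trim) rather than the paper's reference implementation:

```python
import numpy as np

def ties_merge(theta_base, finetuned, k=0.2, lam=1.0):
    """TIES sketch: trim each task vector to its top-k fraction of entries
    by magnitude, elect a per-position sign by magnitude-weighted vote,
    then average only the entries that agree with the elected sign."""
    taus = np.stack([t - theta_base for t in finetuned]).astype(float)
    # 1. Trim: per-model top-k selection by absolute magnitude.
    for tau in taus:                       # rows are views; edits hit `taus`
        cutoff = np.quantile(np.abs(tau), 1.0 - k)
        tau[np.abs(tau) < cutoff] = 0.0
    # 2. Elect sign: sign of the magnitude-weighted sum across models.
    elected = np.sign(taus.sum(axis=0))
    # 3. Disjoint merge: average entries matching the elected sign.
    agree = (np.sign(taus) == elected) & (taus != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_tau = (taus * agree).sum(axis=0) / counts
    return theta_base + lam * merged_tau

base = np.zeros(4)
t1 = np.array([1.0, 0.5, 0.0, 0.01])   # tiny entry at position 3 gets trimmed
t2 = np.array([-0.2, 0.5, 0.9, 0.0])   # sign conflict at position 0
assert np.allclose(ties_merge(base, [t1, t2], k=0.5), [1.0, 0.5, 0.9, 0.0])
```

Note how position 0 keeps only model 1's value: model 2's conflicting $-0.2$ was trimmed, so no cancellation occurs.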
DARE: Drop and Rescale
DARE
DARE (Yu et al., 2024) takes a more aggressive approach to sparsification before merging:

- Compute task vectors $\tau_i = \theta_i - \theta_0$ for each model.
- For each task vector, randomly drop a fraction $p$ of the entries (set them to zero). DARE uses high drop rates, typically $p \in [0.9, 0.99]$.
- Rescale the remaining entries by $\frac{1}{1 - p}$ to preserve the expected magnitude.
- Merge the sparsified task vectors by averaging.

$$\tilde{\tau}_i = \frac{m_i \odot \tau_i}{1 - p}$$

where $m_i$ is a binary mask with entries drawn i.i.d. Bernoulli($1 - p$).
Why this works: DARE's premise is that fine-tuning task vectors are highly redundant. Most entries contribute little to the task-specific capability, so dropping 90-99% of them and rescaling preserves the expected contribution while sharply reducing interference between models during merging. Yu et al. show that drop rates up to $p = 0.99$ retain most of the fine-tuned capability on language tasks.
DARE and TIES can be combined: apply DARE's random dropping, then TIES's sign election, then merge.
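Drop-and-rescale is two lines per task vector; the rescaling is exactly the trick used in inverted dropout. A minimal sketch:

```python
import numpy as np

def dare(tau, p, rng):
    """Drop a fraction p of task-vector entries; rescale survivors by 1/(1-p)
    so the expected value of each entry is unchanged."""
    mask = rng.random(tau.shape) >= p     # keep each entry with probability 1 - p
    return tau * mask / (1.0 - p)

rng = np.random.default_rng(0)
tau = rng.normal(size=100_000)
sparse = dare(tau, p=0.9, rng=rng)
assert np.mean(sparse == 0) > 0.85        # roughly 90% of entries dropped
# Rescaling preserves the mean contribution of the task vector:
assert abs(sparse.mean() - tau.mean()) < 0.05
```

The sparsified $\tilde{\tau}_i$ are then combined with plain averaging or task arithmetic.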
Evolutionary Model Merge
Evolutionary Model Merge
Akiba et al. (2024, Sakana AI) search for per-layer merge coefficients using evolutionary optimization (CMA-ES). The search space combines two axes:
- Parameter space (PS) merging: per-layer mixing weights $\lambda_\ell$ for each layer $\ell$, applied to task vectors or SLERP.
- Data-flow space (DFS) merging: a discrete per-layer routing that selects which source model's layer to use at each depth.
The fitness is a downstream-task evaluation on a held-out set. This removes the need to hand-tune global and per-layer merge weights, at the cost of requiring evaluation compute during the search.
Aligning Independently Trained Models
Git Re-Basin
Ainsworth, Hayase, and Srinivasa (2023) formalize the permutation symmetry of neural networks: for any permutation $P_\ell$ applied to the hidden units of layer $\ell$, there is an equivalent model obtained by replacing $W_\ell$ with $P_\ell W_\ell$ and $W_{\ell+1}$ with $W_{\ell+1} P_\ell^\top$, giving identical input-output behaviour. Two independently trained models occupy weight-space points related by (approximately) such permutations.
Weight matching and activation matching algorithms search for permutations that align one model's hidden units to the other's. After alignment, the linear interpolation between the two models stays in a low-loss region on vision benchmarks, recovering linear mode connectivity for independently trained models. Related follow-ups (ZipIt!, Stoica et al. 2023; Model Breadcrumbs, Davari and Belilovsky 2024) extend this idea to merging models without a shared base or to sparsifying task vectors for more robust composition.
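Weight matching can be illustrated on a single layer: find the permutation of model B's hidden units whose incoming weight rows best match model A's. Git Re-Basin solves this as an optimal assignment problem; the greedy pass below is a deliberate simplification for illustration:

```python
import numpy as np

def match_units(W_a, W_b):
    """Greedy weight matching: for each hidden unit of A, pick the unused
    unit of B whose incoming weight row has the largest dot product."""
    sim = W_a @ W_b.T                     # sim[i, j]: unit i of A vs unit j of B
    perm, used = [], set()
    for i in range(W_a.shape[0]):
        j = max((j for j in range(W_b.shape[0]) if j not in used),
                key=lambda j: sim[i, j])
        perm.append(j)
        used.add(j)
    return np.array(perm)

rng = np.random.default_rng(0)
W_a = rng.normal(size=(6, 64))            # 6 hidden units, 64 inputs
true_perm = rng.permutation(6)
W_b = W_a[true_perm]                      # "model B": same units, shuffled order
perm = match_units(W_a, W_b)
assert np.allclose(W_b[perm], W_a)        # alignment undoes the shuffle
```

A full implementation alternates this per-layer matching across all layers (and would use an optimal solver such as the Hungarian algorithm), then interpolates the aligned weights.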
Model Soups
Model Soups
"Model soups" (Wortsman et al., 2022) refers to averaging the weights of models trained with different hyperparameters (learning rate, weight decay, augmentation) on the same task. Instead of selecting the best model by validation performance, you average all models that exceed a quality threshold.
The result consistently outperforms the best individual model because:
- Each hyperparameter setting explores a slightly different part of the basin
- Averaging moves toward the center of the basin (similar to SWA)
- The averaged model is more robust to distribution shift than any individual model
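Wortsman et al.'s "greedy soup" variant adds models to the average one at a time, keeping each only if held-out performance does not drop. A minimal sketch with a toy validation score standing in for real benchmark evaluation:

```python
import numpy as np

def greedy_soup(models, val_score):
    """Greedy soup: visit models in descending validation score; add each
    to the running average only if the averaged model scores at least
    as well as the current soup."""
    order = sorted(models, key=val_score, reverse=True)
    soup, n = order[0], 1
    for theta in order[1:]:
        candidate = (n * soup + theta) / (n + 1)
        if val_score(candidate) >= val_score(soup):
            soup, n = candidate, n + 1
    return soup

# Toy "validation score": negative distance to a hidden optimum.
optimum = np.array([1.0, 1.0])
score = lambda th: -float(np.linalg.norm(th - optimum))
models = [np.array([1.2, 0.8]), np.array([0.8, 1.2]), np.array([3.0, 3.0])]
soup = greedy_soup(models, score)
assert score(soup) >= max(score(m) for m in models)  # soup beats every ingredient
```

Note how the outlier at `[3.0, 3.0]` is rejected: greedy selection is what makes soups robust to a few bad runs.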
Applications to LLM Merging
Model merging has become especially popular in the open-weight LLM community:
Combining specialized fine-tunes. Start with a base model (e.g., Llama 3 8B). Fine-tune separately for code, math, and conversation. Merge the three fine-tunes to get a model that does all three. This is much cheaper than training a single model on all three datasets jointly.
Community model development. Open-weight model communities on Hugging Face routinely merge models to combine capabilities. Tools like mergekit provide SLERP, TIES, DARE, and linear interpolation as merge strategies.
Avoiding catastrophic forgetting. Fine-tuning on a specialized dataset often degrades performance on general tasks (catastrophic forgetting). Merging the fine-tuned model with the base model (with the base model weighted more heavily) preserves general capability while adding specialized skill.
Common Confusions
Merging weights is not the same as ensembling predictions
Ensembling runs multiple models and combines their outputs (by averaging predictions or voting). This always works but requires running all models at inference time. Weight merging produces a single model. It is much cheaper at inference but only works when the models share a loss basin. These are structurally different operations.
Independently trained models are not unmergeable, they are unaligned
A common overclaim is that independently trained models simply cannot be merged. The correct statement is that independently trained SGD runs land in loss basins related by permutation symmetries of the hidden units: the same concept is encoded in different neuron indices across runs, so direct linear averaging mixes incompatible coordinates. Ainsworth, Hayase, and Srinivasa (2023, Git Re-Basin, ICLR) give algorithms (weight matching, activation matching) that recover the permutation aligning one model to another. After alignment, linear interpolation between the two aligned models stays in a low-loss region, so they can be merged. The blanket "cannot merge" claim applies only to naive linear merging without alignment, and only becomes fundamental when the architectures themselves differ (distinct tokenizers, widths, or depths).
SLERP is not always better than linear interpolation
SLERP preserves norm and interpolates along the sphere. For some models and tasks, linear interpolation works just as well or better. The theoretical advantage of SLERP (norm preservation) matters most when the models have similar norms and the interpolation path matters. In practice, try both.
Summary
- Weight averaging works because fine-tuned models share a loss basin (mode connectivity)
- Shared initialization is the simplest sufficient condition; independently trained models can still be merged after permutation alignment (Git Re-Basin)
- SWA: average checkpoints from a single run. Cheap, effective, moves toward basin center
- SLERP: spherical interpolation preserving norm when $\|\theta_1\| = \|\theta_2\|$ and $\sin\Omega \neq 0$. Default in community tools
- Task Arithmetic: merge as $\theta_0 + \sum_i \lambda_i\,\tau_i$ using task vectors $\tau_i = \theta_i - \theta_0$
- Fisher-weighted merging and RegMean: weight parameter contributions by curvature or input covariance
- TIES-Merging: keep top-k% of task-vector entries, elect sign by magnitude-weighted majority, disjoint-merge agreements
- DARE: randomly drop 90-99% of task vector entries, rescale by $\frac{1}{1-p}$, then merge
- Model soups: average models with different hyperparameters. Beats the best individual model
- Evolutionary Model Merge: search over per-layer merge coefficients via CMA-ES
- Applications: combining specialized LLM fine-tunes, reducing catastrophic forgetting
Exercises
Problem
Explain why averaging the weights of two models trained from the same base model is more likely to succeed than averaging the weights of two models trained from different random initializations. Use the concept of mode connectivity.
Problem
In TIES-Merging, the "trim" step keeps the top-k% of entries by magnitude in each task vector before merging. Explain: (a) why small-magnitude entries should be removed, (b) what happens if k is too small (too aggressive trim), and (c) what happens if k is too large (too lenient trim).
Problem
You have a base Llama 3 8B model and three specialized fine-tunes: one for Python code generation, one for mathematical reasoning, and one for biomedical text. Design a merging strategy that produces a single model performing well on all three domains. Specify the merging method, any hyperparameters, and how you would evaluate the result.
References
Canonical:
- Izmailov, Podoprikhin, Garipov, Vetrov, Wilson, "Averaging Weights Leads to Wider Optima and Better Generalization" (SWA, UAI 2018).
- Wortsman et al., "Model Soups: Averaging Weights of Multiple Fine-tuned Models Improves Accuracy without Increasing Inference Time" (ICML 2022).
- Frankle, Dziugaite, Roy, Carbin, "Linear Mode Connectivity and the Lottery Ticket Hypothesis" (ICML 2020). Mode connectivity theory.
- Matena and Raffel, "Merging Models with Fisher-Weighted Averaging" (NeurIPS 2022).
Alignment of independently trained models:
- Ainsworth, Hayase, Srinivasa, "Git Re-Basin: Merging Models modulo Permutation Symmetries" (ICLR 2023).
- Stoica, Bolya, Bjorner, Ramesh, Hearn, Hoffman, "ZipIt! Merging Models from Different Tasks without Training" (2023).
Task-vector methods:
- Ilharco, Ribeiro, Wortsman, Schmidt, Hajishirzi, Farhadi, "Editing Models with Task Arithmetic" (ICLR 2023).
- Yadav, Tam, Choshen, Raffel, Bansal, "TIES-Merging: Resolving Interference When Merging Models" (NeurIPS 2023). Top-$k$% trim with $k = 20$ default.
- Yu, Yu, Yu, Huang, Li, "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch" (DARE, ICML 2024, arXiv:2311.03099). Drop rates up to $p = 0.99$.
- Jin, Ren, Preotiuc-Pietro, Cheng, "Dataless Knowledge Fusion by Merging Weights of Language Models" (RegMean, ICLR 2023).
- Davari and Belilovsky, "Model Breadcrumbs: Scaling Multi-Task Model Merging with Sparse Masks" (2024).
Search-based merging:
- Akiba, Shing, Tang, Sun, Ha, "Evolutionary Optimization of Model Merging Recipes" (Sakana AI, arXiv:2403.13187, 2024).
Next Topics
- Transformer architecture: the base architecture for models being merged
- Parameter-efficient fine-tuning: LoRA and other methods that produce mergeable adapters
Last reviewed: April 27, 2026