LLM Construction
Model Merging and Weight Averaging
Combining trained models by averaging or interpolating their weights: SWA, SLERP, TIES-Merging, DARE. Why it works (loss landscape mode connectivity), when it fails, and applications to combining specialized models.
Prerequisites: transformer architecture.
Why This Matters
Training a large model is expensive. If you have two models trained for different tasks (one for code, one for math), can you combine them into a single model that does both, without retraining from scratch? Model merging says yes, sometimes, by directly combining their weights.
This works because of a surprising property of neural network loss landscapes: models trained from the same initialization (or fine-tuned from the same base model) often lie in the same loss basin, connected by a path of low loss. Averaging their weights produces a model that lands in this basin and retains capabilities from both.
Model merging is cheap, fast, and requires no additional training data. It has become a standard tool for creating capable open-weight models by combining specialized fine-tunes.
Weight Averaging Basics
Weight Averaging
Given two models with weights $\theta_1$ and $\theta_2$, the simplest merge is linear interpolation:

$$\theta_{\text{merged}} = (1 - \lambda)\,\theta_1 + \lambda\,\theta_2$$

where $\lambda \in [0, 1]$ controls the mixing ratio. When $\lambda = 0.5$, this is a uniform average. This can be extended to $N$ models:

$$\theta_{\text{merged}} = \sum_{i=1}^{N} w_i\,\theta_i, \qquad \sum_{i=1}^{N} w_i = 1$$
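The convex combination above is a one-liner on weight tensors. A minimal numpy sketch with toy three-parameter "models" (the arrays and function name are illustrative, not from any library):

```python
import numpy as np

def lerp_merge(thetas, weights=None):
    """Linearly combine a list of weight arrays (uniform average by default)."""
    if weights is None:
        weights = [1.0 / len(thetas)] * len(thetas)
    assert abs(sum(weights) - 1.0) < 1e-9, "mixing weights must sum to 1"
    return sum(w * t for w, t in zip(weights, thetas))

theta1 = np.array([1.0, 0.0, 2.0])
theta2 = np.array([0.0, 2.0, 2.0])
merged = lerp_merge([theta1, theta2])            # lambda = 0.5
assert np.allclose(merged, [0.5, 1.0, 2.0])
```

In a real merge the same operation is applied tensor-by-tensor over the full state dict.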
The obvious question: why would averaging weights produce a good model? Averaging predictions (ensembling) has theoretical justification through bias-variance decomposition. But averaging weights is different. Two models can have identical loss but very different weight configurations, and their average might have terrible loss.
Why Merging Works: Mode Connectivity
Linear Mode Connectivity for Fine-tuned Models
Statement
Let $\theta_0$ be a pretrained model, and let $\theta_1$ and $\theta_2$ be two models obtained by fine-tuning $\theta_0$ on different data or with different hyperparameters. Under typical fine-tuning conditions (moderate learning rate, limited epochs), the linear path between $\theta_1$ and $\theta_2$:

$$\theta(\lambda) = (1 - \lambda)\,\theta_1 + \lambda\,\theta_2, \qquad \lambda \in [0, 1]$$

has loss that remains close to the loss of the endpoints:

$$\mathcal{L}(\theta(\lambda)) \le \max\{\mathcal{L}(\theta_1), \mathcal{L}(\theta_2)\} + \epsilon$$

for small $\epsilon$. That is, the linear interpolation does not pass through a high-loss barrier.
Intuition
Fine-tuning from a shared initialization makes small adjustments to the weights. Both fine-tuned models stay in the same "valley" of the loss landscape. The straight line between them stays in the valley. If the models had been trained from scratch with different random initializations, they would likely be in different valleys separated by high-loss barriers, and averaging would produce a model on the barrier.
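The "stays in the valley" claim can be checked numerically by sweeping the interpolation coefficient and measuring the loss barrier. A minimal sketch, with toy analytic losses standing in for a real network's loss:

```python
import numpy as np

def loss_barrier(loss_fn, theta1, theta2, n=21):
    """Max loss along the linear path, minus the worse endpoint's loss.
    A near-zero barrier indicates linear mode connectivity."""
    lambdas = np.linspace(0.0, 1.0, n)
    path = [loss_fn((1 - l) * theta1 + l * theta2) for l in lambdas]
    return max(path) - max(loss_fn(theta1), loss_fn(theta2))

# Same basin (one convex bowl): no barrier along the path.
bowl = lambda th: float(np.sum(th ** 2))
assert loss_barrier(bowl, np.array([1.0, 0.0]), np.array([0.0, 1.0])) <= 1e-9

# Different basins (double well): the midpoint sits on a high-loss ridge.
well = lambda th: float(min(np.sum((th - 1) ** 2), np.sum((th + 1) ** 2)))
assert loss_barrier(well, np.ones(2), -np.ones(2)) > 1.0
```

For actual models, `loss_fn` would evaluate validation loss at the interpolated state dict.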
Why It Matters
Mode connectivity is the theoretical justification for naive model merging. Without it, weight averaging would be unprincipled. The simplest sufficient condition is shared initialization: models must start from the same pretrained weights. This is why linear merging works well for fine-tuned variants of the same base model (e.g., two LoRA fine-tunes of Llama 3) and why it fails by default for independently trained models. Ainsworth et al. (2023, Git Re-Basin) show that independently trained SGD runs land in basins related by hidden-unit permutation symmetries: after permutation alignment, linear interpolation between the aligned models also stays in a low-loss region. Permutation alignment is therefore a second sufficient condition for merging.
Failure Mode
Naive linear mode connectivity breaks when fine-tuning is too aggressive (high learning rate, many epochs), pushing models into different basins. It also breaks when the two models are trained from different initializations and have not been aligned: independently pretrained models (e.g., Llama 3 and Mistral 7B) occupy different regions of weight space with permuted internal representations, so direct averaging produces poor results. They may still be mergeable in principle after permutation alignment, but only when the architectures match; models with different tokenizers, widths, or depths fall outside even the Git Re-Basin setting.
Stochastic Weight Averaging (SWA)
Stochastic Weight Averaging
Statement
Stochastic Weight Averaging (Izmailov et al., 2018) averages the weights visited by SGD during training:

$$\theta_{\text{SWA}} = \frac{1}{K} \sum_{k=1}^{K} \theta_k$$

where $\theta_k$ are the weights at the end of each training epoch (or at regular intervals). With a cyclic or sufficiently high constant learning rate, SGD explores the periphery of a flat loss basin. SWA moves toward the center of this basin, producing a solution with:
- Lower loss on validation data
- Broader minima (flatter curvature)
- Better calibration
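SWA is usually implemented as a running average so that only one extra copy of the weights is stored. A minimal sketch (toy checkpoint arrays, illustrative function name):

```python
import numpy as np

def swa_average(checkpoints):
    """Running average of checkpoint weights, as SWA maintains it:
    theta_swa <- (k * theta_swa + theta_k) / (k + 1)."""
    avg = np.zeros_like(checkpoints[0])
    for k, theta in enumerate(checkpoints):
        avg = (k * avg + theta) / (k + 1)
    return avg

# Checkpoints scattered around the basin edge; their average is the center.
ckpts = [np.array([1.0, 3.0]), np.array([3.0, 1.0]), np.array([2.0, 2.0])]
assert np.allclose(swa_average(ckpts), [2.0, 2.0])
```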
Intuition
SGD with moderate learning rate bounces around a flat region of the loss landscape. Each checkpoint is near the edge of the basin. Averaging many checkpoints produces a point near the center. The center of a flat basin generalizes better because small perturbations to the weights (which correspond to changes in data distribution) cause smaller changes in loss.
Why It Matters
SWA is the simplest model merging method and requires no additional training or data. You just save checkpoints during a single training run and average them. It consistently improves generalization over the final SGD iterate and is nearly free in compute cost.
Failure Mode
SWA assumes the checkpoints are in the same basin. If the learning rate is too high and SGD jumps between basins, averaging can land on a high-loss barrier between them. SWA also requires the model to be in a low-loss region already; it does not help with convergence from a bad initialization.
SLERP: Spherical Linear Interpolation
SLERP for Model Weights
Spherical linear interpolation treats weight vectors as points on a hypersphere and interpolates along the great circle:

$$\text{slerp}(\theta_1, \theta_2; \lambda) = \frac{\sin((1-\lambda)\,\Omega)}{\sin \Omega}\,\theta_1 + \frac{\sin(\lambda\,\Omega)}{\sin \Omega}\,\theta_2$$

where $\Omega = \arccos\!\left(\frac{\theta_1 \cdot \theta_2}{\|\theta_1\|\,\|\theta_2\|}\right)$ is the angle between the weight vectors.

Domain restriction. The formula requires $\sin \Omega \neq 0$, i.e. $\theta_1$ and $\theta_2$ are neither parallel nor antiparallel. When $\Omega \approx 0$ (nearly parallel), $\sin \Omega \approx 0$ and the formula is numerically unstable. In that regime, fall back to linear interpolation (LERP), which is the limit of SLERP as $\Omega \to 0$.

Norm preservation. SLERP preserves the norm only when $\|\theta_1\| = \|\theta_2\|$ (typically both are normalized to unit vectors, or both layers have been rescaled to a common norm). For general vectors with different norms, SLERP interpolates direction on the sphere but does not automatically preserve magnitude.

Why SLERP over linear interpolation? For unit-norm weight vectors, linear interpolation shrinks the result's norm: $\|(1-\lambda)\,\theta_1 + \lambda\,\theta_2\| \le 1$, with equality only when the vectors are parallel. SLERP preserves unit norm along the sphere. For neural network weights where the scale of activations matters, this can produce better results than linear interpolation.
In practice, SLERP is applied layer-by-layer or parameter-group-by-group rather than to the full weight vector, and layers are often renormalized to matched norms before interpolation. It is the default merging method in community tools like mergekit.
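A minimal numpy sketch of SLERP with the LERP fallback for the near-parallel case (the function name and tolerance are illustrative, not mergekit's actual implementation):

```python
import numpy as np

def slerp(theta1, theta2, lam, eps=1e-7):
    """Spherical interpolation of two weight vectors; falls back to
    linear interpolation when the vectors are nearly parallel."""
    v1 = theta1 / np.linalg.norm(theta1)
    v2 = theta2 / np.linalg.norm(theta2)
    cos_omega = np.clip(np.dot(v1, v2), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    if np.sin(omega) < eps:                  # sin(omega) ~ 0: use LERP
        return (1 - lam) * theta1 + lam * theta2
    s1 = np.sin((1 - lam) * omega) / np.sin(omega)
    s2 = np.sin(lam * omega) / np.sin(omega)
    return s1 * theta1 + s2 * theta2

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mid = slerp(a, b, 0.5)
assert np.isclose(np.linalg.norm(mid), 1.0)  # unit norm preserved on the sphere
```

Compare the linear midpoint of the same vectors, whose norm shrinks to about 0.707.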
Task Arithmetic
Task Arithmetic
Task Arithmetic (Ilharco et al., 2023) treats the difference between a fine-tuned model and its base as a portable "task vector":

$$\tau_i = \theta_i^{\text{ft}} - \theta_0$$

Multiple task vectors can be composed linearly to add, subtract, or scale capabilities:

$$\theta_{\text{merged}} = \theta_0 + \sum_i \lambda_i\,\tau_i$$

Positive $\lambda_i$ adds a task capability, negative $\lambda_i$ subtracts (forgets) it, and the magnitude of $\lambda_i$ controls strength. This is the algebraic scaffolding that TIES and DARE refine by first sparsifying or sign-resolving the $\tau_i$.
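Task arithmetic is a few lines over state dicts. A minimal sketch with toy two-parameter "models" (the hypothetical `code`/`math` fine-tunes are illustrative):

```python
import numpy as np

def task_arithmetic(theta_base, finetuned, lambdas):
    """theta_merged = theta_base + sum_i lambda_i * (theta_i - theta_base)."""
    merged = theta_base.astype(float).copy()
    for theta_i, lam in zip(finetuned, lambdas):
        merged += lam * (theta_i - theta_base)
    return merged

base = np.array([0.0, 0.0])
code = np.array([1.0, 0.0])   # hypothetical code fine-tune
math = np.array([0.0, 1.0])   # hypothetical math fine-tune
merged = task_arithmetic(base, [code, math], [1.0, 1.0])
assert np.allclose(merged, [1.0, 1.0])
# A negative coefficient subtracts (forgets) that capability:
assert np.allclose(task_arithmetic(base, [code], [-1.0]), [-1.0, 0.0])
```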
Fisher-Weighted Merging and RegMean
Fisher-Weighted Merging
Matena and Raffel (2022) weight each parameter $j$ by the diagonal of the Fisher information matrix $F_i$ of model $i$, which estimates how much the model's output distribution changes when that parameter is perturbed:

$$\theta_{\text{merged}}^{(j)} = \frac{\sum_i F_i^{(j)}\,\theta_i^{(j)}}{\sum_i F_i^{(j)}}$$
Parameters that matter more for a model's task (higher Fisher diagonal) get more weight in the merge. This improves over uniform averaging when the models disagree sharply about which parameters are important.
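The merge itself is a per-parameter weighted average once the Fisher diagonals have been estimated. A minimal sketch (the Fisher values here are made up to show the effect; in practice they come from squared gradients of the log-likelihood):

```python
import numpy as np

def fisher_merge(thetas, fishers, eps=1e-8):
    """Per-parameter weighted average, weights = diagonal Fisher estimates."""
    num = sum(f * t for f, t in zip(fishers, thetas))
    den = sum(fishers) + eps              # eps guards positions with zero Fisher
    return num / den

t1, t2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
f1 = np.array([10.0, 1.0])   # model 1's task depends mostly on parameter 0
f2 = np.array([1.0, 10.0])   # model 2's task depends mostly on parameter 1
merged = fisher_merge([t1, t2], [f1, f2])
# Each parameter is pulled toward the model that depends on it most:
assert merged[0] > 0.9 and merged[1] > 0.9
```

With uniform Fisher values this reduces to the plain average, which is the right sanity check for an implementation.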
RegMean
Jin et al. (2023) cast merging as a closed-form least-squares problem per linear layer. For a linear layer with input activations $X_i$ having second-moment (Gram) matrix $G_i = X_i^\top X_i$ on model $i$'s task, the merged weight matrix solves:

$$W_{\text{merged}} = \arg\min_W \sum_i \|X_i W - X_i W_i\|_F^2 = \Big(\sum_i G_i\Big)^{-1} \sum_i G_i W_i$$

This is "dataless" in the sense that the $G_i$ are precomputed statistics, not a training loop. RegMean outperforms Fisher-weighted merging when the input distributions across models differ substantially.
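The closed form is one linear solve per layer. A minimal sketch with random stand-ins for the per-layer weights and activation statistics:

```python
import numpy as np

def regmean_merge(weights, grams):
    """Closed-form least-squares merge for one linear layer:
    W = (sum_i G_i)^{-1} sum_i G_i W_i, with G_i = X_i^T X_i."""
    G_sum = sum(grams)
    rhs = sum(G @ W for G, W in zip(grams, weights))
    return np.linalg.solve(G_sum, rhs)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))   # layer weights (in x out)
X1, X2 = rng.normal(size=(16, 4)), rng.normal(size=(16, 4)) # activations per task
G1, G2 = X1.T @ X1, X2.T @ X2
W = regmean_merge([W1, W2], [G1, G2])
# Sanity check: identical Gram matrices reduce RegMean to the plain average.
assert np.allclose(regmean_merge([W1, W2], [G1, G1]), (W1 + W2) / 2)
```

In practice one would regularize $G_i$ toward its diagonal (as the paper does) to keep the solve well-conditioned; that detail is omitted here.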
TIES-Merging
TIES-Merging
TIES-Merging (Yadav et al., 2023) addresses a problem with naive averaging: when merging multiple models, task-specific weight changes from different models can cancel each other out if they point in opposite directions.
The TIES algorithm has three steps:

1. Trim: For each model, compute the task vector $\tau_i = \theta_i - \theta_0$ (the change from the base model). Keep the top-$k$% of entries by absolute magnitude (Yadav et al. use $k = 20$ as the default) and zero out the rest. This is a per-model top-$k$ selection, not a fixed threshold.

2. Elect sign: For each weight position, take a magnitude-weighted majority vote across models on whether the change should be positive or negative. This resolves sign conflicts.

3. Disjoint merge: For each position, average only the entries whose sign matches the elected sign. Positions with no matching entries are set to zero. The final model is

$$\theta_{\text{merged}} = \theta_0 + \lambda\,\tau_{\text{TIES}}$$

where $\lambda$ is a scaling factor.
The key insight: naive averaging treats all weight changes equally. TIES recognizes that some changes are noise (small magnitude) and some conflict (opposite signs across models). By keeping only the top-k% most important entries per model and resolving sign conflicts before merging, TIES produces cleaner combinations.
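The three steps above can be sketched in a few lines of numpy. This is a simplified illustration (flat weight vectors, quantile-based trim) rather than the paper's reference implementation:

```python
import numpy as np

def ties_merge(theta_base, finetuned, k=0.2, lam=1.0):
    """TIES sketch: trim each task vector to its top-k fraction of entries
    by magnitude, elect a per-position sign by magnitude-weighted vote,
    then average only the entries that agree with the elected sign."""
    taus = np.stack([t - theta_base for t in finetuned]).astype(float)
    # 1. Trim: per-model top-k selection by absolute magnitude.
    for tau in taus:                       # rows are views; edits hit `taus`
        cutoff = np.quantile(np.abs(tau), 1.0 - k)
        tau[np.abs(tau) < cutoff] = 0.0
    # 2. Elect sign: sign of the magnitude-weighted sum across models.
    elected = np.sign(taus.sum(axis=0))
    # 3. Disjoint merge: average entries matching the elected sign.
    agree = (np.sign(taus) == elected) & (taus != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_tau = (taus * agree).sum(axis=0) / counts
    return theta_base + lam * merged_tau

base = np.zeros(4)
t1 = np.array([1.0, 0.5, 0.0, 0.01])   # tiny entry at position 3 gets trimmed
t2 = np.array([-0.2, 0.5, 0.9, 0.0])   # sign conflict at position 0
assert np.allclose(ties_merge(base, [t1, t2], k=0.5), [1.0, 0.5, 0.9, 0.0])
```

Note how position 0 keeps only model 1's value: model 2's conflicting $-0.2$ was trimmed, so no cancellation occurs.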
DARE: Drop and Rescale
DARE
DARE (Yu et al., 2024) takes a more aggressive approach to sparsification before merging:

- Compute task vectors $\tau_i = \theta_i - \theta_0$ for each model.
- For each task vector, randomly drop a fraction $p$ of the entries (set them to zero). DARE uses high drop rates, typically $p \in [0.9, 0.99]$.
- Rescale the remaining entries by $\frac{1}{1 - p}$ to preserve the expected magnitude.
- Merge the sparsified task vectors by averaging.

$$\tilde{\tau}_i = \frac{m_i \odot \tau_i}{1 - p}$$

where $m_i$ is a binary mask with entries drawn i.i.d. Bernoulli($1 - p$).
Why this works: DARE's premise is that fine-tuning task vectors are highly redundant. Most entries contribute little to the task-specific capability, so dropping 90-99% of them and rescaling preserves the expected contribution while sharply reducing interference between models during merging. Yu et al. show that drop rates up to $p = 0.99$ retain most of the fine-tuned capability on language tasks.
DARE and TIES can be combined: apply DARE's random dropping, then TIES's sign election, then merge.
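Drop-and-rescale is two lines per task vector; the rescaling is exactly the trick used in inverted dropout. A minimal sketch:

```python
import numpy as np

def dare(tau, p, rng):
    """Drop a fraction p of task-vector entries; rescale survivors by 1/(1-p)
    so the expected value of each entry is unchanged."""
    mask = rng.random(tau.shape) >= p     # keep each entry with probability 1 - p
    return tau * mask / (1.0 - p)

rng = np.random.default_rng(0)
tau = rng.normal(size=100_000)
sparse = dare(tau, p=0.9, rng=rng)
assert np.mean(sparse == 0) > 0.85        # roughly 90% of entries dropped
# Rescaling preserves the mean contribution of the task vector:
assert abs(sparse.mean() - tau.mean()) < 0.05
```

The sparsified $\tilde{\tau}_i$ are then combined with plain averaging or task arithmetic.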
Evolutionary Model Merge
Evolutionary Model Merge
Akiba et al. (2024, Sakana AI) search for per-layer merge coefficients using evolutionary optimization (CMA-ES). The search space combines two axes:
- Parameter space (PS) merging: per-layer mixing weights $\lambda_\ell$ for each layer $\ell$, applied to task vectors or SLERP.
- Data-flow space (DFS) merging: a discrete per-layer routing that selects which source model's layer to use at each depth.
The fitness is a downstream-task evaluation on a held-out set. This removes the need to hand-tune global and per-layer merge weights, at the cost of requiring evaluation compute during the search.
Aligning Independently Trained Models
Git Re-Basin
Ainsworth, Hayase, and Srinivasa (2023) formalize the permutation symmetry of neural networks: for any permutation $P_\ell$ applied to the hidden units of layer $\ell$, there is an equivalent model obtained by replacing $W_\ell$ with $P_\ell W_\ell$ and $W_{\ell+1}$ with $W_{\ell+1} P_\ell^\top$, giving identical input-output behaviour. Two independently trained models occupy weight-space points related by (approximately) such permutations.
Weight matching and activation matching algorithms search for permutations that align one model's hidden units to the other's. After alignment, the linear interpolation between the two models stays in a low-loss region on vision benchmarks, recovering linear mode connectivity for independently trained models. Related follow-ups (ZipIt!, Stoica et al. 2023; Model Breadcrumbs, Davari and Belilovsky 2024) extend this idea to merging models without a shared base or to sparsifying task vectors for more robust composition.
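Weight matching can be illustrated on a single layer: find the permutation of model B's hidden units whose incoming weight rows best match model A's. Git Re-Basin solves this as an optimal assignment problem; the greedy pass below is a deliberate simplification for illustration:

```python
import numpy as np

def match_units(W_a, W_b):
    """Greedy weight matching: for each hidden unit of A, pick the unused
    unit of B whose incoming weight row has the largest dot product."""
    sim = W_a @ W_b.T                     # sim[i, j]: unit i of A vs unit j of B
    perm, used = [], set()
    for i in range(W_a.shape[0]):
        j = max((j for j in range(W_b.shape[0]) if j not in used),
                key=lambda j: sim[i, j])
        perm.append(j)
        used.add(j)
    return np.array(perm)

rng = np.random.default_rng(0)
W_a = rng.normal(size=(6, 64))            # 6 hidden units, 64 inputs
true_perm = rng.permutation(6)
W_b = W_a[true_perm]                      # "model B": same units, shuffled order
perm = match_units(W_a, W_b)
assert np.allclose(W_b[perm], W_a)        # alignment undoes the shuffle
```

A full implementation alternates this per-layer matching across all layers (and would use an optimal solver such as the Hungarian algorithm), then interpolates the aligned weights.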
Model Soups
Model Soups
"Model soups" (Wortsman et al., 2022) refers to averaging the weights of models trained with different hyperparameters (learning rate, weight decay, augmentation) on the same task. Instead of selecting the best model by validation performance, you average all models that exceed a quality threshold.
The result consistently outperforms the best individual model because:
- Each hyperparameter setting explores a slightly different part of the basin
- Averaging moves toward the center of the basin (similar to SWA)
- The averaged model is more robust to distribution shift than any individual model
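Wortsman et al.'s "greedy soup" variant adds models to the average one at a time, keeping each only if held-out performance does not drop. A minimal sketch with a toy validation score standing in for real benchmark evaluation:

```python
import numpy as np

def greedy_soup(models, val_score):
    """Greedy soup: visit models in descending validation score; add each
    to the running average only if the averaged model scores at least
    as well as the current soup."""
    order = sorted(models, key=val_score, reverse=True)
    soup, n = order[0], 1
    for theta in order[1:]:
        candidate = (n * soup + theta) / (n + 1)
        if val_score(candidate) >= val_score(soup):
            soup, n = candidate, n + 1
    return soup

# Toy "validation score": negative distance to a hidden optimum.
optimum = np.array([1.0, 1.0])
score = lambda th: -float(np.linalg.norm(th - optimum))
models = [np.array([1.2, 0.8]), np.array([0.8, 1.2]), np.array([3.0, 3.0])]
soup = greedy_soup(models, score)
assert score(soup) >= max(score(m) for m in models)  # soup beats every ingredient
```

Note how the outlier at `[3.0, 3.0]` is rejected: greedy selection is what makes soups robust to a few bad runs.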
Applications to LLM Merging
Model merging has become especially popular in the open-weight LLM community:
Combining specialized fine-tunes. Start with a base model (e.g., Llama 3 8B). Fine-tune separately for code, math, and conversation. Merge the three fine-tunes to get a model that does all three. This is much cheaper than training a single model on all three datasets jointly.
Community model development. Open-weight model communities on Hugging Face routinely merge models to combine capabilities. Tools like mergekit provide SLERP, TIES, DARE, and linear interpolation as merge strategies.
Avoiding catastrophic forgetting. Fine-tuning on a specialized dataset often degrades performance on general tasks (catastrophic forgetting). Merging the fine-tuned model with the base model (with the base model weighted more heavily) preserves general capability while adding specialized skill.
Common Confusions
Merging weights is not the same as ensembling predictions
Ensembling runs multiple models and combines their outputs (by averaging predictions or voting). This always works but requires running all models at inference time. Weight merging produces a single model. It is much cheaper at inference but only works when the models share a loss basin. These are structurally different operations.
Independently trained models are not unmergeable, they are unaligned
A common overclaim is that independently trained models simply cannot be merged. The correct statement is that independently trained SGD runs land in loss basins related by permutation symmetries of the hidden units: the same concept is encoded in different neuron indices across runs, so direct linear averaging mixes incompatible coordinates. Ainsworth, Hayase, and Srinivasa (2023, Git Re-Basin, ICLR) give algorithms (weight matching, activation matching) that recover the permutation aligning one model to another. After alignment, linear interpolation between the two aligned models stays in a low-loss region, so they can be merged. The blanket "cannot merge" claim applies only to naive linear merging without alignment, and only becomes fundamental when the architectures themselves differ (distinct tokenizers, widths, or depths).
SLERP is not always better than linear interpolation
SLERP preserves norm and interpolates along the sphere. For some models and tasks, linear interpolation works just as well or better. The theoretical advantage of SLERP (norm preservation) matters most when the models have similar norms and the interpolation path matters. In practice, try both.
Summary
- Weight averaging works because fine-tuned models share a loss basin (mode connectivity)
- Shared initialization is the simplest sufficient condition; independently trained models can still be merged after permutation alignment (Git Re-Basin)
- SWA: average checkpoints from a single run. Cheap, effective, moves toward basin center
- SLERP: spherical interpolation preserving norm when $\|\theta_1\| = \|\theta_2\|$ and $\sin\Omega \neq 0$. Default in community tools
- Task Arithmetic: merge as $\theta_0 + \sum_i \lambda_i\,\tau_i$ using task vectors $\tau_i = \theta_i - \theta_0$
- Fisher-weighted merging and RegMean: weight parameter contributions by curvature or input covariance
- TIES-Merging: keep top-k% of task-vector entries, elect sign by magnitude-weighted majority, disjoint-merge agreements
- DARE: randomly drop 90-99% of task vector entries, rescale by $\frac{1}{1-p}$, then merge
- Model soups: average models with different hyperparameters. Beats the best individual model
- Evolutionary Model Merge: search over per-layer merge coefficients via CMA-ES
- Applications: combining specialized LLM fine-tunes, reducing catastrophic forgetting
Exercises
Problem
Explain why averaging the weights of two models trained from the same base model is more likely to succeed than averaging the weights of two models trained from different random initializations. Use the concept of mode connectivity.
Problem
In TIES-Merging, the "trim" step keeps the top-k% of entries by magnitude in each task vector before merging. Explain: (a) why small-magnitude entries should be removed, (b) what happens if k is too small (too aggressive trim), and (c) what happens if k is too large (too lenient trim).
Problem
You have a base Llama 3 8B model and three specialized fine-tunes: one for Python code generation, one for mathematical reasoning, and one for biomedical text. Design a merging strategy that produces a single model performing well on all three domains. Specify the merging method, any hyperparameters, and how you would evaluate the result.
References
Canonical:
- Izmailov, Podoprikhin, Garipov, Vetrov, Wilson, "Averaging Weights Leads to Wider Optima and Better Generalization" (SWA, UAI 2018).
- Wortsman et al., "Model Soups: Averaging Weights of Multiple Fine-tuned Models Improves Accuracy without Increasing Inference Time" (ICML 2022).
- Frankle, Dziugaite, Roy, Carbin, "Linear Mode Connectivity and the Lottery Ticket Hypothesis" (ICML 2020). Mode connectivity theory.
- Matena and Raffel, "Merging Models with Fisher-Weighted Averaging" (NeurIPS 2022).
Alignment of independently trained models:
- Ainsworth, Hayase, Srinivasa, "Git Re-Basin: Merging Models modulo Permutation Symmetries" (ICLR 2023).
- Stoica, Bolya, Bjorner, Ramesh, Hearn, Hoffman, "ZipIt! Merging Models from Different Tasks without Training" (2023).
Task-vector methods:
- Ilharco, Ribeiro, Wortsman, Schmidt, Hajishirzi, Farhadi, "Editing Models with Task Arithmetic" (ICLR 2023).
- Yadav, Tam, Choshen, Raffel, Bansal, "TIES-Merging: Resolving Interference When Merging Models" (NeurIPS 2023). Top-$k$% trim with $k = 20$ default.
- Yu, Yu, Yu, Huang, Li, "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch" (DARE, ICML 2024, arXiv:2311.03099). Drop rates up to $p = 0.99$.
- Jin, Ren, Preotiuc-Pietro, Cheng, "Dataless Knowledge Fusion by Merging Weights of Language Models" (RegMean, ICLR 2023).
- Davari and Belilovsky, "Model Breadcrumbs: Scaling Multi-Task Model Merging with Sparse Masks" (2024).
Search-based merging:
- Akiba, Shing, Tang, Sun, Ha, "Evolutionary Optimization of Model Merging Recipes" (Sakana AI, arXiv:2403.13187, 2024).
Next Topics
- Transformer architecture: the base architecture for models being merged
- Parameter-efficient fine-tuning: LoRA and other methods that produce mergeable adapters
Last reviewed: April 27, 2026