Methodology

Federated Learning

Train a global model without centralizing data. FedAvg, communication efficiency, non-IID convergence challenges, differential privacy integration, and applications in healthcare and mobile computing.

Advanced · Tier 2 · Current · Supporting · ~50 min

Why This Matters

In many real applications, data cannot be centralized. Hospitals cannot share patient records. Phone keyboards should not upload everything you type to a server. Companies in competing jurisdictions cannot pool their datasets. Federated learning trains a shared model while the data stays on each client device.

The core idea is simple: send the model to the data, not the data to the model. Each client trains locally and sends back model updates (gradients or weight deltas). A central server aggregates these updates. No raw data leaves the client.

Two distinctions matter immediately. First, federated learning is a systems / optimization setup, not automatically a privacy guarantee. Second, there are two major deployment regimes: cross-device FL (millions of phones or edge devices, each with small local datasets and unreliable participation) and cross-silo FL (a small number of organizations such as hospitals or banks, each with larger datasets and more stable connectivity). The mathematics of FedAvg starts the same, but the threat model and communication regime differ.

Problem Setup

Definition

Federated Learning Objective

There are $K$ clients, each with local dataset $\mathcal{D}_k$. The global objective is:

$$\min_\theta F(\theta) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(\theta), \quad F_k(\theta) = \frac{1}{n_k} \sum_{(x,y) \in \mathcal{D}_k} \ell(f_\theta(x), y)$$

where $n_k = |\mathcal{D}_k|$ and $n = \sum_k n_k$. This is a weighted average of local losses, where each client's weight is proportional to its data size.
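The weighted objective above is easy to make concrete. Below is a minimal NumPy sketch (the function name and the toy squared-distance losses are illustrative, not from the source):

```python
import numpy as np

def global_objective(theta, client_losses, client_sizes):
    """Weighted federated objective F(theta) = sum_k (n_k / n) F_k(theta).

    client_losses: list of callables F_k(theta) returning the local average loss.
    client_sizes:  list of n_k = |D_k|.
    """
    n = sum(client_sizes)
    return sum((n_k / n) * F_k(theta)
               for F_k, n_k in zip(client_losses, client_sizes))

# Toy check with squared-distance losses F_k(theta) = ||theta - mu_k||^2
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
losses = [lambda th, mu=mu: float(np.sum((th - mu) ** 2)) for mu in mus]
print(global_objective(np.array([1.0, 1.0]), losses, [100, 300]))  # 2.0
```

Note that the client with 300 examples gets weight 0.75, three times that of the client with 100 examples.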

Definition

Communication Round

One communication round consists of: (1) the server sends the current global model $\theta^t$ to a subset of clients, (2) each selected client trains locally for $E$ epochs starting from $\theta^t$, (3) clients send updated parameters $\theta_k^{t+1}$ back to the server, (4) the server aggregates updates to form $\theta^{t+1}$.

If $S_t$ is the selected client set in round $t$ and $n_t = \sum_{k \in S_t} n_k$, the standard weighted aggregation is

$$\theta^{t+1} = \sum_{k \in S_t} \frac{n_k}{n_t}\,\theta_k^{t+1}.$$

The normalization is over the participating clients in that round, not all clients in the federation.

FedAvg: Federated Averaging

The FedAvg algorithm (McMahan et al., 2017):

  1. Server initializes $\theta^0$
  2. For each round $t = 0, 1, 2, \ldots$:
    • Server selects a fraction $C$ of clients (typically $C = 0.1$)
    • Each selected client $k$ runs SGD for $E$ local epochs on $\mathcal{D}_k$, starting from $\theta^t$
    • Client sends $\theta_k^{t+1}$ to the server
    • Server computes $\theta^{t+1} = \sum_{k \in S_t} \frac{n_k}{n_t} \theta_k^{t+1}$
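The loop above can be sketched end to end in NumPy. This is a toy simulation on synthetic least-squares clients, not the production algorithm; the learning rate, batch size, and client construction are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_sgd(theta, X, y, epochs, lr=0.01, batch=8):
    """Run `epochs` passes of mini-batch SGD on a least-squares loss."""
    theta = theta.copy()
    n = len(y)
    for _ in range(epochs):
        for i in range(0, n, batch):
            Xb, yb = X[i:i + batch], y[i:i + batch]
            grad = 2 * Xb.T @ (Xb @ theta - yb) / len(yb)
            theta -= lr * grad
    return theta

def fedavg_round(theta, clients, C=0.5, E=5):
    """One FedAvg round: sample a fraction C of clients, train E local epochs, average."""
    k = max(1, int(C * len(clients)))
    idx = rng.choice(len(clients), size=k, replace=False)
    sizes = np.array([len(clients[i][1]) for i in idx], dtype=float)
    locals_ = [local_sgd(theta, *clients[i], epochs=E) for i in idx]
    weights = sizes / sizes.sum()          # n_k / n_t over participants only
    return sum(w * th for w, th in zip(weights, locals_))

# Toy federation: 4 clients whose data all come from y = x @ [1, -2] plus noise
true_w = np.array([1.0, -2.0])
clients = []
for _ in range(4):
    X = rng.normal(size=(40, 2))
    clients.append((X, X @ true_w + 0.01 * rng.normal(size=40)))

theta = np.zeros(2)
for _ in range(30):
    theta = fedavg_round(theta, clients)
print(theta)  # approaches [1, -2]
```

Because these toy clients share the same underlying data distribution, FedAvg converges cleanly; the non-IID complications discussed below do not yet bite here.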
Proposition

FedAvg Equals Ordinary Gradient Descent in the Homogeneous Regime

Statement

If every participating client has the same objective $F_k = F$ and runs the same deterministic full-batch gradient-descent update from the same starting point $\theta^t$, then all local iterates remain identical throughout the round. Consequently, the FedAvg aggregate after $E$ local steps is exactly the iterate $\theta^{(E)}$ that ordinary gradient descent would produce after $E$ steps on $F$ from $\theta^t$.

Intuition

If all clients are really optimizing the same objective, there is nothing to "federate away." Local steps do not create disagreement, so averaging is mathematically inert. FedAvg is then just ordinary optimization with delayed communication.

Proof Sketch

Induct on the local-step index $e$. At $e = 0$ all clients start from the same $\theta^t$. If they are equal at step $e$, then they apply the same update rule using the same objective and learning rate, so the step-$(e+1)$ iterates are also equal. The aggregate of identical vectors is that same vector.

Why It Matters

This isolates the true difficulty in federated learning: heterogeneity. When clients are homogeneous, local computation is harmless. When clients have different objectives, the local steps pull in different directions and averaging becomes approximate rather than exact.

Failure Mode

Real federated learning is almost never in this regime. Client sampling is partial, local optimization is stochastic, and the local objectives differ because user data are heterogeneous. Those effects are exactly what create client drift.
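The homogeneous-regime claim can be checked numerically. In the sketch below (a toy quadratic objective; all names are illustrative), three identical "clients" run the same deterministic full-batch gradient descent, and their average coincides with the centralized iterate:

```python
import numpy as np

def gd_steps(theta, grad, lr, steps):
    """Deterministic full-batch gradient descent from theta."""
    theta = theta.copy()
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Shared objective F(theta) = ||A theta - b||^2 on every "client"
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 3.0])
grad = lambda th: 2 * A.T @ (A @ th - b)

theta0 = np.zeros(2)
E, lr = 10, 0.05

# Three homogeneous clients run the same deterministic updates...
locals_ = [gd_steps(theta0, grad, lr, E) for _ in range(3)]
fedavg = np.mean(locals_, axis=0)

# ...so their average equals E steps of ordinary gradient descent.
central = gd_steps(theta0, grad, lr, E)
print(np.allclose(fedavg, central))  # True
```

Averaging is "mathematically inert" here precisely because the local trajectories never diverge.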

Heterogeneity and Client Drift

In practice, client data is almost never IID. A hospital in a rural area sees different conditions than one in a city. Your phone keyboard reflects your vocabulary, not the population's.

One common way to quantify client mismatch is the gradient heterogeneity

$$\Gamma(\theta) = \sum_{k=1}^{K} p_k \left\|\nabla F_k(\theta) - \nabla F(\theta)\right\|^2, \qquad p_k = \frac{n_k}{n}.$$

If $\Gamma(\theta) = 0$, all client gradients agree with the global gradient at that parameter. Large $\Gamma(\theta)$ means local training directions differ substantially across clients.
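Computing $\Gamma$ at a given $\theta$ is a few lines of NumPy (the function name is illustrative):

```python
import numpy as np

def gradient_heterogeneity(client_grads, client_sizes):
    """Gamma(theta) = sum_k p_k ||grad F_k - grad F||^2 with p_k = n_k / n."""
    p = np.asarray(client_sizes, dtype=float)
    p /= p.sum()
    grads = np.asarray(client_grads, dtype=float)
    global_grad = p @ grads                      # grad F = sum_k p_k grad F_k
    diffs = grads - global_grad
    return float(p @ np.sum(diffs ** 2, axis=1))

# Agreeing clients -> Gamma = 0; opposing clients -> Gamma > 0
print(gradient_heterogeneity([[1.0, 0.0], [1.0, 0.0]], [10, 10]))   # 0.0
print(gradient_heterogeneity([[1.0, 0.0], [-1.0, 0.0]], [10, 10]))  # 1.0
```

The second call is the pathological case: the global gradient is zero while every client gradient has norm one, so averaging cancels all local signal.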

Proposition

Heterogeneity Amplifies with Local Steps

Statement

In representative convergence bounds for FedAvg / local-SGD methods, the optimization error can be decomposed schematically as

$$\text{centralized-SGD error} \;+\; \text{stochastic-gradient variance} \;+\; O(\eta^2 E^2 \Gamma),$$

up to problem-dependent smoothness and participation constants. The key structural fact is that the heterogeneity penalty worsens with both the number of local steps $E$ and the gradient mismatch $\Gamma$.

Intuition

Every local step follows $\nabla F_k$, not $\nabla F$. If clients disagree about the right direction, then taking more local steps before averaging lets that disagreement accumulate. Communication decreases, but bias from client drift increases.

Proof Sketch

Compare one round of local training on each $F_k$ with the corresponding centralized update on $F = \sum_k p_k F_k$. Smoothness lets you control how gradient mismatch at one local step perturbs the next; summing these perturbations over $E$ local steps yields an error term whose dominant dependence grows with both $E$ and the heterogeneity level. This is the structural reason methods such as FedProx and SCAFFOLD help: they reduce the effective drift term.

Why It Matters

This is the real communication tradeoff in FL. On nearly homogeneous clients, several local epochs save bandwidth with little pain. On heavily non-IID data, the same choice of $E$ can destabilize training or degrade the eventual model, which is why FL systems almost always tune $E$, sampling rate, and drift corrections together.

Failure Mode

The exact constants and even the sharp dependence on $E$ vary with the assumptions: convex vs nonconvex objectives, partial participation, bounded dissimilarity models, and whether the algorithm uses correction terms. The point of the proposition is not the exact prefactor; it is the mechanism that larger local training amplifies heterogeneity-induced bias.
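The drift mechanism shows up even in a two-client toy problem. In the sketch below (all parameter values are illustrative), two 1-D quadratics with different curvatures are trained with FedAvg: with $E = 1$ the fixed point is the true minimizer, while with many local steps the fixed point drifts toward the plain average of the local minimizers:

```python
def fedavg_quadratics(E, eta=0.1, rounds=2000):
    """FedAvg on two 1-D quadratics F_k = a_k (theta - mu_k)^2 / 2, equal weights."""
    clients = [(1.0, 0.0), (4.0, 1.0)]   # (curvature a_k, local minimizer mu_k)
    theta = 0.0
    for _ in range(rounds):
        locals_ = []
        for a, mu in clients:
            th = theta
            for _ in range(E):           # E local gradient steps on F_k
                th -= eta * a * (th - mu)
            locals_.append(th)
        theta = sum(locals_) / len(locals_)
    return theta

# True minimizer of F = (F_1 + F_2) / 2 is (1*0 + 4*1) / (1 + 4) = 0.8
print(fedavg_quadratics(E=1))    # ~0.8: one local step, no drift bias
print(fedavg_quadratics(E=50))   # ~0.5: many local steps bias toward the plain mean of mu_k
```

With $E = 1$ the round is equivalent to one centralized gradient step on $F$; with $E = 50$ each client nearly converges to its own minimizer before averaging, so the aggregate settles near $(\mu_1 + \mu_2)/2 = 0.5$ instead of the curvature-weighted optimum $0.8$.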

Communication Efficiency

Communication is the bottleneck in federated learning. Sending full model updates over mobile networks is slow and expensive.

Gradient compression: quantize gradients to lower precision (e.g., 1-bit SGD) or sparsify them (send only the top-$k$ gradient entries). SignSGD sends only the sign of each gradient entry.
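Both compression styles are a few lines each. A minimal sketch (function names are illustrative; real systems add error feedback and careful encoding):

```python
import numpy as np

def top_k_sparsify(grad, k):
    """Keep only the k largest-magnitude entries; zero the rest."""
    out = np.zeros_like(grad)
    idx = np.argsort(np.abs(grad))[-k:]
    out[idx] = grad[idx]
    return out

def sign_compress(grad):
    """SignSGD-style compression: one sign per coordinate."""
    return np.sign(grad)

g = np.array([0.1, -3.0, 0.02, 2.0, -0.5])
print(top_k_sparsify(g, 2))   # [ 0. -3.  0.  2.  0.]
print(sign_compress(g))       # [ 1. -1.  1.  1. -1.]
```

Top-$k$ sends $k$ values plus indices; sign compression sends one bit per coordinate, a 32x reduction over float32 gradients before any entropy coding.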

FedAvg itself is a communication reduction: by running $E$ local epochs per round, clients communicate once per many local steps instead of once per step, cutting the number of communication rounds relative to distributed SGD.

Model distillation: instead of sending full model updates, clients send logits or soft labels on a shared public dataset. This compresses the communication to $O(\text{dataset size} \times \text{classes})$ instead of $O(\text{parameters})$.

Secure Aggregation and Privacy

Definition

Secure Aggregation

Secure aggregation is a cryptographic protocol that lets the server learn only an aggregate such as $\sum_{k \in S_t} \Delta_k$, not the individual client updates $\Delta_k$ themselves. Bonawitz et al. (2017) is the canonical cross-device construction.
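The core cancellation idea can be shown with a toy pairwise-masking scheme in the spirit of Bonawitz et al. (this sketch omits dropout handling, key agreement, and all the actual cryptography; names and the modulus are illustrative):

```python
import random

def mask_updates(updates, modulus=2**16, seed=0):
    """Toy pairwise-mask secure aggregation on quantized integer updates.

    Each pair (i, j) with i < j shares a random mask; client i adds it and
    client j subtracts it. Individual masked updates look random, but the
    masks cancel exactly in the sum.
    """
    rng = random.Random(seed)
    n = len(updates)
    pair_mask = {(i, j): rng.randrange(modulus)
                 for i in range(n) for j in range(i + 1, n)}
    masked = []
    for i, u in enumerate(updates):
        m = u
        for j in range(n):
            if i < j:
                m += pair_mask[(i, j)]
            elif j < i:
                m -= pair_mask[(j, i)]
        masked.append(m % modulus)
    return masked

updates = [3, 7, 11]            # quantized client updates
masked = mask_updates(updates)
print(sum(masked) % 2**16)      # 21: the server recovers only the sum
```

In the real protocol the pairwise masks come from Diffie-Hellman key agreement, and secret sharing lets the server complete the sum even when clients drop out mid-round.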

Secure aggregation and differential privacy solve different problems:

  • Secure aggregation hides individual updates from the server and other clients during a round.
  • Differential privacy bounds how much the aggregate output changes when a client (or one example) is added or removed.

Secure aggregation protects the transport layer of training; differential privacy protects against inference from the released aggregate or final model.

Federated Learning + Differential Privacy

Federated learning does not guarantee privacy by itself. Model updates can leak information about the training data (via gradient inversion attacks). Adding differential privacy provides formal guarantees.

In cross-device FL, the most common guarantee is user-level DP: neighboring datasets differ by the presence or absence of one client's entire local dataset. User-level DP is the stronger notion because one neighboring change can add or remove an entire client's data, not a single record. It is therefore usually stricter and more expensive than example-level DP, which only protects a single training example.

One standard user-level DP pattern is: clip each client's update to norm $C$, aggregate the clipped updates, then add Gaussian noise calibrated to the aggregate sensitivity:

$$\tilde{\Delta} = \frac{1}{|S_t|}\left(\sum_{k \in S_t} \operatorname{clip}(\Delta_k, C) + \mathcal{N}(0, \sigma^2 C^2 I)\right).$$

In deployed systems, the noise can be added centrally or in a distributed way on top of secure aggregation. The resulting mechanism provides $(\varepsilon, \delta)$-DP with respect to the chosen neighboring relation (typically one user / one client).
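The clip-then-noise formula above translates directly to code. A minimal sketch (the function name and the small noise multiplier are illustrative; this computes the mechanism but does not do privacy accounting):

```python
import numpy as np

def dp_aggregate(updates, clip_norm, noise_mult, rng=None):
    """Clip each update to L2 norm C, sum, add N(0, sigma^2 C^2 I), average."""
    if rng is None:
        rng = np.random.default_rng(0)
    clipped = []
    for u in updates:
        norm = np.linalg.norm(u)
        clipped.append(u * min(1.0, clip_norm / max(norm, 1e-12)))  # ||clip(u, C)|| <= C
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm, size=total.shape)
    return (total + noise) / len(updates)

updates = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]   # norms 5.0 and 0.5
out = dp_aggregate(updates, clip_norm=1.0, noise_mult=0.1)
print(out)  # near the average of the clipped updates, [0.45, 0.6]
```

Clipping bounds each client's influence on the sum by $C$, which is what makes the Gaussian noise scale $\sigma C$ sufficient regardless of how large any one raw update is.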

The cost: added noise reduces convergence speed. More privacy (smaller $\varepsilon$) means more noise and slower convergence.

Common Confusions

Watch Out

Federated learning does not guarantee privacy

Raw model updates can reveal training data. Gradient inversion attacks can reconstruct input images from gradients with high fidelity. Federated learning must be combined with differential privacy, secure aggregation, or both to provide meaningful privacy guarantees.

Watch Out

Secure aggregation is not the same thing as differential privacy

Secure aggregation stops the server from reading individual updates in a round, but it does not by itself bound what can be learned from the aggregate or the final trained model. Differential privacy gives a formal participation guarantee; secure aggregation gives a cryptographic visibility guarantee. In practice, many systems want both.

Watch Out

FedAvg is not the same as distributed SGD

In distributed SGD, workers compute gradients on different data shards and average them every step. In FedAvg, clients run multiple local SGD steps before averaging. This distinction matters: FedAvg introduces client drift, which distributed SGD does not have (because it averages every step).

Watch Out

Cross-device and cross-silo federated learning are different regimes

Cross-device FL has massive client populations, partial participation, device dropout, and user-level privacy concerns. Cross-silo FL has far fewer parties, larger local datasets, and often stronger legal or contractual governance. Calling both simply "federated learning" is fine at a high level, but system design choices that make sense in one regime can be wrong in the other.

Canonical Examples

Example

Mobile keyboard next-word prediction

Google's Gboard uses federated learning for next-word prediction. Each phone trains locally on the user's typing data. Model updates are sent to a server during idle charging time. The server averages updates across millions of devices. No individual keystrokes leave the phone. The model improves globally while each user's data remains local.

Summary

  • Federated learning sends the model to the data, not data to the model
  • FedAvg: selected clients do local optimization, then the server averages their updates weighted by participating data size
  • In the homogeneous limit, FedAvg reduces to ordinary optimization; the real difficulty is heterogeneity
  • Non-IID data creates client drift, and more local steps amplify that mismatch
  • Communication efficiency comes from local computation, gradient compression, or distillation
  • Privacy requires additional mechanisms: secure aggregation, differential privacy, or both
  • Non-IID data is the primary open challenge

Exercises

ExerciseCore

Problem

You have 100 clients, each with 1000 examples. You run FedAvg with $C = 0.1$ (selecting 10% of clients per round) and $E = 5$ local epochs. Assume batch size 1, so each local epoch takes one gradient step per example. How many total gradient steps are performed across all clients in one communication round?

ExerciseAdvanced

Problem

Client A has data that is entirely class 0. Client B has data that is entirely class 1. Both start from the same initial model. After one round of FedAvg with $E = 10$ local epochs, the averaged model performs poorly on both classes. Explain why, and propose a modification to FedAvg that would help.

References

Canonical:

  • McMahan, Moore, Ramage, Hampson, and Aguera y Arcas, Communication-Efficient Learning of Deep Networks from Decentralized Data (AISTATS 2017). The FedAvg paper.
  • Bonawitz et al., Practical Secure Aggregation for Privacy-Preserving Machine Learning (CCS 2017). The canonical secure-aggregation reference for cross-device FL.
  • Kairouz et al., Advances and Open Problems in Federated Learning (Foundations and Trends in Machine Learning 14, 2021), Sections 1-4. The standard survey reference.

Current:

  • Li, Sahu, Talwalkar, and Smith, Federated Optimization in Heterogeneous Networks (MLSys 2020). FedProx and the bounded-dissimilarity framing.
  • Karimireddy et al., SCAFFOLD: Stochastic Controlled Averaging for Federated Learning (ICML 2020). Control variates for client-drift correction.
  • McMahan et al., Learning Differentially Private Recurrent Language Models (ICLR 2018). Early user-level DP in a federated-style mobile setting.

Last reviewed: April 26, 2026