Methodology
Federated Learning
Train a global model without centralizing data. FedAvg, communication efficiency, non-IID convergence challenges, differential privacy integration, and applications in healthcare and mobile computing.
Why This Matters
In many real applications, data cannot be centralized. Hospitals cannot share patient records. Phone keyboards should not upload everything you type to a server. Companies in competing jurisdictions cannot pool their datasets. Federated learning trains a shared model while the data stays on each client device.
The core idea is simple: send the model to the data, not the data to the model. Each client trains locally and sends back model updates (gradients or weight deltas). A central server aggregates these updates. No raw data leaves the client.
Two distinctions matter immediately. First, federated learning is a systems / optimization setup, not automatically a privacy guarantee. Second, there are two major deployment regimes: cross-device FL (millions of phones or edge devices, each with small local datasets and unreliable participation) and cross-silo FL (a small number of organizations such as hospitals or banks, each with larger datasets and more stable connectivity). The mathematics of FedAvg is the same in both, but the threat model and communication regime differ.
Problem Setup
Federated Learning Objective
There are $K$ clients, each with local dataset $\mathcal{D}_k$ of size $n_k$. The global objective is:

$$\min_w F(w) = \sum_{k=1}^{K} p_k F_k(w)$$

where $F_k(w) = \frac{1}{n_k} \sum_{(x,y) \in \mathcal{D}_k} \ell(w; x, y)$ and $p_k = n_k / n$ with $n = \sum_{k=1}^{K} n_k$. This is a weighted average of local losses, where each client's weight is proportional to its data size.
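As a quick numeric check of the weighting, a minimal sketch (the losses and dataset sizes are made up purely for illustration):

```python
import numpy as np

client_losses = np.array([0.82, 0.41, 0.67])  # hypothetical F_k(w) for K = 3 clients
client_sizes = np.array([1000, 4000, 5000])   # hypothetical n_k

p = client_sizes / client_sizes.sum()          # p_k = n_k / n
global_loss = p @ client_losses                # F(w) = sum_k p_k F_k(w)
print(global_loss)                             # 0.581
```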
Communication Round
One communication round consists of: (1) the server sends the current global model $w_t$ to a subset of clients, (2) each selected client trains locally for $E$ epochs starting from $w_t$, (3) clients send updated parameters $w_{t+1}^k$ back to the server, (4) the server aggregates updates to form $w_{t+1}$.
If $S_t$ is the selected client set in round $t$ and $n_{S_t} = \sum_{k \in S_t} n_k$, the standard weighted aggregation is

$$w_{t+1} = \sum_{k \in S_t} \frac{n_k}{n_{S_t}} \, w_{t+1}^k$$
The normalization is over the participating clients in that round, not all clients in the federation.
FedAvg: Federated Averaging
The FedAvg algorithm (McMahan et al., 2017):
- Server initializes $w_0$
- For each round $t = 0, 1, 2, \dots$:
  - Server selects a fraction $C$ of clients (typically $C = 0.1$)
  - Each selected client $k$ runs SGD for $E$ local epochs on $\mathcal{D}_k$, starting from $w_t$
  - Client sends $w_{t+1}^k$ to the server
  - Server computes $w_{t+1} = \sum_{k \in S_t} \frac{n_k}{n_{S_t}} w_{t+1}^k$
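A minimal sketch of one such round on a least-squares problem, assuming each client is stored as an `(X, y)` pair; `local_sgd`, `fedavg_round`, and all hyperparameters are illustrative names, not the API of any FL library:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_sgd(w, X, y, epochs, lr=0.1, batch_size=32):
    """E epochs of mini-batch SGD on one client's least-squares loss."""
    w = w.copy()
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

def fedavg_round(w_global, clients, C=0.1, E=5):
    """One FedAvg round: sample a fraction C of clients, run E local
    epochs on each, then average weighted by local dataset size n_k."""
    m = max(1, int(C * len(clients)))
    selected = rng.choice(len(clients), size=m, replace=False)
    sizes = np.array([len(clients[k][1]) for k in selected], dtype=float)
    local_models = [local_sgd(w_global, *clients[k], epochs=E) for k in selected]
    p = sizes / sizes.sum()                      # n_k / n_{S_t}
    return sum(pk * wk for pk, wk in zip(p, local_models))
```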
FedAvg Equals Ordinary Gradient Descent in the Homogeneous Regime
Statement
If every participating client has the same objective $F_k = F$ and runs the same deterministic full-batch gradient-descent update from the same starting point $w_t$, then all local iterates remain identical throughout the round. Consequently the FedAvg aggregate after $\tau$ local steps is exactly the same iterate that ordinary gradient descent would produce after $\tau$ steps on $F$ from $w_t$.
Intuition
If all clients are really optimizing the same objective, there is nothing to "federate away." Local steps do not create disagreement, so averaging is mathematically inert. FedAvg is then just ordinary optimization with delayed communication.
Proof Sketch
Induct on the local-step index $s$. At $s = 0$ all clients start from the same $w_t$. If they are equal at step $s$, then they apply the same update rule using the same objective and learning rate, so the step-$(s+1)$ iterates are also equal. The aggregate of identical vectors is that same vector.
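A quick numerical check of the statement on a toy quadratic (everything here is illustrative; any smooth objective with deterministic full-batch steps would behave the same way):

```python
import numpy as np

# Shared objective F(w) = 0.5 * ||A w - b||^2 on every client.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])
grad = lambda w: A.T @ (A @ w - b)

lr, tau = 0.1, 5
w0 = np.array([3.0, 3.0])

def local_run(w):
    """tau deterministic full-batch GD steps from w."""
    for _ in range(tau):
        w = w - lr * grad(w)
    return w

# Two identical clients: averaging their (identical) iterates is inert.
fedavg_w = 0.5 * (local_run(w0) + local_run(w0))
central_w = local_run(w0)                  # tau centralized GD steps
assert np.allclose(fedavg_w, central_w)
```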
Why It Matters
This isolates the true difficulty in federated learning: heterogeneity. When clients are homogeneous, local computation is harmless. When clients have different objectives, the local steps pull in different directions and averaging becomes approximate rather than exact.
Failure Mode
Real federated learning is almost never in this regime. Client sampling is partial, local optimization is stochastic, and the local objectives differ because user data are heterogeneous. Those effects are exactly what create client drift.
Heterogeneity and Client Drift
In practice, client data is almost never IID. A hospital in a rural area sees different conditions than one in a city. Your phone keyboard reflects your vocabulary, not the population's.
One common way to quantify client mismatch is the gradient heterogeneity

$$\zeta(w) = \max_k \left\| \nabla F_k(w) - \nabla F(w) \right\|$$

If $\zeta(w) = 0$, all client gradients agree with the global gradient at that parameter. Large $\zeta$ means local training directions differ substantially across clients.
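A small helper computing this quantity from per-client gradients at a fixed $w$ (a sketch under the max-deviation definition above; the function name and signature are illustrative):

```python
import numpy as np

def gradient_heterogeneity(client_grads, p):
    """zeta(w) = max_k || grad F_k(w) - grad F(w) ||,
    where grad F(w) = sum_k p_k grad F_k(w)."""
    client_grads = np.asarray(client_grads)   # shape (K, d)
    global_grad = p @ client_grads            # weighted average, shape (d,)
    return np.linalg.norm(client_grads - global_grad, axis=1).max()
```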
Heterogeneity Amplifies with Local Steps
Statement
In representative convergence bounds for FedAvg / local-SGD methods, the optimization error can be decomposed schematically as

$$\text{error} \;\lesssim\; \underbrace{\text{optimization term}}_{\text{decays with rounds}} \;+\; \underbrace{\text{variance term}}_{\text{stochastic noise}} \;+\; \underbrace{O\!\left(E^2 \zeta^2\right)}_{\text{client-drift term}}$$

up to problem-dependent smoothness and participation constants. The key structural fact is that the heterogeneity penalty worsens with both the number of local steps $E$ and the gradient mismatch $\zeta$.
Intuition
Every local step follows $\nabla F_k$, not $\nabla F$. If clients disagree about the right direction, then taking more local steps before averaging lets that disagreement accumulate. Communication decreases, but bias from client drift increases.
Proof Sketch
Compare one round of local training on each $F_k$ with the corresponding centralized update on $F$. Smoothness lets you control how gradient mismatch at one local step perturbs the next; summing these perturbations over $E$ local steps yields an error term whose dominant dependence grows with both $E$ and the heterogeneity level $\zeta$. This is the structural reason methods such as FedProx and SCAFFOLD help: they reduce the effective drift term.
Why It Matters
This is the real communication tradeoff in FL. On nearly homogeneous clients, several local epochs save bandwidth with little pain. On heavily non-IID data, the same choice of $E$ can destabilize training or degrade the eventual model, which is why FL systems almost always tune $E$, the sampling rate, and drift corrections together.
Failure Mode
The exact constants and even the sharp dependence on $E$ vary with the assumptions: convex vs nonconvex objectives, partial participation, bounded dissimilarity models, and whether the algorithm uses correction terms. The point of the proposition is not the exact prefactor; it is the mechanism that larger local training amplifies heterogeneity-induced bias.
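To make the mechanism concrete, here is a toy closed-form calculation (two 1-D quadratic clients, full participation, equal weights; all numbers are illustrative) showing how the fixed point of FedAvg with $E$ local gradient steps drifts away from the true optimum as $E$ grows:

```python
import numpy as np

# Two 1-D quadratic clients with different optima: F_k(w) = 0.5 * a_k * (w - b_k)^2.
a = np.array([4.0, 1.0])             # curvatures
b = np.array([0.0, 10.0])            # local optima
true_opt = (a * b).sum() / a.sum()   # minimizer of the average objective: 2.0
lr = 0.1

def fedavg_fixed_point(E):
    # After E local GD steps, client k maps w -> b_k + c_k**E * (w - b_k),
    # where c_k = 1 - lr * a_k.
    shrink = (1 - lr * a) ** E
    # Solve w = mean_k [ b_k + shrink_k * (w - b_k) ] for the fixed point.
    return ((1 - shrink) * b).mean() / (1 - shrink).mean()

for E in [1, 5, 20, 100]:
    print(E, fedavg_fixed_point(E))
# E = 1 recovers the true optimum (2.0); as E grows the fixed point
# drifts toward the unweighted average of client optima (5.0).
```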
Communication Efficiency
Communication is the bottleneck in federated learning. Sending full model updates over mobile networks is slow and expensive.
Gradient compression: quantize gradients to lower precision (e.g., 1-bit SGD) or sparsify them (send only the top-$k$ gradient entries). SignSGD sends only the sign of each gradient entry.
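A sketch of the two compression primitives just named (pure numpy, illustrative function names; practical systems typically pair these with error feedback to preserve convergence):

```python
import numpy as np

def top_k_sparsify(grad, k):
    """Keep only the k largest-magnitude entries; zero the rest."""
    out = np.zeros_like(grad)
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    out[idx] = grad[idx]
    return out

def sign_compress(grad):
    """signSGD-style 1-bit compression: transmit only each entry's sign."""
    return np.sign(grad)
```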
FedAvg itself is a communication reduction: by doing $E$ local epochs per round, you reduce the number of communication rounds by roughly a factor of $E$ compared to distributed SGD.
Model distillation: instead of sending full model updates, clients send logits or soft labels on a shared public dataset. This compresses the communication to $O(mc)$ (logits for $m$ public examples over $c$ classes) instead of $O(d)$ (the $d$-dimensional parameter vector).
Secure Aggregation and Privacy
Secure Aggregation
Secure aggregation is a cryptographic protocol that lets the server learn only an aggregate such as $\sum_{k \in S_t} \Delta_k$, not the individual client updates themselves. Bonawitz et al. (2017) is the canonical cross-device construction.
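A toy illustration of the pairwise-masking idea at the heart of that construction; the real protocol adds key agreement, secret sharing, and dropout recovery, none of which is modeled here:

```python
import numpy as np

rng = np.random.default_rng(42)
K, d = 3, 4
updates = [rng.normal(size=d) for _ in range(K)]   # true client updates

# Each unordered pair {i, j} shares a random mask r_ij; client i adds it
# and client j subtracts it, so all masks cancel in the server-side sum.
masks = {(i, j): rng.normal(size=d) for i in range(K) for j in range(i + 1, K)}

masked = []
for k in range(K):
    m = updates[k].copy()
    for (i, j), r in masks.items():
        if k == i:
            m += r
        elif k == j:
            m -= r
    masked.append(m)

# Each masked vector looks random on its own, but the sum is exact.
assert np.allclose(np.sum(masked, axis=0), np.sum(updates, axis=0))
```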
Secure aggregation and differential privacy solve different problems:
- Secure aggregation hides individual updates from the server and other clients during a round.
- Differential privacy bounds how much the aggregate output changes when a client (or one example) is added or removed.
Secure aggregation protects the transport layer of training; differential privacy protects against inference from the released aggregate or final model.
Federated Learning + Differential Privacy
Federated learning does not guarantee privacy by itself. Model updates can leak information about the training data (via gradient inversion attacks). Adding differential privacy provides formal guarantees.
In cross-device FL, the most common guarantee is user-level DP: neighboring datasets differ by the presence or absence of one client's entire local dataset. User-level DP is the stronger notion because one neighboring change can add or remove an entire client's data, not a single record. It is therefore usually stricter and more expensive than example-level DP, which only protects a single training example.
One standard user-level DP pattern is: clip each client's update $\Delta_k$ to norm $S$, aggregate the clipped updates, then add Gaussian noise calibrated to the aggregate sensitivity:

$$\tilde{w}_{t+1} = \frac{1}{|S_t|} \left( \sum_{k \in S_t} \text{clip}(\Delta_k, S) + \mathcal{N}\!\left(0, \sigma^2 S^2 I\right) \right), \qquad \text{clip}(\Delta, S) = \Delta \cdot \min\!\left(1, \frac{S}{\|\Delta\|}\right)$$

In deployed systems, the noise can be added centrally or in a distributed way on top of secure aggregation. The resulting mechanism provides $(\varepsilon, \delta)$-DP with respect to the chosen neighboring relation (typically one user / one client).
The cost: added noise reduces convergence speed. More privacy (smaller $\varepsilon$) means more noise and slower convergence.
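A minimal sketch of the clip-then-noise pattern, with illustrative names; a real deployment would track the cumulative $(\varepsilon, \delta)$ with a privacy accountant rather than hard-coding a noise multiplier:

```python
import numpy as np

rng = np.random.default_rng(7)

def clip_update(delta, clip_norm):
    """Scale delta so its L2 norm is at most clip_norm."""
    return delta * min(1.0, clip_norm / max(np.linalg.norm(delta), 1e-12))

def dp_fedavg_aggregate(client_updates, clip_norm, noise_multiplier):
    """Clip each client update, sum, add Gaussian noise scaled to the
    per-client sensitivity (clip_norm), then average."""
    clipped = [clip_update(d, clip_norm) for d in client_updates]
    total = np.sum(clipped, axis=0)
    total += rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return total / len(client_updates)
```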
Common Confusions
Federated learning does not guarantee privacy
Raw model updates can reveal training data. Gradient inversion attacks can reconstruct input images from gradients with high fidelity. Federated learning must be combined with differential privacy, secure aggregation, or both to provide meaningful privacy guarantees.
Secure aggregation is not the same thing as differential privacy
Secure aggregation stops the server from reading individual updates in a round, but it does not by itself bound what can be learned from the aggregate or the final trained model. Differential privacy gives a formal participation guarantee; secure aggregation gives a cryptographic visibility guarantee. In practice, many systems want both.
FedAvg is not the same as distributed SGD
In distributed SGD, workers compute gradients on different data shards and average them every step. In FedAvg, clients run multiple local SGD steps before averaging. This distinction matters: FedAvg introduces client drift, which distributed SGD does not have (because it averages every step).
Cross-device and cross-silo federated learning are different regimes
Cross-device FL has massive client populations, partial participation, device dropout, and user-level privacy concerns. Cross-silo FL has far fewer parties, larger local datasets, and often stronger legal or contractual governance. Calling both simply "federated learning" is fine at a high level, but system design choices that make sense in one regime can be wrong in the other.
Canonical Examples
Mobile keyboard next-word prediction
Google's Gboard uses federated learning for next-word prediction. Each phone trains locally on the user's typing data. Model updates are sent to a server during idle charging time. The server averages updates across millions of devices. No individual keystrokes leave the phone. The model improves globally while each user's data remains local.
Summary
- Federated learning sends the model to the data, not data to the model
- FedAvg: selected clients do local optimization, then the server averages their updates weighted by participating data size
- In the homogeneous limit, FedAvg reduces to ordinary optimization; the real difficulty is heterogeneity
- Non-IID data creates client drift, and more local steps amplify that mismatch
- Communication efficiency comes from local computation, gradient compression, or distillation
- Privacy requires additional mechanisms: secure aggregation, differential privacy, or both
- Non-IID data is the primary open challenge
Exercises
Problem
You have 100 clients, each with 1000 examples. You run FedAvg with $C = 0.1$ (selecting 10% of clients per round) and $E$ local epochs with local batch size $B$. How many total gradient steps are performed across all clients in one communication round, in terms of $E$ and $B$?
Problem
Client A has data that is entirely class 0. Client B has data that is entirely class 1. Both start from the same initial model. After one round of FedAvg with $E$ local epochs, the averaged model performs poorly on both classes. Explain why, and propose a modification to FedAvg that would help.
References
Canonical:
- McMahan, Moore, Ramage, Hampson, and Aguera y Arcas, Communication-Efficient Learning of Deep Networks from Decentralized Data (AISTATS 2017). The FedAvg paper.
- Bonawitz et al., Practical Secure Aggregation for Privacy-Preserving Machine Learning (CCS 2017). The canonical secure-aggregation reference for cross-device FL.
- Kairouz et al., Advances and Open Problems in Federated Learning (Foundations and Trends in Machine Learning 14, 2021), Sections 1-4. The standard survey reference.
Current:
- Li, Sahu, Talwalkar, and Smith, Federated Optimization in Heterogeneous Networks (MLSys 2020). FedProx and the bounded-dissimilarity framing.
- Karimireddy et al., SCAFFOLD: Stochastic Controlled Averaging for Federated Learning (ICML 2020). Control variates for client-drift correction.
- McMahan et al., Learning Differentially Private Recurrent Language Models (ICLR 2018). Early user-level DP in a federated-style mobile setting.
Next Topics
- Differential Privacy: formal privacy guarantees layered on top of federated optimization.
- Distributed Training Theory: the broader communication / optimization toolkit that FL draws on and departs from.
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Optimizer Theory: SGD, Adam, and Muon (layer 3 · tier 1)
Derived topics
- Differential Privacy (layer 3 · tier 2)
- Distributed Training Theory (layer 5 · tier 3)