Modern Generalization
PAC-Bayes Bounds
PAC-Bayes bounds control the generalization gap of a stochastic predictor by the KL divergence between a learner-chosen posterior and a data-independent prior. They have produced some of the few non-vacuous generalization bounds reported for trained neural networks (Dziugaite-Roy 2017 onward); how tight the bound gets depends heavily on the choice of prior.
Why This Matters
Classical generalization bounds such as VC dimension and Rademacher complexity ask how rich the entire hypothesis class is. For modern neural networks, that question is often too coarse. The class is expressive enough to memorize random labels, so the resulting bounds are typically vacuous: they predict generalization error larger than 1 even when the trained network performs well.
PAC-Bayes asks a sharper question: after training, how far did the learner move from a reference distribution chosen before seeing the data? The answer is measured by the KL divergence $\mathrm{KL}(Q \| P)$ between a prior $P$ and a posterior $Q$. If sampled predictors from $Q$ fit the training data and $Q$ is not too far from $P$, the test risk is small.
This is why PAC-Bayes matters so much in modern ML theory. It is one of the few general frameworks that:
- explains why parameter count alone is the wrong complexity measure for deep networks
- connects rigorously to compression, noise robustness, and flatness-style intuitions
- has produced genuinely non-vacuous certificates for neural networks
Mental Model
Imagine writing down your learned predictor in two stages.
- First, announce a reference distribution $P$ before seeing data. This might encode random initialization, a pretraining checkpoint, a symmetry-respecting Gaussian prior, or some other inductive bias that is fixed in advance.
- After seeing the data, choose a posterior $Q$ that concentrates on predictors with low training error.
PAC-Bayes says the test error of the stochastic predictor is controlled by two quantities:
- how well predictors sampled from $Q$ fit the training data
- how many extra bits are needed to describe $Q$ relative to $P$
That second quantity is exactly the KL term. So PAC-Bayes is best read as a code-length bound on learned solutions. You are not paying for the raw size of the network. You are paying for how much information the training data forced you to add on top of the prior.
Formal Setup
Let $\mathcal{H}$ be a hypothesis space. Let $P$ be a prior distribution over $\mathcal{H}$, chosen before seeing data. Let $Q$ be a posterior distribution over $\mathcal{H}$, which may depend on the training data $S = \{(x_i, y_i)\}_{i=1}^n$ drawn i.i.d. from a distribution $\mathcal{D}$.
Stochastic Classifier
A stochastic classifier draws $h \sim Q$ and predicts using $h$. The population risk of $Q$ is:
$$L(Q) = \mathbb{E}_{h \sim Q}\,\mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y)]$$
The empirical risk of $Q$ is:
$$\hat{L}(Q) = \mathbb{E}_{h \sim Q}\left[\frac{1}{n} \sum_{i=1}^n \ell(h(x_i), y_i)\right]$$
KL Divergence
The Kullback-Leibler divergence from $Q$ to $P$ is:
$$\mathrm{KL}(Q \| P) = \mathbb{E}_{h \sim Q}\left[\ln \frac{Q(h)}{P(h)}\right]$$
This measures how much $Q$ diverges from the prior $P$. It is zero when $Q = P$ and infinite when $Q$ has support outside the support of $P$.
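In the finite setting the definition is a short computation. A minimal sketch assuming numpy (the function name is illustrative):

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p) for finite PMFs; requires p > 0 wherever q > 0."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0  # 0 * log(0/p) = 0 by the usual convention
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

q = np.array([0.7, 0.2, 0.1])
p = np.array([1 / 3, 1 / 3, 1 / 3])
print(kl_divergence(q, q))  # 0.0: KL vanishes when the distributions match
print(kl_divergence(q, p))  # positive: Gibbs' inequality
```

Note the asymmetry: $\mathrm{KL}(Q\|P)$ and $\mathrm{KL}(P\|Q)$ generally differ, and only the first appears in PAC-Bayes bounds.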
Lean-verified finite PAC-Bayes layer
The current Lean artifact verifies the finite-distribution layer used inside PAC-Bayes proofs: KL nonnegativity, Donsker-Varadhan, posterior averages, and a posterior-gap change-of-measure adapter for finite PMFs. It also verifies the positive-$\lambda$ rearrangement that turns a bounded prior log-moment into the canonical posterior-risk form, the finite product-sample MGF bridge for bounded losses, a finite $[0,1]$ bounded-loss Catoni-style confidence bound, the fixed-budget McAllester-style square-root corollary, and a finite-grid McAllester-style peeling bound that supports posterior-dependent penalties through an explicit grid-cover certificate. The scope is finite throughout: finite data domain, finite hypothesis index, finite prior and posterior PMFs, positive confidence allocations, finite grids, and scalar losses in $[0,1]$.
Finite-PMF KL Nonnegativity (Lean-verified)
Statement
For finite distributions $Q$ and $P$ on the same finite index set, with $P(i) > 0$ for every $i$:
$$\mathrm{KL}(Q \| P) \ge 0$$
The Lean theorem is TheoremPath.LearningTheory.PACBayesKL.klDiv_nonneg.
Intuition
This is Gibbs' inequality in the finite setting. It is the basic reason the PAC-Bayes complexity penalty is never negative when the posterior is absolutely continuous with respect to the prior.
Proof Sketch
The Lean proof sums the per-coordinate inequality $Q(i) \ln \frac{Q(i)}{P(i)} \ge Q(i) - P(i)$, then uses $\sum_i Q(i) = \sum_i P(i) = 1$.
Failure Mode
This verifies the finite-PMF, full-support-prior setting. Sample-dependent PAC-Bayes bounds and measure-theoretic posteriors on continuous hypothesis spaces are separate extension targets.
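The summation step can be written out in one line; this is the standard textbook derivation, not a transcription of the Lean proof. It uses $\ln x \ge 1 - 1/x$:

```latex
\sum_i Q(i) \ln\frac{Q(i)}{P(i)}
  \;\ge\; \sum_i Q(i)\left(1 - \frac{P(i)}{Q(i)}\right)
  \;=\; \sum_i Q(i) - \sum_i P(i)
  \;=\; 1 - 1 \;=\; 0.
```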
Donsker-Varadhan Finite Change-of-Measure Inequality (Lean-verified)
Statement
For any finite posterior $Q$, full-support prior $P$, and function $f$ on the index set:
$$\mathbb{E}_{i \sim Q}[f(i)] \le \mathrm{KL}(Q \| P) + \ln \mathbb{E}_{i \sim P}\left[e^{f(i)}\right]$$
The Lean theorem is TheoremPath.LearningTheory.PACBayesKL.donsker_varadhan.
Intuition
This is the finite change-of-measure step used in PAC-Bayes proofs: it lets you replace an expectation under the learned posterior by a KL penalty plus a log-moment term under the prior.
Proof Sketch
The Lean proof defines the Gibbs posterior $G(i) \propto P(i)\, e^{f(i)}$, applies KL nonnegativity to $\mathrm{KL}(Q \| G)$, and rearranges the resulting identity.
Failure Mode
This is the variational inequality layer. The finite bounded-loss confidence bound below adds the iid finite-product MGF and confidence step; continuous posteriors and infinite hypothesis spaces remain separate targets.
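The inequality is easy to sanity-check numerically on random finite PMFs; a sketch assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(q, p):
    """KL(q || p) for strictly positive finite PMFs."""
    return float(np.sum(q * np.log(q / p)))

for _ in range(1000):
    k = 5
    p = rng.dirichlet(np.ones(k))   # full-support prior
    q = rng.dirichlet(np.ones(k))   # arbitrary posterior
    f = rng.normal(size=k)          # arbitrary score function
    lhs = float(q @ f)              # E_{i~Q}[f(i)]
    rhs = kl(q, p) + np.log(float(p @ np.exp(f)))
    assert lhs <= rhs + 1e-12       # Donsker-Varadhan change of measure
print("Donsker-Varadhan held on 1000 random instances")
```

Equality is attained exactly when $Q$ is the Gibbs posterior $G(i) \propto P(i)e^{f(i)}$, which is the variational content of the formula.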
Finite Posterior-Gap Change-of-Measure Adapter (Lean-verified)
Statement
For any finite posterior $Q$, full-support prior $P$, real parameter $\lambda$, and finite risk vectors $L$ and $\hat{L}$:
$$\lambda \, \mathbb{E}_{i \sim Q}[L_i - \hat{L}_i] \le \mathrm{KL}(Q \| P) + \ln \mathbb{E}_{i \sim P}\left[e^{\lambda (L_i - \hat{L}_i)}\right]$$
The Lean theorem is TheoremPath.LearningTheory.PACBayesKL.posterior_generalization_gap_change_of_measure.
Intuition
This is the finite posterior-average form of Donsker-Varadhan applied to the pointwise generalization gap. It is the bridge from a posterior-weighted gap to a KL penalty plus a prior exponential moment.
Proof Sketch
The Lean proof first rewrites $\lambda \, \mathbb{E}_{i \sim Q}[L_i - \hat{L}_i]$ as $\mathbb{E}_{i \sim Q}[\lambda (L_i - \hat{L}_i)]$, then applies the finite Donsker-Varadhan inequality with $f(i) = \lambda (L_i - \hat{L}_i)$.
Failure Mode
This is the deterministic finite adapter. The finite bounded-loss confidence bound below adds the iid sample distribution, exponential-moment bound, and confidence parameter under [0,1] losses.
Finite Posterior-Risk Lambda Rearrangement (Lean-verified)
Statement
For any finite posterior $Q$, full-support prior $P$, positive $\lambda$, finite risk vectors $L$ and $\hat{L}$, and log-moment bound $B$ satisfying
$$\ln \mathbb{E}_{i \sim P}\left[e^{\lambda (L_i - \hat{L}_i)}\right] \le B,$$
the posterior risk obeys
$$\mathbb{E}_{i \sim Q}[L_i] \le \mathbb{E}_{i \sim Q}[\hat{L}_i] + \frac{\mathrm{KL}(Q \| P) + B}{\lambda}.$$
The Lean theorem is TheoremPath.LearningTheory.PACBayesKL.posterior_risk_le_empiricalRisk_plus_of_log_moment_bound.
Intuition
This is the algebraic PAC-Bayes shape after change-of-measure but before probability enters. It says that once a separate iid-sample argument controls the prior exponential moment, every posterior pays only KL plus that log-moment bound, scaled by $1/\lambda$.
Proof Sketch
The Lean proof divides the finite posterior-gap change-of-measure inequality by the positive $\lambda$, substitutes the external log-moment bound, and rearranges into posterior-risk form.
Failure Mode
This is still deterministic and finite. The finite bounded-loss confidence bound below instantiates the log-moment bound $B$ for finite iid samples and $[0,1]$ losses; the finite-grid square-root theorem adds a confidence-grid layer for posterior-dependent penalties. Measure-theoretic PAC-Bayes remains a separate target.
Finite Prior-Moment Markov Confidence Adapter (Lean-verified)
Statement
For a finite sample distribution $\mu$, prior $P$, parameter $\lambda$, finite risks $L$, sample-dependent empirical risks $\hat{L}_s$, and constants $C \ge 0$ and $\delta > 0$, if
$$\mathbb{E}_{s \sim \mu}\left[\mathbb{E}_{i \sim P}\, e^{\lambda (L_i - \hat{L}_s(i))}\right] \le C,$$
then the finite sample mass of outcomes where the prior moment exceeds $C/\delta$ is at most $\delta$:
$$\mu\left\{ s : \mathbb{E}_{i \sim P}\, e^{\lambda (L_i - \hat{L}_s(i))} > \frac{C}{\delta} \right\} \le \delta.$$
The Lean theorem is TheoremPath.LearningTheory.PACBayesKL.priorExpMoment_tailMass_le_delta_of_expected_bound.
Intuition
This is the Markov step that turns an expected prior exponential-moment bound into a high-probability statement. It is useful because PAC-Bayes sample bounds need exactly this kind of event before applying the deterministic posterior-risk adapter.
Proof Sketch
The Lean proof defines the finite prior exponential moment for each sample outcome, proves it is nonnegative under a nonnegative prior, applies a finite weighted Markov inequality at the threshold $C/\delta$, and simplifies $\delta \cdot (C/\delta)$ to $C$.
Failure Mode
This is finite-support Markov only. The finite bounded-loss confidence theorem below supplies one concrete iid product-sample instantiation for [0,1] losses; it is not a continuous-posterior or infinite-class PAC-Bayes theorem.
Finite Bounded-Loss Catoni-Style PAC-Bayes Bound (Lean-verified)
Statement
For a finite data distribution $\mu$, full-support finite prior $P$, positive sample size $n$, positive $\lambda$, positive $\delta$, and losses in $[0,1]$, the product-sample mass of samples $s$ for which there exists a finite posterior $Q$ violating
$$\mathbb{E}_{i \sim Q}[L_i] \le \mathbb{E}_{i \sim Q}[\hat{L}_s(i)] + \frac{\mathrm{KL}(Q \| P) + \ln(1/\delta)}{\lambda} + \frac{\lambda}{8n}$$
is at most $\delta$.
The Lean theorem is TheoremPath.LearningTheory.PACBayesBoundedLoss.finiteCatoni_badEventMass_le_delta.
Intuition
This is an end-to-end finite PAC-Bayes confidence statement in the Lean stack. Hoeffding controls the one-coordinate bounded-loss MGF, the finite product bridge lifts it to an iid sample average, Markov turns the prior exponential moment into a confidence event, and Donsker-Varadhan transfers that event from the prior to every finite posterior.
Proof Sketch
The Lean proof combines the one-coordinate bounded-loss MGF, the finite product-sample MGF factorization, prior averaging, a finite weighted Markov step, and the deterministic posterior-risk adapter. The final statement bounds the mass of bad samples by showing every sample that violates the posterior-risk inequality must also lie in the prior-moment tail event.
Failure Mode
This is finite, scalar-valued, and fixed-$\lambda$. The square-root corollaries below optimize $\lambda$ through fixed budgets and finite grids; continuous posteriors and infinite hypothesis spaces remain separate targets.
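Because everything is finite, the bad-event mass can be checked by exact enumeration rather than simulation. The sketch below builds a tiny illustrative instance (two hypotheses, two data points; none of these numbers come from the Lean artifact) and uses the Donsker-Varadhan equality, under which a sample $s$ is bad iff $\ln \mathbb{E}_{h \sim P}\, e^{\lambda(L(h) - \hat{L}_s(h))} > \ln(1/\delta) + \lambda^2/(8n)$:

```python
import itertools
import math

# Tiny finite instance (illustrative): data domain {0, 1}, two hypotheses.
n = 4                      # sample size
lam = 2.0                  # Catoni parameter lambda > 0
delta = 0.2                # confidence parameter
mu = {0: 0.5, 1: 0.5}      # data distribution on {0, 1}
prior = [0.5, 0.5]         # full-support prior over the two hypotheses
loss = [[0.0, 1.0],        # loss[h][x] in [0, 1]
        [1.0, 0.0]]

# Population risk of each hypothesis under mu.
L = [sum(mu[x] * loss[h][x] for x in mu) for h in range(2)]

threshold = math.log(1 / delta) + lam**2 / (8 * n)
bad_mass = 0.0
for s in itertools.product(mu, repeat=n):       # enumerate all product samples
    weight = math.prod(mu[x] for x in s)        # exact product-sample mass
    Lhat = [sum(loss[h][x] for x in s) / n for h in range(2)]
    # Donsker-Varadhan equality: sup_Q [lam*E_Q(L - Lhat) - KL(Q||P)]
    log_moment = math.log(sum(p * math.exp(lam * (L[h] - Lhat[h]))
                              for h, p in enumerate(prior)))
    if log_moment > threshold:                  # some posterior violates Catoni
        bad_mass += weight
print(bad_mass, "<=", delta)                    # theorem: bad_mass <= delta
```

The reduction to the prior log-moment works because the supremum over posteriors of $\lambda\,\mathbb{E}_Q[L - \hat{L}_s] - \mathrm{KL}(Q\|P)$ is exactly that log-moment, so checking one scalar per sample checks all posteriors at once.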
Finite Fixed-Budget McAllester-Style PAC-Bayes Bound (Lean-verified)
Statement
For a finite data distribution $\mu$, full-support finite prior $P$, positive sample size $n$, positive confidence parameter $\delta$, positive complexity budget $B$, and losses in $[0,1]$, the product-sample mass of samples $s$ for which there exists a finite posterior $Q$ with
$$\mathrm{KL}(Q \| P) + \ln(1/\delta) \le B$$
but violating
$$\mathbb{E}_{i \sim Q}[L_i] \le \mathbb{E}_{i \sim Q}[\hat{L}_s(i)] + \sqrt{\frac{B}{2n}}$$
is at most $\delta$.
The Lean theorem is TheoremPath.LearningTheory.PACBayesBoundedLoss.finiteMcAllesterBoundedComplexity_badEventMass_le_delta.
Intuition
The fixed-$\lambda$ Catoni theorem has a penalty $\frac{\mathrm{KL}(Q \| P) + \ln(1/\delta)}{\lambda} + \frac{\lambda}{8n}$; once a posterior's KL-plus-confidence term is bounded by $B$, choosing $\lambda = \sqrt{8nB}$ turns that expression into the familiar square-root penalty $\sqrt{B/(2n)}$.
Proof Sketch
The Lean proof first proves the algebraic optimizer identity for the fixed budget, then applies the deterministic posterior-risk adapter at $\lambda = \sqrt{8nB}$. The bad-event theorem reduces any square-root violation to the same prior-moment tail event already controlled by the finite bounded-loss Catoni layer.
Failure Mode
This is a fixed-budget finite corollary. The finite-grid theorem below removes the single-budget restriction using finitely many confidence buckets, but continuous posteriors, infinite hypothesis spaces, and all-real-$\lambda$ optimization remain separate targets.
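The optimizer identity behind the square root is a one-line AM-GM (or calculus) fact for the budget $B$:

```latex
\min_{\lambda > 0}\left(\frac{B}{\lambda} + \frac{\lambda}{8n}\right)
  \;=\; 2\sqrt{\frac{B}{8n}}
  \;=\; \sqrt{\frac{B}{2n}},
\qquad \text{attained at } \lambda^\star = \sqrt{8nB}.
```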
Finite-Grid Optimized McAllester-Style PAC-Bayes Bound (Lean-verified)
Statement
For a finite data distribution $\mu$, full-support finite prior $P$, positive sample size $n$, losses in $[0,1]$, and a finite grid of positive complexity budgets $B_1, \dots, B_k$ with confidence allocations $\delta_1, \dots, \delta_k$, suppose $\sum_{j=1}^k \delta_j \le \delta$. If a posterior-dependent penalty $\mathrm{pen}(Q)$ is covered by the grid in the sense that every finite posterior $Q$ has some bucket $j$ satisfying
$$\mathrm{KL}(Q \| P) + \ln(1/\delta_j) \le B_j \quad \text{and} \quad \sqrt{\frac{B_j}{2n}} \le \mathrm{pen}(Q),$$
then the product-sample mass of samples $s$ for which there exists a finite posterior $Q$ violating
$$\mathbb{E}_{i \sim Q}[L_i] \le \mathbb{E}_{i \sim Q}[\hat{L}_s(i)] + \mathrm{pen}(Q)$$
is at most $\delta$.
The Lean theorem is TheoremPath.LearningTheory.PACBayesBoundedLoss.finiteMcAllesterGridOptimized_badEventMass_le_delta.
Intuition
The fixed-budget theorem controls one square-root bucket. The grid theorem gives each bucket its own confidence allocation and then applies a finite union bound. The optimized wrapper says that if every posterior can be assigned to a bucket whose square-root penalty is no larger than the penalty you want to use, the same high-probability event controls all finite posteriors at once.
Proof Sketch
The Lean proof defines the finite bad-sample set for each bucket, applies the fixed-budget McAllester theorem to each bucket, sums the bucket masses with the finite weighted union-bound lemma, and then uses the grid-cover certificate to move from bucket-specific violations to the posterior-dependent penalty event.
Failure Mode
This is a finite-grid theorem. It does not quantify over all real $\lambda$, continuous posterior families, infinite hypothesis classes, or arbitrary measurable hypothesis spaces.
What The Bound Is Charging You For
The canonical PAC-Bayes structure looks like:
$$L(Q) \le \hat{L}(Q) + \mathrm{penalty}\big(\mathrm{KL}(Q \| P),\, n,\, \delta\big)$$
So there are only three real levers:
- Empirical stochastic risk. If samples from $Q$ already make mistakes on the training set, the bound starts from a bad baseline.
- Relative information cost. If $Q$ must be extremely concentrated or far from $P$, the KL term explodes.
- Sample size. PAC-Bayes still needs data; the KL term only becomes harmless when it is small relative to $n$.
This decomposition is why the framework is so useful. It makes the failure mode visible. When a bound is vacuous, you can ask whether the issue is the empirical loss, the prior, the posterior width, or simply insufficient sample size.
Main Theorems
McAllester PAC-Bayes Bound
Statement
For any prior $P$ over $\mathcal{H}$ chosen independently of $S$, for any $\delta \in (0,1)$, with probability at least $1 - \delta$ over the draw of $S$, simultaneously for all posteriors $Q$:
$$L(Q) \le \hat{L}(Q) + \sqrt{\frac{\mathrm{KL}(Q \| P) + \ln(2\sqrt{n}/\delta)}{2n}}$$
Intuition
The bound has two terms. The first is the training performance of the stochastic classifier. The second is a complexity penalty that grows with $\mathrm{KL}(Q \| P)$. If the posterior concentrates near the prior, the penalty is small. The bound holds simultaneously for all posteriors $Q$, so you can optimize $Q$ after seeing data without invalidating the guarantee.
Proof Sketch
The proof uses a change-of-measure argument. Start from the moment generating function: for any fixed $h$, $\mathbb{E}_S\left[e^{\lambda (L(h) - \hat{L}_S(h))}\right] \le e^{\lambda^2/(8n)}$ by Hoeffding's lemma. Then use the Donsker-Varadhan variational formula: $\mathbb{E}_{h \sim Q}[f(h)] \le \mathrm{KL}(Q \| P) + \ln \mathbb{E}_{h \sim P}\left[e^{f(h)}\right]$ for any measurable $f$. Apply this with $f(h) = \lambda (L(h) - \hat{L}_S(h))$, take expectations over $S$, and apply Markov's inequality.
Why It Matters
Unlike VC and Rademacher bounds, the PAC-Bayes bound depends on the relationship between the learned solution and a reference prior, not on the raw capacity of the hypothesis class. For a network with 10 million parameters, if the learned weights are close to initialization (a natural prior), the KL term can be small enough to give non-vacuous bounds.
Failure Mode
The bound is only useful when $\mathrm{KL}(Q \| P)$ is much smaller than $n$. If the prior is poorly chosen (far from the learned solution), the KL term dominates and the bound becomes vacuous. In practice, choosing the prior is the main difficulty. The bound also requires bounded loss; for unbounded losses, modified versions with sub-Gaussian assumptions are needed.
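As a numerical sketch of the statement above (the KL values below are assumed for illustration, not taken from any paper), the bound is a one-line computation once the KL term is known:

```python
import math

def mcallester_bound(emp_risk, kl, n, delta):
    """McAllester-style bound: emp_risk + sqrt((KL + ln(2*sqrt(n)/delta)) / (2n))."""
    penalty = math.sqrt((kl + math.log(2 * math.sqrt(n) / delta)) / (2 * n))
    return emp_risk + penalty

n, delta = 60_000, 0.05
for kl in (100.0, 5_000.0, 200_000.0):   # assumed KL values for illustration
    b = mcallester_bound(0.02, kl, n, delta)
    print(f"KL = {kl:>9.0f}  ->  bound = {b:.3f}"
          f"  ({'vacuous' if b >= 1 else 'non-vacuous'})")
```

The transition is stark: the same empirical risk yields a tight certificate or a vacuous one depending only on whether the KL term is small relative to $n$.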
Catoni PAC-Bayes Bound
Statement
For any prior $P$, any $\lambda > 0$, and any $\delta \in (0,1)$, with probability at least $1 - \delta$ over $S$, for all posteriors $Q$:
$$L(Q) \le \hat{L}(Q) + \frac{\mathrm{KL}(Q \| P) + \ln(1/\delta)}{\lambda} + \frac{\lambda}{8n}$$
Intuition
Catoni's bound is tighter than McAllester's because it avoids the square root. The parameter $\lambda$ can be optimized. For small empirical risk and small KL, Catoni gives substantially better numerical bounds.
Proof Sketch
Uses the exponential moment method directly. For bounded losses, the cumulant generating function satisfies $\ln \mathbb{E}_S\left[e^{\lambda (L(h) - \hat{L}_S(h))}\right] \le \frac{\lambda^2}{8n}$. Combine with Donsker-Varadhan and take expectations over $S$.
Why It Matters
In the race to produce non-vacuous bounds for deep networks, Catoni's bound consistently produces tighter numerical values than McAllester's. Dziugaite and Roy (2017) used optimized Catoni bounds to get the first non-vacuous generalization bound for a neural network on MNIST.
Failure Mode
Same prior-dependence issue as McAllester. The parameter $\lambda$ must be chosen or optimized, adding another degree of freedom. For very small training error, the bound approaches the KL term divided by $\lambda$, which is the irreducible cost of choosing the posterior.
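The two penalties can be compared directly under assumed values ($n$, $\delta$, and the KL below are illustrative). Note that $\lambda$ here is optimized for this single KL budget; a statement uniform over posteriors costs a grid union bound, as in the finite-grid theorem above:

```python
import math

n, delta, kl = 60_000, 0.05, 1_000.0

# McAllester penalty: sqrt((KL + ln(2*sqrt(n)/delta)) / (2n)).
mcallester = math.sqrt((kl + math.log(2 * math.sqrt(n) / delta)) / (2 * n))

# Catoni penalty at the optimal lambda = sqrt(8n * (KL + ln(1/delta))):
# (KL + ln(1/delta))/lam + lam/(8n) collapses to sqrt((KL + ln(1/delta))/(2n)).
B = kl + math.log(1 / delta)
lam = math.sqrt(8 * n * B)
catoni = B / lam + lam / (8 * n)

print(f"McAllester penalty: {mcallester:.4f}")
print(f"Catoni penalty    : {catoni:.4f}")  # smaller log factor in the numerator
```

At this scale the gap is modest; Catoni's advantage grows in the small-empirical-risk regime, where refined variants replace the additive slack with a sharper divergence-based comparison.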
Why PAC-Bayes Is Competitive for Deep Learning
The key insight is that overparameterized networks may have enormous raw capacity while still landing in solutions with modest effective complexity relative to a sensible prior. PAC-Bayes captures that through several complementary mechanisms.
Compression view. If $Q$ is a Gaussian around the learned weights and $P$ is a broader Gaussian around initialization, then $\mathrm{KL}(Q \| P)$ acts like the number of bits needed to describe the solution relative to initialization. This is why PAC-Bayes and compression-based generalization often feel like two views of the same story.
Noise robustness. If the predictor keeps working when you perturb the weights, then you can choose a broader posterior $Q$. Broader posteriors are cheaper in KL, so noise-robust solutions are easier to certify. This is the rigorous core behind the "flat minima" intuition, though PAC-Bayes is more careful than naive flatness metrics.
Reference-relative complexity. A network with ten million parameters can still have a small PAC-Bayes complexity if it stays close to a good prior. Parameter count by itself is not the operative quantity.
Data-dependent priors. If you use one data split to build the prior and a disjoint split to certify the posterior, the prior can sit much closer to the eventual solution. Ambroladze, Parrado-Hernandez, and Shawe-Taylor (2007) formalized this idea.
What Practitioners Actually Optimize
In modern PAC-Bayes work, the goal is usually not to apply the theorem after training as an afterthought. Instead, people optimize a surrogate objective that balances:
- low empirical risk under posterior noise
- low KL to the chosen prior
For Gaussian posteriors over network weights, this means learning both a center and a scale. The center wants to fit the data. The scale wants to stay as broad as possible without destroying the empirical risk. That tradeoff is exactly why the bound is interesting: it turns "robust to perturbations" into a quantitative certificate.
This is also why Catoni-style bounds matter in practice. Once you are numerically optimizing a certificate, constants and square roots stop being cosmetic. They can be the difference between vacuous and non-vacuous.
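A minimal sketch of that surrogate objective for a diagonal-Gaussian posterior, assuming numpy (function names and the toy risk are illustrative; real pipelines use a differentiable surrogate loss and reparameterized gradients):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_diag_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """KL between diagonal Gaussians N(mu_q, diag(sigma_q^2)) and N(mu_p, diag(sigma_p^2))."""
    return 0.5 * np.sum((sigma_q**2 + (mu_q - mu_p)**2) / sigma_p**2
                        - 1.0 + 2.0 * np.log(sigma_p / sigma_q))

def pac_bayes_objective(center, scale, prior_center, prior_scale,
                        empirical_risk, n, delta, samples=8):
    """Empirical stochastic risk under posterior noise plus the KL penalty."""
    risks = [empirical_risk(center + scale * rng.normal(size=center.shape))
             for _ in range(samples)]
    kl = kl_diag_gauss(center, scale, prior_center, prior_scale)
    penalty = np.sqrt((kl + np.log(2 * np.sqrt(n) / delta)) / (2 * n))
    return float(np.mean(risks) + penalty)

# Toy usage: "weights" fit a fixed target, risk is mean squared distance.
d, n = 10, 1_000
target = rng.normal(size=d)
risk = lambda w: float(np.mean((w - target) ** 2))
w0 = np.zeros(d)  # prior center: the initialization
obj = pac_bayes_objective(target, np.full(d, 0.1), w0, np.full(d, 1.0),
                          risk, n, delta=0.05)
print("surrogate objective:", obj)
```

Widening `scale` lowers the KL penalty but raises the noisy empirical risk; the optimum of that tradeoff is exactly the broad-but-accurate posterior the text describes.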
From Stochastic to Deterministic Predictors
The theorem directly certifies the Gibbs classifier, which samples a fresh $h \sim Q$ at prediction time. But most deployed networks are deterministic. So why does PAC-Bayes still matter?
There are two answers.
First, some papers certify the stochastic network itself. This is already meaningful: it shows that a noisy version of the trained predictor generalizes.
Second, if predictions are stable under posterior noise, then the deterministic center of $Q$ and the stochastic predictor often behave similarly on most examples. PAC-Bayes does not give that deterministic guarantee for free, but it provides the right quantitative object: a posterior that is both accurate and robust to perturbation. In binary classification there are also PAC-Bayesian majority-vote bounds, but the Gibbs view is the simplest entry point.
Comparison with Classical Bounds
| Bound type | Depends on | Deep net behavior |
|---|---|---|
| VC dimension | Number of parameters | Vacuous (billions of params) |
| Rademacher complexity | Norm constraints on weights | Often vacuous for practical norms |
| PAC-Bayes | KL(posterior, prior) | Non-vacuous with good prior |
| Algorithmic stability | Sensitivity to one sample | Bounded for SGD with early stopping |
PAC vs. PAC-Bayes: When Each One Wins
PAC-Bayes is not a replacement for classical PAC bounds. They occupy different regimes.
| Question | Classical PAC | PAC-Bayes |
|---|---|---|
| What does the bound depend on? | VC / Rademacher complexity of the whole class | KL between learned posterior and chosen prior |
| Sample complexity in $\epsilon$ | $O(d/\epsilon^2)$ with $d$ the VC dimension | $O(\mathrm{KL}(Q \| P)/\epsilon^2)$, up to log factors |
| Tight in worst case? | Yes (matching VC lower bounds) | No worst-case lower bound; depends on prior |
| Useful for overparameterized nets? | No (capacity penalty too large) | Yes (with a prior near the learned solution) |
| Requires data-independent prior? | No | Yes (or split data) |
| Sample complexity in $\delta$ at fixed $\epsilon$ | $O(\ln(1/\delta))$ | $O(\ln(1/\delta))$ (same MGF route) |
The right way to read PAC-Bayes: it does not improve the $1/\epsilon^2$ rate that worst-case PAC theory shows is unimprovable. It improves the complexity term, replacing the class capacity $d$ (which can be tens of millions for a modern network) with $\mathrm{KL}(Q \| P)$ (which can be much smaller if the prior is well-chosen). Both bounds are vacuous when the complexity term exceeds $n$; both are useful when the complexity term is much smaller than $n$. PAC-Bayes wins in regimes where the worst-case capacity is huge but the learned solution is close to a sensible prior.
Worked Numeric Example: How a Good Prior Salvages the Bound
The advanced exercise below shows a vacuous bound for a poorly-chosen prior. Here is the same setup with several choices of prior, side by side. Network: $d$ parameters, $n$ training samples, Gaussian posterior $Q = \mathcal{N}(w^*, \sigma_Q^2 I)$ centered at the learned weights $w^*$, fixed empirical stochastic risk and $\delta$.

| Prior | Dominant KL term | McAllester penalty | Bound | Vacuous? |
|---|---|---|---|---|
| $\mathcal{N}(0, \sigma_P^2 I)$, narrow $\sigma_P$ | $\|w^*\|^2 / (2\sigma_P^2)$, order $d$ | large | $> 1$ | Yes |
| $\mathcal{N}(0, \sigma_P^2 I)$, broad $\sigma_P$ | $\tfrac{d}{2}\ln(\sigma_P^2/\sigma_Q^2)$, still order $d$ | large | $> 1$ | Yes |
| $\mathcal{N}(w_0, \sigma_P^2 I)$, init-centered | $\|w^* - w_0\|^2 / (2\sigma_P^2)$ | small | $< 1$ | No |
| Data-dependent $\mathcal{N}(w_{S_1}, \sigma_P^2 I)$, certified on $S_2$ | $\|w^* - w_{S_1}\|^2 / (2\sigma_P^2)$, smallest | small | $< 1$ | No |
Three observations:
- Naive prior fails. A spherical Gaussian centered at zero charges the bound for $\|w^*\|^2 / (2\sigma_P^2)$, which scales with the network's parameter count. Vacuous.
- Centering the prior at initialization changes the KL term from $\|w^*\|^2$ to $\|w^* - w_0\|^2$ (up to the shared $2\sigma_P^2$ factor). If training only moves the weights moderately, the gap is much smaller. Bound becomes non-vacuous.
- Data-dependent priors (Ambroladze-Parrado-Hernandez-Shawe-Taylor 2007): split the data into $S_1$ and $S_2$, train a "warm-start" prior on $S_1$, certify on $S_2$. The KL term shrinks further because the warm-start center is even closer to $w^*$ than the initialization is. This is the route to genuinely tight modern certificates.
The same data, the same loss, the same algorithm — only the prior changed. PAC-Bayes is a framework for converting modeling choices about the prior into rigorous certificates; the modeling choices matter as much as the bound itself.
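The salvage effect can be reproduced with a short numerical sketch. All values below are assumed for illustration (they are not the table's original numbers): squared distances stand in for the dominant Gaussian-KL numerator.

```python
import math

def mcallester(emp_risk, kl, n, delta):
    return emp_risk + math.sqrt((kl + math.log(2 * math.sqrt(n) / delta)) / (2 * n))

# Assumed setup: squared distances play the role of the Gaussian KL numerator.
n, delta, emp_risk = 60_000, 0.05, 0.05
sigma2 = 0.01                    # assumed shared prior/posterior variance
dist2 = {                        # assumed squared distances to the learned weights
    "zero-centered prior": 8_000.0,   # ||w*||^2 is large for a big network
    "init-centered prior":    40.0,   # ||w* - w0||^2: training moved less
    "data-dependent prior":    4.0,   # warm start lands near the solution
}
for name, d2 in dist2.items():
    kl = d2 / (2 * sigma2)       # dominant KL term for the Gaussian pair
    b = mcallester(emp_risk, kl, n, delta)
    print(f"{name:>21}: KL ~ {kl:>8.0f}, bound = {b:.2f}"
          f" ({'vacuous' if b >= 1 else 'non-vacuous'})")
```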
Common Confusions
PAC-Bayes does not require Bayesian inference
Despite the name, PAC-Bayes bounds are frequentist. The prior is not a belief; it is a reference distribution chosen for the bound. The posterior is not computed by Bayes rule; it is any distribution, typically optimized to minimize the bound. You can apply PAC-Bayes to a deterministic network by setting $Q$ to be a Gaussian centered on the trained weights.
Non-vacuous does not mean tight
The first non-vacuous bound for MNIST (Dziugaite and Roy, 2017) was about 0.16 for a network with test error around 0.02. Non-vacuous means the bound is less than 1 (better than random guessing), which is a milestone but not a precise characterization of generalization.
PAC-Bayes is not the same thing as naive flat-minima folklore
It is tempting to summarize PAC-Bayes as "flat minima generalize." That is too loose. Simple flatness metrics can change under reparameterization and therefore fail as intrinsic complexity measures. PAC-Bayes is more disciplined: you must specify an actual posterior, an actual prior, and an actual KL cost. Flatness only matters insofar as it lets you choose a broad posterior with low empirical stochastic risk.
The prior must be independent of training data
If you set $P$ after looking at the training data, the bound is invalid. The data-splitting trick (use half the data for $P$, half for the bound) is the standard workaround. Forgetting this independence requirement is a common error.
Canonical Examples
Gaussian prior and posterior
Let $P = \mathcal{N}(0, \sigma_P^2 I_d)$ and $Q = \mathcal{N}(w^*, \sigma_Q^2 I_d)$, where $w^*$ are the trained weights and $d$ is the number of parameters. Then:
$$\mathrm{KL}(Q \| P) = \frac{1}{2}\left[\frac{d\,\sigma_Q^2 + \|w^*\|^2}{\sigma_P^2} - d + d \ln \frac{\sigma_P^2}{\sigma_Q^2}\right]$$
For $\sigma_Q \ll \sigma_P$ (narrow posterior, broad prior), this is approximately $\frac{\|w^*\|^2}{2\sigma_P^2} + \frac{d}{2} \ln \frac{\sigma_P^2}{\sigma_Q^2}$. The first term penalizes large weights; the second penalizes many parameters that must be specified precisely. If $\sigma_Q$ can be large (the network is insensitive to weight perturbations), the second term shrinks.
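The closed form is easy to sanity-check numerically; a minimal sketch assuming numpy, matching the isotropic-Gaussian formula term by term:

```python
import numpy as np

def kl_isotropic_gauss(w, sigma_q, sigma_p, d):
    """KL( N(w, sigma_q^2 I_d) || N(0, sigma_p^2 I_d) )."""
    return 0.5 * ((d * sigma_q**2 + float(np.dot(w, w))) / sigma_p**2
                  - d + d * np.log(sigma_p**2 / sigma_q**2))

d = 1_000_000
w = np.full(d, 0.01)  # illustrative learned weights with ||w||^2 = 100
print(kl_isotropic_gauss(w, sigma_q=0.05, sigma_p=1.0, d=d))
# Broadening the posterior shrinks the d * ln(sigma_p^2 / sigma_q^2) term:
print(kl_isotropic_gauss(w, sigma_q=0.5, sigma_p=1.0, d=d))
```

The second call returns a much smaller KL than the first: the precision term, not the weight norm, dominates for large $d$, which is exactly the noise-robustness lever discussed above.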
How a good prior changes the story
Suppose you use random initialization as the center of the prior. Then the KL term pays for the distance between the learned weights and initialization, plus the precision of the posterior. If training only moves the weights moderately and the loss stays low under noise, the certificate can be surprisingly good even in a huge network.
Now suppose instead that your prior is centered far from any good solution. The posterior must travel a long distance or become extremely sharp to fit the data, and the KL term becomes enormous. This is why prior design is the central craft problem in PAC-Bayes.
Summary
- PAC-Bayes controls generalization through empirical stochastic risk plus a KL penalty
- The framework charges for information added beyond the prior, not for raw parameter count
- Good certificates come from a well-chosen prior, a broad accurate posterior, and enough data
- Catoni-style bounds matter because practical PAC-Bayes work is numerical, not just asymptotic
- The framework gives a rigorous formulation of two folklore intuitions about deep nets — that flat / compressible solutions generalize — but only via specific prior and posterior choices; flatness alone is not the same statement
Exercises
Problem
Let $P = \mathcal{N}(0, I_d)$ and $Q = \mathcal{N}(\mu, \sigma^2 I_d)$ for some $\mu \in \mathbb{R}^d$ and $\sigma > 0$. Compute $\mathrm{KL}(Q \| P)$. How does it scale with $\|\mu\|$ and $d$?
Problem
A network with $d$ parameters is trained on the $n = 60{,}000$ MNIST training images with 0-1 loss. You set $P = \mathcal{N}(w_0, \sigma^2 I)$ and $Q = \mathcal{N}(w^*, \sigma^2 I)$, where $w_0$ is the initialization and $w^*$ the trained weights. The empirical risk of $Q$ (measured by sampling from $Q$ and averaging) is 0.02. Using McAllester's bound with $\delta = 0.05$, for what values of $\mathrm{KL}(Q \| P)$ is the bound non-vacuous?
Problem
Explain why data-dependent priors (using part of the training set to choose $P$) can dramatically improve PAC-Bayes bounds. What is the formal justification, and what fraction of data should you allocate to the prior?
References
Canonical:
- McAllester, "PAC-Bayesian Model Averaging" (1999), Sections 1-3
- Seeger, "PAC-Bayesian Generalisation Error Bounds for Gaussian Process Classification" (2002), Sections 1-2
- Catoni, PAC-Bayesian Supervised Classification (2007), Chapters 1-2
- Guedj, "A Primer on PAC-Bayesian Learning" (2019)
Current:
- Ambroladze, Parrado-Hernandez, Shawe-Taylor, "Tighter PAC-Bayes Bounds" (2007)
- Dziugaite & Roy, "Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks" (2017)
- Perez-Ortiz, Rivasplata, Shawe-Taylor, Szepesvari, "Tighter Risk Certificates for Neural Networks" (JMLR 2021)
Next Topics
- Algorithmic stability: an alternative path to generalization bounds that does not require a prior
- Implicit bias and modern generalization: why SGD finds solutions with small PAC-Bayes complexity
Last reviewed: May 4, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites (4)
- PAC Learning Framework (layer 1, tier 1)
- Sample Complexity Bounds (layer 2, tier 1)
- Rademacher Complexity (layer 3, tier 1)
- Bayesian Estimation (layer 0B, tier 2)
Derived topics (2)
- Algorithmic Stability (layer 3, tier 1)
- Implicit Bias and Modern Generalization (layer 4, tier 1)