
Modern Generalization

PAC-Bayes Bounds

PAC-Bayes bounds control the generalization gap of a stochastic predictor by the KL divergence between a learner-chosen posterior and a data-independent prior. They have produced some of the few non-vacuous generalization bounds reported for trained neural networks (Dziugaite-Roy 2017 onward); how tight the bound gets depends heavily on the choice of prior.

Advanced · Tier 1 · Current · Core spine · ~30 min

Why This Matters

Classical generalization bounds such as VC dimension and Rademacher complexity ask how rich the entire hypothesis class is. For modern neural networks, that question is often too coarse. The class is expressive enough to memorize random labels, so the resulting bounds are typically vacuous: they predict generalization error larger than 1 even when the trained network performs well.

PAC-Bayes asks a sharper question: after training, how far did the learner move from a reference distribution chosen before seeing the data? The answer is measured by \mathrm{KL}(Q \| P) between a prior P and a posterior Q. If sampled predictors from Q fit the training data and Q is not too far from P, the test risk is small.

This is why PAC-Bayes matters so much in modern ML theory. It is one of the few general frameworks that:

  • explains why parameter count alone is the wrong complexity measure for deep networks
  • connects rigorously to compression, noise robustness, and flatness-style intuitions
  • has produced genuinely non-vacuous certificates for neural networks

Mental Model

Imagine writing down your learned predictor in two stages.

  1. First, announce a reference distribution P before seeing data. This might encode random initialization, a pretraining checkpoint, a symmetry-respecting Gaussian prior, or some other inductive bias that is fixed in advance.
  2. After seeing the data, choose a posterior Q that concentrates on predictors with low training error.

PAC-Bayes says the test error of the stochastic predictor h \sim Q is controlled by two quantities:

  • how well predictors sampled from Q fit the training data
  • how many extra bits are needed to describe Q relative to P

That second quantity is exactly the KL term. So PAC-Bayes is best read as a code-length bound on learned solutions. You are not paying for the raw size of the network. You are paying for how much information the training data forced you to add on top of the prior.

Formal Setup

Let \mathcal{H} be a hypothesis space. Let P be a prior distribution over \mathcal{H}, chosen before seeing data. Let Q be a posterior distribution over \mathcal{H}, which may depend on the training data S = \{(x_1, y_1), \ldots, (x_n, y_n)\} drawn i.i.d. from \mathcal{D}.

Definition

Stochastic Classifier

A stochastic classifier draws h \sim Q and predicts using h. The population risk of Q is:

R(Q) = \mathbb{E}_{h \sim Q}[R(h)] = \mathbb{E}_{h \sim Q}\left[\mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y)]\right]

The empirical risk of Q is:

\hat{R}_n(Q) = \mathbb{E}_{h \sim Q}\left[\frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)\right]
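As a concrete illustration of these two definitions, the sketch below uses a made-up finite setting: hypotheses are represented by their per-point losses, and R(Q) and \hat{R}_n(Q) are just posterior-weighted averages. All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-hypothesis 0-1 loss on each of 5 domain points (rows: hypotheses).
loss = np.array([
    [0, 0, 1, 0, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
], dtype=float)

p_domain = np.array([0.3, 0.2, 0.2, 0.2, 0.1])   # true data distribution D
Q = np.array([0.7, 0.2, 0.1])                     # posterior over the hypotheses

# Population risk of each hypothesis, then of the Gibbs classifier Q.
R_h = loss @ p_domain
R_Q = Q @ R_h

# Empirical risk: draw an i.i.d. sample of size n from D and average.
n = 50
sample = rng.choice(5, size=n, p=p_domain)
Rhat_h = loss[:, sample].mean(axis=1)
Rhat_Q = Q @ Rhat_h

print(f"R(Q)     = {R_Q:.3f}")
print(f"Rhat(Q)  = {Rhat_Q:.3f}  (n = {n})")
```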

Definition

KL Divergence

The Kullback-Leibler divergence from Q to P is:

\mathrm{KL}(Q \| P) = \mathbb{E}_{h \sim Q}\left[\log \frac{Q(h)}{P(h)}\right]

This measures how much Q diverges from the prior P. It is zero when Q = P and infinite when Q puts mass outside the support of P.
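A minimal sketch of this definition for finite PMFs (the setting used by the Lean layer below), with illustrative distributions; it shows that the divergence is positive away from the prior, zero at Q = P, and infinite off the prior's support.

```python
import numpy as np

def kl_divergence(q: np.ndarray, p: np.ndarray) -> float:
    """Finite-PMF KL(Q || P) = sum_i q_i log(q_i / p_i), with 0 log 0 = 0."""
    mask = q > 0
    if np.any(p[mask] == 0):
        return np.inf        # Q puts mass where P has none
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.7, 0.2, 0.1])

print(kl_divergence(Q, P))   # positive
print(kl_divergence(P, P))   # 0.0 when Q = P
print(kl_divergence(np.array([0.0, 0.0, 1.0]), np.array([0.5, 0.5, 0.0])))  # inf
```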

Lean-verified finite PAC-Bayes layer

The current Lean artifact verifies the finite-distribution layer used inside PAC-Bayes proofs: KL nonnegativity, Donsker-Varadhan, posterior averages, and a posterior-gap change-of-measure adapter for finite PMFs. It also verifies the positive-\lambda rearrangement that turns a bounded prior log-moment into the canonical posterior-risk form, the finite product-sample MGF bridge for bounded losses, a finite [0,1] bounded-loss Catoni-style confidence bound, the fixed-budget McAllester-style square-root corollary, and a finite-grid McAllester-style peeling bound that supports posterior-dependent penalties through an explicit grid-cover certificate. The scope is finite throughout: finite data domain, finite hypothesis index, finite prior and posterior PMFs, positive confidence allocations, finite grids, and scalar losses in [0,1].

Lemma

Finite-PMF KL Nonnegativity (Lean-verified)

Statement

For finite distributions \rho and \pi on the same finite index set, with \pi_i > 0 for every i:

\mathrm{KL}(\rho \| \pi) = \sum_i \rho_i \log\frac{\rho_i}{\pi_i} \geq 0.

The Lean theorem is TheoremPath.LearningTheory.PACBayesKL.klDiv_nonneg.

Intuition

This is Gibbs' inequality in the finite setting. It is the basic reason the PAC-Bayes complexity penalty is never negative when the posterior is absolutely continuous with respect to the prior.

Proof Sketch

The Lean proof sums the per-coordinate inequality \rho_i \log(\rho_i/\pi_i) \geq \rho_i - \pi_i, then uses \sum_i \rho_i = \sum_i \pi_i = 1.

Failure Mode

This verifies the finite-PMF, full-support-prior setting. Sample-dependent PAC-Bayes bounds and measure-theoretic posteriors on continuous hypothesis spaces are separate extension targets.

Lemma

Donsker-Varadhan Finite Change-of-Measure Inequality (Lean-verified)

Statement

For any finite posterior \rho, full-support prior \pi, and function f:

\sum_i \rho_i f_i \leq \mathrm{KL}(\rho \| \pi) + \log\left(\sum_i \pi_i e^{f_i}\right).

The Lean theorem is TheoremPath.LearningTheory.PACBayesKL.donsker_varadhan.

Intuition

This is the finite change-of-measure step used in PAC-Bayes proofs: it lets you replace an expectation under the learned posterior by a KL penalty plus a log-moment term under the prior.

Proof Sketch

The Lean proof defines the Gibbs posterior q_i \propto \pi_i e^{f_i}, applies KL nonnegativity to \mathrm{KL}(\rho \| q), and rearranges the resulting identity.

Failure Mode

This is the variational inequality layer. The finite bounded-loss confidence bound below adds the iid finite-product MGF and confidence step; continuous posteriors and infinite hypothesis spaces remain separate targets.
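A quick numerical sanity check of the finite inequality, using randomly drawn PMFs (illustrative only); it also confirms that the Gibbs posterior q_i \propto \pi_i e^{f_i} from the proof sketch attains equality.

```python
import numpy as np

def kl(q, p):
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

rng = np.random.default_rng(1)
k = 6
pi = rng.dirichlet(np.ones(k))          # full-support prior
rho = rng.dirichlet(np.ones(k))         # arbitrary posterior
f = rng.normal(size=k)                  # arbitrary function on the index set

lhs = float(rho @ f)
rhs = kl(rho, pi) + np.log(np.sum(pi * np.exp(f)))
print(lhs <= rhs + 1e-12)               # True: change-of-measure inequality

# The Gibbs posterior attains equality.
gibbs = pi * np.exp(f)
gibbs /= gibbs.sum()
print(np.isclose(float(gibbs @ f),
                 kl(gibbs, pi) + np.log(np.sum(pi * np.exp(f)))))  # True
```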

Lemma

Finite Posterior-Gap Change-of-Measure Adapter (Lean-verified)

Statement

For any finite posterior \rho, full-support prior \pi, real parameter \lambda, and finite risk vectors R_i and \hat R_i:

\lambda\,(R_\rho - \hat R_\rho) \leq \mathrm{KL}(\rho \| \pi) + \log\left(\sum_i \pi_i \exp(\lambda(R_i - \hat R_i))\right).

The Lean theorem is TheoremPath.LearningTheory.PACBayesKL.posterior_generalization_gap_change_of_measure.

Intuition

This is the finite posterior-average form of Donsker-Varadhan applied to the pointwise generalization gap. It is the bridge from a posterior-weighted gap to a KL penalty plus a prior exponential moment.

Proof Sketch

The Lean proof first rewrites R_\rho - \hat R_\rho as \sum_i \rho_i (R_i - \hat R_i), then applies the finite Donsker-Varadhan inequality with f_i = \lambda(R_i - \hat R_i).

Failure Mode

This is the deterministic finite adapter. The finite bounded-loss confidence bound below adds the iid sample distribution, exponential-moment bound, and confidence parameter \delta under [0,1] losses.

Lemma

Finite Posterior-Risk Lambda Rearrangement (Lean-verified)

Statement

For any finite posterior \rho, full-support prior \pi, positive \lambda, finite risk vectors R_i and \hat R_i, and log-moment bound c satisfying

\log\left(\sum_i \pi_i \exp(\lambda(R_i - \hat R_i))\right) \leq c,

the posterior risk obeys

R_\rho \leq \hat R_\rho + \frac{\mathrm{KL}(\rho \| \pi) + c}{\lambda}.

The Lean theorem is TheoremPath.LearningTheory.PACBayesKL.posterior_risk_le_empiricalRisk_plus_of_log_moment_bound.

Intuition

This is the algebraic PAC-Bayes shape after change-of-measure but before probability enters. It says that once a separate iid-sample argument controls the prior exponential moment, every posterior pays only KL plus that log-moment bound, scaled by 1/\lambda.

Proof Sketch

The Lean proof divides the finite posterior-gap change-of-measure inequality by positive \lambda, substitutes the external log-moment bound, and rearranges R_\rho - \hat R_\rho \leq (\mathrm{KL} + c)/\lambda into posterior-risk form.

Failure Mode

This is still deterministic and finite. The finite bounded-loss confidence bound below instantiates c for finite iid samples and [0,1] losses; the finite-grid square-root theorem adds a confidence-grid layer for posterior-dependent penalties. Measure-theoretic PAC-Bayes remains a separate target.
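The deterministic chain up to this point can be checked numerically. The sketch below draws arbitrary finite risks, computes the exact prior log-moment as the bound c, and confirms the posterior-risk inequality; all quantities are made up for illustration.

```python
import numpy as np

def kl(q, p):
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

rng = np.random.default_rng(2)
k, lam = 8, 3.0
pi = rng.dirichlet(np.ones(k))          # full-support prior
rho = rng.dirichlet(np.ones(k))         # arbitrary posterior
R = rng.uniform(0.0, 1.0, size=k)       # per-hypothesis population risks
R_hat = rng.uniform(0.0, 1.0, size=k)   # per-hypothesis empirical risks

# Exact prior log-moment; in the real bound an iid-sample argument supplies c.
c = np.log(np.sum(pi * np.exp(lam * (R - R_hat))))

lhs = float(rho @ R)
rhs = float(rho @ R_hat) + (kl(rho, pi) + c) / lam
print(lhs <= rhs + 1e-12)               # True for every posterior rho
```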

Lemma

Finite Prior-Moment Markov Confidence Adapter (Lean-verified)

Statement

For a finite sample distribution \nu, prior \pi, parameter \lambda, finite risks R_i, sample-dependent empirical risks \hat R_i(\omega), and constants C > 0 and \delta > 0, if

\mathbb{E}_{\omega \sim \nu}\left[\sum_i \pi_i \exp(\lambda(R_i - \hat R_i(\omega)))\right] \leq C,

then the finite sample mass of outcomes where the prior moment exceeds C/\delta is at most \delta:

\nu\left\{\omega: \sum_i \pi_i \exp(\lambda(R_i - \hat R_i(\omega))) \geq C/\delta\right\} \leq \delta.

The Lean theorem is TheoremPath.LearningTheory.PACBayesKL.priorExpMoment_tailMass_le_delta_of_expected_bound.

Intuition

This is the Markov step that turns an expected prior exponential-moment bound into a high-probability statement. It is useful because PAC-Bayes sample bounds need exactly this kind of event before applying the deterministic posterior-risk adapter.

Proof Sketch

The Lean proof defines the finite prior exponential moment for each sample outcome, proves it is nonnegative under a nonnegative prior, applies a finite weighted Markov inequality to the threshold C/\delta, and simplifies C/(C/\delta) = \delta.

Failure Mode

This is finite-support Markov only. The finite bounded-loss confidence theorem below supplies one concrete iid product-sample instantiation for [0,1] losses; it is not a continuous-posterior or infinite-class PAC-Bayes theorem.

Theorem

Finite Bounded-Loss Catoni-Style PAC-Bayes Bound (Lean-verified)

Statement

For a finite data distribution p, full-support finite prior \pi, positive sample size n, positive \lambda, positive \delta, and losses \ell_i(z) \in [0,1], the product-sample mass of samples S for which there exists a finite posterior \rho violating

R(\rho) \leq \hat R_S(\rho) + \frac{\mathrm{KL}(\rho \| \pi) + \log(1/\delta)}{\lambda} + \frac{\lambda}{8n}

is at most \delta.

The Lean theorem is TheoremPath.LearningTheory.PACBayesBoundedLoss.finiteCatoni_badEventMass_le_delta.

Intuition

This is an end-to-end finite PAC-Bayes confidence statement in the Lean stack. Hoeffding controls the one-coordinate bounded-loss MGF, the finite product bridge lifts it to an iid sample average, Markov turns the prior exponential moment into a confidence event, and Donsker-Varadhan transfers that event from the prior to every finite posterior.

Proof Sketch

The Lean proof combines the one-coordinate bounded-loss MGF, the finite product-sample MGF factorization, prior averaging, a finite weighted Markov step, and the deterministic posterior-risk adapter. The final statement bounds the mass of bad samples by showing every sample that violates the posterior-risk inequality must also lie in the prior-moment tail event.

Failure Mode

This is finite, scalar-valued, and fixed-\lambda. The square-root corollaries below optimize through fixed budgets and finite grids; continuous posteriors and infinite hypothesis spaces remain separate targets.

Theorem

Finite Fixed-Budget McAllester-Style PAC-Bayes Bound (Lean-verified)

Statement

For a finite data distribution p, full-support finite prior \pi, positive sample size n, positive confidence parameter \delta, positive complexity budget C, and losses \ell_i(z) \in [0,1], the product-sample mass of samples S for which there exists a finite posterior \rho with

\mathrm{KL}(\rho \| \pi) + \log(1/\delta) \leq C

but violating

R(\rho) \leq \hat R_S(\rho) + \sqrt{\frac{C}{2n}}

is at most \delta.

The Lean theorem is TheoremPath.LearningTheory.PACBayesBoundedLoss.finiteMcAllesterBoundedComplexity_badEventMass_le_delta.

Intuition

The fixed-\lambda Catoni theorem has a penalty C/\lambda + \lambda/(8n) once a posterior's KL-plus-confidence term is bounded by C. Choosing \lambda = \sqrt{8nC} turns that expression into the familiar square-root penalty \sqrt{C/(2n)}.

Proof Sketch

The Lean proof first establishes the algebraic optimizer identity for the fixed budget, then applies the deterministic posterior-risk adapter at \lambda = \sqrt{8nC}. The bad-event theorem reduces any square-root violation to the same prior-moment tail event already controlled by the finite bounded-loss Catoni layer.

Failure Mode

This is a fixed-budget finite corollary. The finite-grid theorem below removes the single-budget restriction using finitely many confidence buckets, but continuous posteriors, infinite hypothesis spaces, and all-real-\lambda optimization remain separate targets.
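The optimizer identity behind this corollary is easy to confirm numerically: with illustrative values of n and C, \lambda = \sqrt{8nC} minimizes C/\lambda + \lambda/(8n) over a grid, and the minimum equals \sqrt{C/(2n)}.

```python
import numpy as np

# Illustrative budget: n and C are made up.
n, C = 50_000, 5_000.0

def penalty(lam):
    return C / lam + lam / (8 * n)

lam_star = np.sqrt(8 * n * C)
grid = np.linspace(0.5 * lam_star, 2.0 * lam_star, 10_001)

print(np.isclose(penalty(lam_star), np.sqrt(C / (2 * n))))  # True: optimizer identity
print(penalty(grid).min() >= penalty(lam_star) - 1e-12)     # no grid point does better
```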

Theorem

Finite-Grid Optimized McAllester-Style PAC-Bayes Bound (Lean-verified)

Statement

For a finite data distribution p, full-support finite prior \pi, positive sample size n, losses \ell_i(z) \in [0,1], and a finite grid of positive complexity budgets C_g with confidence allocations \delta_g, suppose \sum_g \delta_g \leq \delta. If a posterior-dependent penalty \eta(\rho) is covered by the grid, in the sense that every finite posterior \rho has some bucket g satisfying

\mathrm{KL}(\rho \| \pi) + \log(1/\delta_g) \leq C_g \quad\text{and}\quad \sqrt{\frac{C_g}{2n}} \leq \eta(\rho),

then the product-sample mass of samples S for which there exists a finite posterior \rho violating

R(\rho) \leq \hat R_S(\rho) + \eta(\rho)

is at most \delta.

The Lean theorem is TheoremPath.LearningTheory.PACBayesBoundedLoss.finiteMcAllesterGridOptimized_badEventMass_le_delta.

Intuition

The fixed-budget theorem controls one square-root bucket. The grid theorem gives each bucket its own confidence allocation and then applies a finite union bound. The optimized wrapper says that if every posterior can be assigned to a bucket whose square-root penalty is no larger than the penalty you want to use, the same high-probability event controls all finite posteriors at once.

Proof Sketch

The Lean proof defines the finite bad-sample set for each bucket, applies the fixed-budget McAllester theorem to each bucket, sums the bucket masses with the finite weighted union-bound lemma, and then uses the grid-cover certificate to move from bucket-specific violations to the posterior-dependent penalty event.

Failure Mode

This is a finite-grid theorem. It does not quantify over all real \lambda, continuous posterior families, infinite hypothesis classes, or arbitrary measurable hypothesis spaces.

What The Bound Is Charging You For

The canonical PAC-Bayes structure looks like:

\text{test risk} \;\lesssim\; \text{empirical stochastic risk} + \sqrt{\frac{\mathrm{KL}(Q \| P) + \log(1/\delta)}{n}}

So there are only three real levers:

  • Empirical stochastic risk. If samples from Q already make mistakes on the training set, the bound starts from a bad baseline.
  • Relative information cost. If Q must be extremely concentrated or far from P, the KL term explodes.
  • Sample size. PAC-Bayes still needs data; the KL term only becomes harmless when it is small relative to n.

This decomposition is why the framework is so useful. It makes the failure mode visible. When a bound is vacuous, you can ask whether the issue is the empirical loss, the prior, the posterior width, or simply insufficient sample size.

Main Theorems

Theorem

McAllester PAC-Bayes Bound

Statement

For any prior P over \mathcal{H} chosen independently of S, for any \delta > 0, with probability at least 1 - \delta over the draw of S, simultaneously for all posteriors Q:

R(Q) \leq \hat{R}_n(Q) + \sqrt{\frac{\mathrm{KL}(Q \| P) + \log(n/\delta)}{2n}}

Intuition

The bound has two terms. The first is the training performance of the stochastic classifier. The second is a complexity penalty that grows with \mathrm{KL}(Q \| P). If the posterior concentrates near the prior, the penalty is small. The bound holds simultaneously for all posteriors Q, so you can optimize Q after seeing data without invalidating the guarantee.

Proof Sketch

The proof uses a change-of-measure argument. Start from the moment generating function: for any fixed h, \mathbb{E}_S[e^{n(R(h) - \hat{R}_n(h))^2}] is bounded by Hoeffding's lemma. Then use the Donsker-Varadhan variational formula: \mathbb{E}_{h \sim Q}[f(h)] \leq \mathrm{KL}(Q \| P) + \log \mathbb{E}_{h \sim P}[e^{f(h)}] for any measurable f. Apply this with f(h) = n(R(h) - \hat{R}_n(h))^2, take expectations over S, and apply Markov's inequality.

Why It Matters

Unlike VC and Rademacher bounds, the PAC-Bayes bound depends on the relationship between the learned solution and a reference prior, not on the raw capacity of the hypothesis class. For a network with 10 million parameters, if the learned weights are close to initialization (a natural prior), the KL term can be small enough to give non-vacuous bounds.

Failure Mode

The bound is only useful when \mathrm{KL}(Q \| P) is much smaller than n. If the prior is poorly chosen (far from the learned solution), the KL term dominates and the bound becomes vacuous. In practice, choosing the prior is the main difficulty. The bound also requires bounded loss; for unbounded losses, modified versions with sub-Gaussian assumptions are needed.

Theorem

Catoni PAC-Bayes Bound

Statement

For any prior P, any \delta > 0, and any \lambda > 0, with probability at least 1 - \delta over S, for all posteriors Q:

R(Q) \leq \frac{1}{1 - e^{-\lambda}}\left(1 - \exp\left(-\lambda \hat{R}_n(Q) - \frac{\mathrm{KL}(Q \| P) + \log(1/\delta)}{n}\right)\right)

Intuition

Catoni's bound is tighter than McAllester's because it avoids the square root. The parameter \lambda can be optimized. For small empirical risk and small KL, Catoni gives substantially better numerical bounds.

Proof Sketch

Uses the exponential moment method directly. For bounded losses, convexity of the exponential gives \log \mathbb{E}[e^{-\lambda \ell}] \leq \log(1 - R + Re^{-\lambda}). Combine with Donsker-Varadhan, take expectations over P and the sample, and apply Markov's inequality.

Why It Matters

In the race to produce non-vacuous bounds for deep networks, Catoni's bound consistently produces tighter numerical values than McAllester's. Dziugaite and Roy (2017) used optimized Catoni bounds to get the first non-vacuous generalization bound for a neural network on MNIST.

Failure Mode

Same prior-dependence issue as McAllester. The \lambda parameter must be chosen or optimized, adding another degree of freedom. For very small training error, the bound approaches the KL term divided by n, which is the irreducible cost of choosing the posterior.
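The numerical gap between the two bounds is easy to see with illustrative numbers (small empirical risk, modest KL); the sketch below evaluates McAllester's bound as stated above and sweeps \lambda for Catoni's. In a rigorous application \lambda must be fixed in advance or handled with a union bound over a grid; the sweep here only shows the attainable value.

```python
import numpy as np

# Illustrative numbers (made up): small empirical stochastic risk, modest KL.
n, delta = 50_000, 0.05
kl_term, emp_risk = 5_000.0, 0.02

# McAllester form as stated above.
mcallester = emp_risk + np.sqrt((kl_term + np.log(n / delta)) / (2 * n))

# Catoni form as stated above, for a given lambda.
def catoni(lam):
    num = 1 - np.exp(-lam * emp_risk - (kl_term + np.log(1 / delta)) / n)
    return num / (1 - np.exp(-lam))

lams = np.linspace(0.01, 20.0, 2_000)
catoni_best = catoni(lams).min()

print(f"McAllester bound:           {mcallester:.3f}")
print(f"Catoni bound (best lambda): {catoni_best:.3f}")
```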

Why PAC-Bayes Is Competitive for Deep Learning

The key insight is that overparameterized networks may have enormous raw capacity while still landing in solutions with modest effective complexity relative to a sensible prior. PAC-Bayes captures that through several complementary mechanisms.

Compression view. If Q is a Gaussian around the learned weights and P is a broader Gaussian around initialization, then \mathrm{KL}(Q \| P) acts like the number of bits needed to describe the solution relative to initialization. This is why PAC-Bayes and compression-based generalization often feel like two views of the same story.

Noise robustness. If the predictor keeps working when you perturb the weights, then you can choose a broader posterior Q. Broader posteriors are cheaper in KL, so noise-robust solutions are easier to certify. This is the rigorous core behind the "flat minima" intuition, though PAC-Bayes is more careful than naive flatness metrics.

Reference-relative complexity. A network with ten million parameters can still have a small PAC-Bayes complexity if it stays close to a good prior. Parameter count by itself is not the operative quantity.

Data-dependent priors. If you use one data split to build the prior and a disjoint split to certify the posterior, the prior can sit much closer to the eventual solution. Ambroladze, Parrado-Hernandez, and Shawe-Taylor (2007) formalized this idea.

What Practitioners Actually Optimize

In modern PAC-Bayes work, the goal is usually not to apply the theorem after training as an afterthought. Instead, people optimize a surrogate objective that balances:

  • low empirical risk under posterior noise
  • low KL to the chosen prior

For Gaussian posteriors over network weights, this means learning both a center and a scale. The center wants to fit the data. The scale wants to stay as broad as possible without destroying the empirical risk. That tradeoff is exactly why the bound is interesting: it turns "robust to perturbations" into a quantitative certificate.

This is also why Catoni-style bounds matter in practice. Once you are numerically optimizing a certificate, constants and square roots stop being cosmetic. They can be the difference between vacuous and non-vacuous.
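A minimal sketch of the center-versus-scale tradeoff described above, on a toy linear classifier with synthetic data (everything here is made up for illustration): the posterior is a Gaussian centered at fixed "trained" weights, the empirical stochastic risk is estimated by Monte Carlo, the KL uses the closed form from the Gaussian example below, and the McAllester-style surrogate is evaluated over a grid of posterior scales.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy setup: linear classifier in d dimensions, Gaussian prior N(0, sigma_P^2 I)
# and posterior N(w, sigma_Q^2 I) centered at the (pretend) trained weights w.
d, n, delta, sigma_P = 50, 2_000, 0.05, 1.0
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true + 0.5 * rng.normal(size=n))
w = w_true.copy()                      # pretend this is the trained center

def gaussian_kl(w, sigma_Q, sigma_P):
    r = (sigma_Q / sigma_P) ** 2
    return 0.5 * d * (r - 1 - np.log(r)) + np.dot(w, w) / (2 * sigma_P ** 2)

def stochastic_risk(w, sigma_Q, samples=200):
    # Monte Carlo estimate of the empirical 0-1 risk under h ~ Q.
    errs = []
    for _ in range(samples):
        wh = w + sigma_Q * rng.normal(size=d)
        errs.append(np.mean(np.sign(X @ wh) != y))
    return float(np.mean(errs))

print(" sigma_Q   emp_risk        KL   surrogate bound")
for sigma_Q in [0.01, 0.05, 0.1, 0.3, 1.0]:
    emp = stochastic_risk(w, sigma_Q)
    kl = gaussian_kl(w, sigma_Q, sigma_P)
    bound = emp + np.sqrt((kl + np.log(n / delta)) / (2 * n))
    print(f"  {sigma_Q:5.2f}    {emp:7.3f}  {kl:8.1f}   {bound:7.3f}")
```

Very small scales keep the empirical stochastic risk low but pay a large KL; very large scales do the opposite. The best certificate sits in between, which is exactly the tradeoff the text describes.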

From Stochastic to Deterministic Predictors

The theorem directly certifies the Gibbs classifier that samples h \sim Q at prediction time. But most deployed networks are deterministic. So why does PAC-Bayes still matter?

There are two answers.

First, some papers certify the stochastic network itself. This is already meaningful: it shows that a noisy version of the trained predictor generalizes.

Second, if predictions are stable under posterior noise, then the deterministic center of Q and the stochastic predictor often behave similarly on most examples. PAC-Bayes does not give that deterministic guarantee for free, but it provides the right quantitative object: a posterior that is both accurate and robust to perturbation. In binary classification there are also PAC-Bayesian majority-vote bounds, but the Gibbs view is the simplest entry point.

Comparison with Classical Bounds

Bound type | Depends on | Deep net behavior
VC dimension | Number of parameters | Vacuous (billions of params)
Rademacher complexity | Norm constraints on weights | Often vacuous for practical norms
PAC-Bayes | KL(posterior, prior) | Non-vacuous with good prior
Algorithmic stability | Sensitivity to one sample | Bounded for SGD with early stopping

PAC vs. PAC-Bayes: When Each One Wins

PAC-Bayes is not a replacement for classical PAC bounds. They occupy different regimes.

Question | Classical PAC | PAC-Bayes
What does the bound depend on? | VC / Rademacher complexity of the whole class | KL between learned posterior and chosen prior
Sample complexity in n | \Theta(d_{\text{VC}}/\epsilon^2) | \Theta((\mathrm{KL} + \log(1/\delta))/n)
Tight in worst case? | Yes (matching VC lower bounds) | No worst-case lower bound; depends on prior
Useful for overparameterized nets? | No (capacity penalty too large) | Yes (with a prior near the learned solution)
Requires data-independent prior? | No | Yes (or split data)
Sample complexity in \epsilon at fixed \mathrm{KL} | 1/\epsilon^2 | 1/\epsilon^2 (same MGF route)

The right way to read PAC-Bayes: it does not improve the 1/\epsilon^2 rate that worst-case PAC theory shows is unimprovable. It improves the complexity term, replacing d_{\text{VC}} (which can be \Theta(d) for a d-parameter network) with \mathrm{KL}(Q \| P) (which can be much smaller if the prior is well-chosen). Both bounds are vacuous when the complexity term exceeds n; both are useful when the complexity term is much smaller than n. PAC-Bayes wins in regimes where the worst-case capacity is huge but the learned solution is close to a sensible prior.

Worked Numeric Example: How a Good Prior Salvages the Bound

The advanced exercise below shows a vacuous bound for a poorly chosen prior. Here is the same setup with several choices of prior, side by side. Network: d = 10^6 parameters, n = 5 \times 10^4 training samples, empirical stochastic risk \hat{R}_n(Q) = 0.02, \delta = 0.05.

Prior P | \mathrm{KL}(Q \| P) | McAllester penalty \sqrt{(\mathrm{KL} + \log(n/\delta))/(2n)} | Bound | Vacuous?
\mathcal{N}(0, 0.1 \cdot I_d), \sigma_Q = 0.1 | 7.0 \times 10^5 | 2.65 | > 1 | Yes
\mathcal{N}(0, 0.1 \cdot I_d), \sigma_Q = 0.32 | 1.1 \times 10^5 | 1.05 | > 1 | Yes
\mathcal{N}(w_{\text{init}}, 0.1 \cdot I_d), \sigma_Q = 0.1 | 5 \times 10^3 | 0.224 | 0.244 | No
Data-dependent prior \mathcal{N}(w_{\text{half}}, 0.05 \cdot I_d), \sigma_Q = 0.05 | 5 \times 10^2 | 0.072 | 0.092 | No

Three observations:

  1. Naive prior fails. A spherical Gaussian centered at zero charges the bound for \|w\|^2, which scales with the network's parameter count. Vacuous.
  2. Centering the prior at initialization changes the KL term from \|w\|^2/(2\sigma_P^2) to \|w - w_{\text{init}}\|^2/(2\sigma_P^2). If training only moves the weights moderately, the gap is much smaller. Bound becomes non-vacuous.
  3. Data-dependent priors (Ambroladze-Parrado-Hernandez-Shawe-Taylor 2007): split data, train a "warm-start" prior on S_1, certify on S_2. The KL term shrinks further because w is even closer to w_{\text{half}} than to w_{\text{init}}. This is the route to genuinely tight modern certificates.

The same data, the same loss, the same algorithm — only the prior changed. PAC-Bayes is a framework for converting modeling choices about the prior into rigorous certificates; the modeling choices matter as much as the bound itself.
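The penalty and bound columns above follow directly from the stated KL values; a short script with the same n, \delta, and empirical risk reproduces them.

```python
import numpy as np

# Same setup as the table: n, delta, empirical stochastic risk, and the KL values.
n, delta, emp_risk = 5e4, 0.05, 0.02

rows = [
    ("zero-centered prior,  sigma_Q = 0.10", 7.0e5),
    ("zero-centered prior,  sigma_Q = 0.32", 1.1e5),
    ("init-centered prior,  sigma_Q = 0.10", 5e3),
    ("data-dependent prior, sigma_Q = 0.05", 5e2),
]

for label, kl in rows:
    penalty = np.sqrt((kl + np.log(n / delta)) / (2 * n))
    bound = emp_risk + penalty
    print(f"{label:40s} penalty = {penalty:5.3f}   bound = {bound:5.3f}")
```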

Common Confusions

Watch Out

PAC-Bayes does not require Bayesian inference

Despite the name, PAC-Bayes bounds are frequentist. The prior P is not a belief; it is a reference distribution chosen for the bound. The posterior Q is not computed by Bayes rule; it is any distribution, typically optimized to minimize the bound. You can apply PAC-Bayes to a deterministic network by setting Q to be a Gaussian centered on the trained weights.

Watch Out

Non-vacuous does not mean tight

The first non-vacuous bound for MNIST (Dziugaite and Roy, 2017) was about 0.16 for a network with test error around 0.02. Non-vacuous means the bound is less than 1 (better than random guessing), which is a milestone but not a precise characterization of generalization.

Watch Out

PAC-Bayes is not the same thing as naive flat-minima folklore

It is tempting to summarize PAC-Bayes as "flat minima generalize." That is too loose. Simple flatness metrics can change under reparameterization and therefore fail as intrinsic complexity measures. PAC-Bayes is more disciplined: you must specify an actual posterior, an actual prior, and an actual KL cost. Flatness only matters insofar as it lets you choose a broad posterior with low empirical stochastic risk.

Watch Out

The prior must be independent of training data

If you set P after looking at the training data, the bound is invalid. The data-splitting trick (use half the data for P, half for the bound) is the standard workaround. Forgetting this independence requirement is a common error.

Canonical Examples

Example

Gaussian prior and posterior

Let P = \mathcal{N}(0, \sigma_P^2 I_d) and Q = \mathcal{N}(w, \sigma_Q^2 I_d), where w are the trained weights and d is the number of parameters. Then:

\mathrm{KL}(Q \| P) = \frac{d}{2}\left(\frac{\sigma_Q^2}{\sigma_P^2} - 1 - \log\frac{\sigma_Q^2}{\sigma_P^2}\right) + \frac{\|w\|^2}{2\sigma_P^2}

For \sigma_Q \ll \sigma_P (narrow posterior, broad prior), this is approximately \frac{\|w\|^2}{2\sigma_P^2} + \frac{d}{2}\log\frac{\sigma_P^2}{\sigma_Q^2}. The first term penalizes large weights; the second penalizes many parameters that must be specified precisely. If \sigma_Q can be large (the network is insensitive to weight perturbations), the second term shrinks.
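A small sketch evaluating this closed form and the \sigma_Q \ll \sigma_P approximation, with made-up dimensions and weights; the approximation simply drops the \frac{d}{2}(\sigma_Q^2/\sigma_P^2 - 1) \approx -d/2 term.

```python
import numpy as np

def gaussian_kl(w, sigma_Q, sigma_P):
    """KL( N(w, sigma_Q^2 I_d) || N(0, sigma_P^2 I_d) ) for isotropic Gaussians."""
    d = w.size
    r = (sigma_Q / sigma_P) ** 2
    return 0.5 * d * (r - 1 - np.log(r)) + np.dot(w, w) / (2 * sigma_P ** 2)

rng = np.random.default_rng(4)
d = 10_000
w = 0.05 * rng.normal(size=d)            # made-up trained weights
sigma_P, sigma_Q = 1.0, 0.01

exact = gaussian_kl(w, sigma_Q, sigma_P)
# Approximation from the text: drops the d/2 (sigma_Q^2/sigma_P^2 - 1) term.
approx = np.dot(w, w) / (2 * sigma_P ** 2) + 0.5 * d * np.log(sigma_P ** 2 / sigma_Q ** 2)
print(f"exact KL  = {exact:,.1f}")
print(f"approx KL = {approx:,.1f}   (sigma_Q << sigma_P)")
```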

Example

How a good prior changes the story

Suppose you use random initialization as the center of the prior. Then the KL term pays for the distance between the learned weights and initialization, plus the precision of the posterior. If training only moves the weights moderately and the loss stays low under noise, the certificate can be surprisingly good even in a huge network.

Now suppose instead that your prior is centered far from any good solution. The posterior must travel a long distance or become extremely sharp to fit the data, and the KL term becomes enormous. This is why prior design is the central craft problem in PAC-Bayes.

Summary

  • PAC-Bayes controls generalization through empirical stochastic risk plus a KL penalty \mathrm{KL}(Q \| P)
  • The framework charges for information added beyond the prior, not for raw parameter count
  • Good certificates come from a well-chosen prior, a broad accurate posterior, and enough data
  • Catoni-style bounds matter because practical PAC-Bayes work is numerical, not just asymptotic
  • The framework gives a rigorous formulation of two folklore intuitions about deep nets — that flat / compressible solutions generalize — but only via specific prior and posterior choices; flatness alone is not the same statement

Exercises

Exercise (Core)

Problem

Let P = \mathcal{N}(0, I_d) and Q = \mathcal{N}(\mu, I_d) for some \mu \in \mathbb{R}^d. Compute \mathrm{KL}(Q \| P). How does it scale with d and \|\mu\|?

Exercise (Advanced)

Problem

A network has d = 10^6 parameters trained on n = 50{,}000 MNIST images with 0-1 loss. You set P = \mathcal{N}(0, 0.1 \cdot I_d) and Q = \mathcal{N}(w, 0.01 \cdot I_d), where \|w\|^2 = 100. The empirical risk of Q (measured by sampling from Q and averaging) is 0.02. Using McAllester's bound with \delta = 0.05, is the bound non-vacuous?

Exercise (Research)

Problem

Explain why data-dependent priors (using part of the training set to choose P) can dramatically improve PAC-Bayes bounds. What is the formal justification, and what fraction of data should you allocate to the prior?

References

Canonical:

  • McAllester, "PAC-Bayesian Model Averaging" (1999), Sections 1-3
  • Seeger, "PAC-Bayesian Generalisation Error Bounds for Gaussian Process Classification" (2002), Sections 1-2
  • Catoni, PAC-Bayesian Supervised Classification (2007), Chapters 1-2
  • Guedj, "A Primer on PAC-Bayesian Learning" (2019)

Current:

  • Ambroladze, Parrado-Hernandez, Shawe-Taylor, "Tighter PAC-Bayes Bounds" (2007)
  • Dziugaite & Roy, "Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks" (2017)
  • Perez-Ortiz, Rivasplata, Shawe-Taylor, Szepesvari, "Tighter Risk Certificates for Neural Networks" (JMLR 2021)

Next Topics

  • Algorithmic stability: an alternative path to generalization bounds that does not require a prior
  • Implicit bias and modern generalization: why SGD finds solutions with small PAC-Bayes complexity

Last reviewed: May 4, 2026
