

Predictive Coding and Autoencoders in the Brain

Hierarchical predictive coding (Rao-Ballard) and the free-energy principle as biological analogs of amortized variational inference and approximate backprop.

Advanced · Tier 3 · Current · Reference · ~15 min

Why This Matters

Cortex is metabolically expensive and bandwidth-limited. Predictive coding proposes that, rather than transmitting raw activations upward, each cortical area sends only the residual its higher-area model failed to predict. Top-down projections carry predictions, bottom-up projections carry prediction errors, and learning reduces error. This single architectural commitment yields a sharp algorithmic story: perception is approximate inference, and learning is gradient descent on a free-energy bound.

For ML readers the link is direct. The Rao-Ballard hierarchy is structurally a stacked autoencoder with feedback connections. The free-energy principle is the variational ELBO with biological branding. Whittington and Bogacz showed that, under specific assumptions, predictive-coding updates approximate backprop arbitrarily well using only local Hebbian-like rules. Whether the cortex actually implements any of this is contested, but the math is shared with the generative models we already train.

Core Ideas

Rao and Ballard (1999, Nat. Neurosci. 2(1)). Each layer $l$ maintains a state $r_l$ and predicts the layer below via a generative weight matrix $W_l$: $\hat r_{l-1} = f(W_l\, r_l)$. The prediction error is $\varepsilon_{l-1} = r_{l-1} - \hat r_{l-1}$. State updates and weights minimize a sum of squared errors weighted by precision (inverse variance) at each level:

$$F = \sum_l \frac{1}{2\sigma_l^2}\, \lVert \varepsilon_l \rVert^2.$$

Trained on natural-image patches, their model reproduced extra-classical receptive-field effects (end-stopping, contextual modulation) without putting them in by hand.
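A minimal NumPy sketch of these dynamics, assuming a three-layer hierarchy, tanh nonlinearities, unit precisions, and illustrative step sizes and layer widths (none of these choices come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [16, 8, 4]   # layer widths, with the data layer at the bottom
W = [0.1 * rng.standard_normal((sizes[l], sizes[l + 1])) for l in range(2)]

f = np.tanh
def f_prime(u):
    return 1.0 - np.tanh(u) ** 2

def settle(x, n_steps=50, lr_r=0.2):
    """Inference: gradient descent on F over the states r, data clamped."""
    r = [x] + [np.zeros(s) for s in sizes[1:]]
    for _ in range(n_steps):
        u = [W[l] @ r[l + 1] for l in range(2)]    # top-down pre-activations
        eps = [r[l] - f(u[l]) for l in range(2)]   # prediction errors
        for l in (1, 2):
            # Each state trades off the error it causes below against
            # the error on itself from the level above (none at the top).
            grad = -W[l - 1].T @ (eps[l - 1] * f_prime(u[l - 1]))
            if l < 2:
                grad = grad + eps[l]
            r[l] = r[l] - lr_r * grad
    u = [W[l] @ r[l + 1] for l in range(2)]
    eps = [r[l] - f(u[l]) for l in range(2)]
    return r, eps, u

def learn(x, lr_w=0.05):
    """Learning: Hebbian-style update from local error and activity."""
    r, eps, u = settle(x)
    for l in range(2):
        W[l] += lr_w * np.outer(eps[l] * f_prime(u[l]), r[l + 1])

for _ in range(100):   # toy training loop on random 'image patches'
    learn(rng.standard_normal(sizes[0]))
```

Both the state and weight updates descend the same objective $F$; each synapse uses only the error and activity available at its two ends.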

Free-energy principle (Friston, 2010, Nat. Rev. Neurosci. 11(2)). Reframes predictive coding as variational inference on a generative model. The brain maintains a recognition density $q(z \mid x)$ and minimizes variational free energy $F[q, x] = \mathbb{E}_q[-\log p(x, z)] - \mathcal{H}[q]$, which is an upper bound on surprise $-\log p(x)$ and equivalent to the negative ELBO. Action also minimizes free energy, giving a unified account of perception, learning, and behavior. The unification is conceptually elegant; many specific predictions are difficult to falsify.
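The bound is one line of algebra: since $-\mathcal{H}[q] = \mathbb{E}_q[\log q(z \mid x)]$,

$$F[q, x] = \mathbb{E}_q\!\left[\log \frac{q(z \mid x)}{p(x, z)}\right] = -\log p(x) + \mathrm{KL}\big(q(z \mid x)\,\|\,p(z \mid x)\big) \;\ge\; -\log p(x),$$

with equality exactly when $q$ is the true posterior. Minimizing $F$ over $q$ is inference (tightening the bound); minimizing it over the model parameters is learning (raising $\log p(x)$).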

Connection to amortized variational inference. A VAE encoder $q_\phi(z \mid x)$ amortizes the cost of inferring $z$ across data points. The Rao-Ballard hierarchy plays the same role with a fixed iterative inference procedure (a few steps of state updates per stimulus) rather than a learned encoder. Both schemes optimize the same free-energy objective; they differ in how inference is implemented.
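The contrast is easy to see in a linear-Gaussian toy model. The sketch below uses MAP point estimates of $z$ rather than full distributions, with made-up dimensions, step sizes, and names (`W_dec`, `W_enc`, etc. are illustrative): amortized inference is one learned forward pass, while the predictive-coding route descends the same free energy iteratively per stimulus.

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_z = 16, 4
W_dec = 0.1 * rng.standard_normal((d_x, d_z))   # generative (decoder) weights
W_enc = 0.1 * rng.standard_normal((d_z, d_x))   # hypothetical learned encoder

def free_energy(x, z, sigma2=1.0):
    """MAP objective: reconstruction error plus a unit-Gaussian prior on z."""
    eps = x - W_dec @ z
    return eps @ eps / (2 * sigma2) + z @ z / 2

def infer_amortized(x):
    """VAE-style: a single forward pass; the cost was paid at encoder training."""
    return W_enc @ x

def infer_iterative(x, n_steps=100, lr=0.05, sigma2=1.0):
    """Predictive-coding-style: settle z by descending the same objective."""
    z = np.zeros(d_z)
    for _ in range(n_steps):
        eps = x - W_dec @ z
        z -= lr * (z - W_dec.T @ eps / sigma2)   # dF/dz
    return z

x = rng.standard_normal(d_x)
print(free_energy(x, infer_amortized(x)))   # only as good as the (untrained) encoder
print(free_energy(x, infer_iterative(x)))   # near-optimal, but iterative per input
```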

Whittington and Bogacz (2017, Neural Comput. 29(5)). Construct a predictive-coding network in which top-down predictions and bottom-up errors evolve to a fixed point, then synaptic updates use only the locally available error and activity at each synapse. Under linear or near-linear regimes, the learned weight changes converge to backprop weight changes. This makes predictive coding the most concrete biologically plausible approximation to backprop currently on the table.
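A numerical sanity check in the simplest setting the correspondence covers: a two-layer linear network with input and output clamped, hidden state settled exactly at its linear fixed point. This is an illustration under those assumptions, not the paper's general construction; the residual mismatch is the factor $(I + W_2 W_2^\top)^{-1}$ applied to the output error, which approaches the identity as the top-layer weights shrink.

```python
import numpy as np

rng = np.random.default_rng(2)
d0, d1, d2 = 8, 6, 3
W1 = 0.1 * rng.standard_normal((d1, d0))
W2 = 0.1 * rng.standard_normal((d2, d1))
x = rng.standard_normal(d0)
y = W2 @ W1 @ x + 0.01 * rng.standard_normal(d2)  # target near the prediction:
                                                  # the small-error regime

# Backprop weight changes (descent on L = 0.5 * ||y - W2 W1 x||^2).
h = W1 @ x
e_bp = y - W2 @ h
g_W2 = np.outer(e_bp, h)
g_W1 = np.outer(W2.T @ e_bp, x)

# Predictive coding: clamp r0 = x and r2 = y, settle the hidden state r1 at
# the fixed point of dF/dr1 = eps1 - W2^T eps2 = 0, then update locally.
r1 = np.linalg.solve(np.eye(d1) + W2.T @ W2, W1 @ x + W2.T @ y)
eps1 = r1 - W1 @ x   # at the fixed point eps1 = W2.T @ eps2: backprop's
eps2 = y - W2 @ r1   # recursion, with eps2 standing in for the output error
d_W1 = np.outer(eps1, x)   # local: postsynaptic error x presynaptic activity
d_W2 = np.outer(eps2, r1)

print(np.linalg.norm(d_W1 - g_W1) / np.linalg.norm(g_W1))  # small for small W2
print(np.linalg.norm(d_W2 - g_W2) / np.linalg.norm(g_W2))
```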

Definition

Prediction Error Unit

In a predictive-coding hierarchy, a prediction error unit stores the difference between an observed lower-level state and the top-down prediction of that state: $\varepsilon_l = r_l - \hat r_l$. Learning and inference push these errors toward zero, weighted by their precision.

Proposition

Local Backprop Approximation

Statement

Under the predictive-coding fixed-point assumptions, local updates using activity and prediction error approximate the weight changes produced by backpropagation.

Intuition

Backprop sends error derivatives backward. Predictive coding instead lets top-down predictions and bottom-up errors settle until each synapse sees a local error signal aligned with the gradient it would have received.

Failure Mode

The approximation can fail far from the fixed point, with strongly nonlinear dynamics, or when the biological circuit does not provide the required separated error and state populations.

Exercise · Core

Problem

In the objective $F = \sum_l \lVert \varepsilon_l \rVert^2 / (2\sigma_l^2)$, what happens to a layer's influence when $\sigma_l^2$ is made smaller?

Common Confusions

Watch Out

Variational free energy is not physical free energy

Friston's free energy is the variational free energy from Bayesian statistics, which has the same algebraic form as the Helmholtz free energy from statistical mechanics but tracks belief states, not particle ensembles. Treat the name as a useful analogy, not a thermodynamic identity.

Watch Out

Predictive coding is a candidate cortical account

Some predictions match neurophysiology (precision-weighted error responses, expectation-modulated activity), but several core claims (separate error and prediction populations, hierarchical organization of generative models) lack clean experimental confirmation. The framework is a candidate, not a settled account.

References

Foundational:

  • Rao & Ballard, "Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects" (Nat. Neurosci. 2(1), 1999). The hierarchical predictive-coding model that reproduced end-stopping without putting it in by hand.
  • Friston, "The free-energy principle: a unified brain theory?" (Nat. Rev. Neurosci. 11(2), 2010). Reframes predictive coding as variational inference and extends it to action.
  • Whittington & Bogacz, "An Approximation of the Error Backpropagation Algorithm in a Predictive Coding Network with Local Hebbian Synaptic Plasticity" (Neural Comput. 29(5), 2017). The local-update result connecting predictive coding to backprop.

Variational inference connection:

  • Kingma & Welling, "Auto-Encoding Variational Bayes" (ICLR 2014; arXiv:1312.6114). The amortized inference baseline the iterative predictive-coding scheme is compared against.
  • Marino, "Predictive Coding, Variational Autoencoders, and Biological Connections" (Neural Comput. 34(1), 2022; arXiv:2011.07464). Direct mathematical bridge between the two frameworks.

Modern theory and implementations:

  • Millidge, Tschantz & Buckley, "Predictive Coding Approximates Backprop along Arbitrary Computation Graphs" (Neural Comput. 34(6), 2022; arXiv:2006.04182). Generalizes Whittington-Bogacz beyond MLPs.
  • Salvatori, Song, Yordanov, Millidge, Sha, Emde, Xu, Bogacz & Lukasiewicz, "A Stable, Fast, and Fully Automatic Learning Algorithm for Predictive Coding Networks" (ICLR 2024; arXiv:2212.00720). Practical training of deep PCNs.
  • Song, Lukasiewicz, Xu & Bogacz, "Inferring neural activity before plasticity as a foundation for learning beyond backpropagation" (Nat. Neurosci. 27, 2024). Recent biological-plausibility result for PCN learning.
  • Pinchetti, Salvatori, Yordanov, Millidge, Song & Lukasiewicz, "Predictive Coding beyond Gaussian Distributions" (NeurIPS 2022; arXiv:2211.03481). Extending the Gaussian-noise assumption.

Critique and falsifiability:

  • Aitchison & Lengyel, "With or without you: predictive coding and Bayesian inference in the brain" (Curr. Opin. Neurobiol. 46, 2017). Distinguishes the algorithmic claims from the broader Bayesian-brain framing.
  • Bogacz, "A tutorial on the free-energy framework for modelling perception and learning" (J. Math. Psychol. 76, 2017). Step-by-step derivation that surfaces what the framework does and does not commit to.


Last reviewed: April 27, 2026
