Learning Theory Core
Kolmogorov Complexity and MDL
Kolmogorov complexity measures the shortest program that produces a string. The Minimum Description Length principle selects models that compress data best, providing a computable approximation to an incomputable ideal.
Why This Matters
The connection between compression and learning is deep: a model that compresses data well must have captured its regularities, and those regularities are exactly what you need for prediction. Kolmogorov complexity gives the theoretical ideal for compression. MDL (Minimum Description Length) gives a practical principle for model selection based on the same idea.
This perspective offers an alternative to the classical bias-variance framework. Instead of asking "how complex is the hypothesis class?", you ask "how short is the total description of the data using this model?"
Kolmogorov Complexity
Kolmogorov Complexity
The Kolmogorov complexity of a string $x$ (with respect to a fixed universal Turing machine $U$) is:

$$K_U(x) = \min \{ |p| : U(p) = x \}$$

where $|p|$ is the length of program $p$ in bits. It is the length of the shortest program that outputs $x$ and halts.
Key properties:
- Subadditivity: $K(xy) \le K(x) + K(y) + O(\log \min(K(x), K(y)))$.
- Invariance: changing the universal Turing machine changes $K(x)$ by at most an additive constant (independent of $x$).
- Most strings are incompressible: for any length $n$, at most $2^{n-c} - 1$ strings of length $n$ have $K(x) < n - c$. So at most a $2^{-c}$ fraction of strings can be compressed by $c$ bits.
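The counting bound behind the last property can be checked directly: there are fewer than $2^{n-c}$ programs shorter than $n - c$ bits, so at most that many $n$-bit strings can be compressed by $c$ bits. A minimal sketch (the function name is illustrative):

```python
# Counting argument: there are at most 2^(n-c) - 1 programs of length
# strictly less than n - c bits, so at most a ~2^(-c) fraction of the
# 2^n strings of length n can have K(x) < n - c.
def max_compressible_fraction(n: int, c: int) -> float:
    """Upper bound on the fraction of n-bit strings with K(x) < n - c."""
    num_short_programs = 2 ** (n - c) - 1  # programs of length 0 .. n-c-1
    num_strings = 2 ** n
    return num_short_programs / num_strings

# Fewer than 1 in 1000 strings can be compressed by 10 bits,
# regardless of n.
for c in (1, 10, 20):
    print(c, max_compressible_fraction(1000, c))
```

The bound is independent of the choice of universal machine up to the usual additive constant.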
Incomputability of Kolmogorov Complexity
Statement
There is no algorithm that, given an arbitrary string $x$, computes $K(x)$.
Intuition
A computable $K$ would be self-defeating. Given any threshold $n$, you could write a short program that searches through strings in lexicographic order and outputs the first one with $K(x) > n$; call its output $x_n$. By construction $K(x_n) > n$, but the search program itself has length only $c + O(\log n)$ (it just stores the constant $n$ and a fixed search routine), which for large $n$ is far below $n$. That contradicts $K(x_n) > n$. This is the Berry paradox made formal, and it is the same self-reference that powers the undecidability of the halting problem, although the argument here is direct, not a reduction from halting.
Proof Sketch
Suppose $K$ is computable. Define a program $P_n$ that enumerates all strings $x$ with $K(x) \ge n$ and outputs the first such string found. Program $P_n$ has length $c + O(\log n)$ (it just encodes the constant $n$). But $P_n$ outputs a string $x$ with $K(x) \ge n$, while $P_n$ itself is a program of length $c + O(\log n)$ that produces $x$, so $K(x) \le c + O(\log n)$. For large enough $n$, this gives $K(x) < n$, a contradiction.
Why It Matters
Incomputability is not a technicality. It means there is no algorithm that can determine the true complexity of data. Any practical approach to compression-based learning must use an approximation. MDL, BIC, and practical compression algorithms are all such approximations.
Failure Mode
You can compute upper bounds on $K(x)$ (any specific compression of $x$ gives one), but you can never certify that your bound is tight. There is no algorithm that, given $x$ and $k$, decides whether $K(x) \le k$.
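Any off-the-shelf compressor yields such a computable upper bound (plus the constant cost of the decompressor). A minimal sketch using zlib; the exact byte counts depend on the zlib build, so no specific numbers are claimed:

```python
import os
import zlib

def complexity_upper_bound_bits(data: bytes) -> int:
    """Upper bound on K(data) in bits: the length of one particular
    compression. The bound can never be certified tight, since K
    itself is incomputable."""
    return 8 * len(zlib.compress(data, 9))

structured = b"01" * 500       # highly regular: compresses well
random_ish = os.urandom(1000)  # incompressible with high probability

print(complexity_upper_bound_bits(structured))  # far below 8000 bits
print(complexity_upper_bound_bits(random_ish))  # near or above 8000 bits
```

The structured string lands far under its raw length, while the random bytes typically come out slightly larger than the input, illustrating the counting argument above.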
Minimum Description Length
Minimum Description Length Principle
The MDL principle selects the model $M$ that minimizes the total description length:

$$L(M) + L(D \mid M)$$

where $L(M)$ is the length (in bits) of encoding the model and $L(D \mid M)$ is the length of encoding the data given the model. This is a two-part code: first describe the model, then describe the data using the model.
The first term penalizes model complexity (more parameters require more bits). The second term penalizes poor fit (if the model does not explain the data well, the residuals require many bits). MDL balances these automatically.
Connection to coding theory. By the Kraft inequality, code lengths correspond to probability distributions: $L(x) = -\log_2 P(x)$. So minimizing description length is equivalent to maximizing a penalized likelihood. MDL with a universal code for the model class recovers BIC asymptotically.
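The correspondence can be checked numerically: the lengths of any prefix code satisfy the Kraft inequality, and the implied probabilities $P(x) = 2^{-L(x)}$ form a (sub-)distribution. A minimal sketch with a toy code assumed for illustration:

```python
import math

# A toy prefix code (assumed for illustration): symbol -> codeword.
# No codeword is a prefix of another, so decoding is unambiguous.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Kraft inequality: sum over codewords of 2^(-length) <= 1.
kraft_sum = sum(2 ** -len(w) for w in code.values())
assert kraft_sum <= 1.0

# Code lengths correspond to probabilities: L(x) = -log2 P(x),
# i.e. an optimal code for P gives x a codeword of about -log2 P(x) bits.
implied_probs = {s: 2 ** -len(w) for s, w in code.items()}
for s, p in implied_probs.items():
    assert len(code[s]) == -math.log2(p)

print(kraft_sum, implied_probs)
```

This particular code is complete (the Kraft sum equals 1), so it is optimal for the implied distribution $P(a) = 1/2$, $P(b) = 1/4$, $P(c) = P(d) = 1/8$.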
MDL Consistency
Statement
Assume the true data-generating distribution $P^*$ lies in some model in the (countable, identifiable) model class, and that the model code satisfies the Kraft inequality $\sum_M 2^{-L(M)} \le 1$ with $L(M) < \infty$ for every $M$. Under a two-part MDL code, as the sample size $n \to \infty$, the MDL-selected model converges (in probability) to the simplest model in the class that contains $P^*$.
Intuition
Kraft alone is not enough: it only ensures the model code is decodable. The work is done by realizability and identifiability. A wrong model pays a per-sample cross-entropy excess that grows linearly with $n$, while the correct model pays only the fixed cost $L(M)$ plus an $O(\log n)$ parameter-encoding term. For large $n$ the linear penalty dominates the logarithmic one, so wrong models lose. Among models that contain $P^*$, the Kraft-constrained code length breaks ties in favor of the simplest.
Proof Sketch
For any model $M$ that does not contain the true distribution $P^*$, the per-sample code length converges to the cross-entropy $H(P^*, M) > H(P^*)$. The true model achieves $H(P^*)$ plus an $O(\log n / n)$ per-sample penalty term. For large $n$, the excess cross-entropy of the wrong model, accumulating linearly in $n$, exceeds the complexity penalty of the right one.
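The linear-versus-logarithmic race can be simulated. The sketch below assumes a toy two-model class: Bernoulli data encoded under the true model (MLE plus roughly $\tfrac{1}{2}\log_2 n$ bits for the parameter) versus a fixed wrong uniform model (no parameters, one bit per sample); model names and costs are illustrative:

```python
import math
import random

random.seed(0)
p_true = 0.8  # true Bernoulli parameter (assumed for illustration)

def codelen_true_model(xs):
    """Two-part code: ~(1/2) log2 n bits for the estimated parameter,
    plus the -log2 likelihood of the data under the MLE."""
    n, k = len(xs), sum(xs)
    p_hat = min(max(k / n, 1e-9), 1 - 1e-9)  # clamp away from 0 and 1
    data_bits = -(k * math.log2(p_hat) + (n - k) * math.log2(1 - p_hat))
    return 0.5 * math.log2(n) + data_bits

def codelen_wrong_model(xs):
    """Fixed uniform model: no parameter cost, exactly 1 bit per sample."""
    return float(len(xs))

# The wrong model's excess grows linearly in n; the true model's
# parameter cost grows only logarithmically, so it wins for large n.
for n in (10, 100, 10_000):
    xs = [1 if random.random() < p_true else 0 for _ in range(n)]
    print(n, round(codelen_true_model(xs), 1), codelen_wrong_model(xs))
```

For $p = 0.8$ the per-sample gap is $1 - H(0.8) \approx 0.28$ bits, so the true model pulls ahead quickly.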
Why It Matters
This justifies MDL as a model selection principle: it is consistent (picks the right model eventually) and implements Occam's razor automatically. You do not need to specify a separate regularization parameter.
Failure Mode
Consistency is an asymptotic property. For finite samples, MDL can select a model that is too simple (if the complexity penalty overwhelms the data-fit term) or too complex (if the model class encoding is poorly chosen). The result also requires the true distribution to lie within the model class, which is never exactly true in practice.
Connection to Learning Theory
The compression view of learning: if a learning algorithm can compress the training data (describe it in fewer bits than the raw data), then the patterns it found are likely real and will generalize. More precisely, if a hypothesis $h$ allows encoding the $m$ training labels in $k$ bits, and this is less than the $m$ bits it costs to encode the labels with no model, then $h$ has captured structure.
This intuition is made precise by Occam-style PAC bounds. For a finite or prefix-coded hypothesis class with code length $|h|$ in bits, with probability at least $1 - \delta$ over a sample of size $m$, every hypothesis $h$ satisfies

$$L_D(h) \le L_S(h) + \sqrt{\frac{|h| \ln 2 + \ln(2/\delta)}{2m}}$$

in the agnostic setting, where $L_D$ is the true risk and $L_S$ the empirical risk (the realizable case gives the analogous $O((|h| + \ln(1/\delta))/m)$ rate). What "compresses by $k$ bits" buys is a smaller $|h|$ in the numerator, hence a tighter excess-risk bound. There is no general rule beyond this: the right form is the Shawe-Taylor / Langford-style PAC bound above, with hypothesis cost measured by code length.
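To see how description length enters the bound numerically, the sketch below evaluates the excess-risk term for a few code lengths; the constants follow the standard Occam/MDL form above, but exact constants vary by source:

```python
import math

def occam_bound_gap(code_len_bits: int, m: int, delta: float = 0.05) -> float:
    """Excess-risk term of the Occam/MDL bound:
    sqrt((|h| ln 2 + ln(2/delta)) / (2m)).
    A hypothesis that compresses (small |h|) gets a tighter bound."""
    return math.sqrt(
        (code_len_bits * math.log(2) + math.log(2 / delta)) / (2 * m)
    )

# The gap shrinks as the hypothesis description shortens and as the
# sample grows; it scales like sqrt(|h| / m).
m = 10_000
for bits in (10, 100, 1000):
    print(bits, round(occam_bound_gap(bits, m), 4))
```

Quadrupling the sample halves the gap; halving the code length helps by roughly a factor of $\sqrt{2}$ once $|h|$ dominates the $\ln(2/\delta)$ term.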
Why MDL Is Hard to Operationalize
The elegance of MDL hides practical difficulties:
- Code design: the encoding of models and data given models is not unique. Different encodings lead to different model selections.
- Continuous parameters: encoding real-valued parameters requires discretization, and the grid resolution affects the result.
- Model class: MDL selects among a given collection of models. It cannot search over all possible models (that would require computing $K$).
In practice, BIC and cross-validation are used more often than explicit MDL, partly because they avoid the coding construction.
Common Confusions
Kolmogorov complexity is not Shannon entropy
Shannon entropy is a property of a distribution. Kolmogorov complexity is a property of a specific string. For a random string $x$ of length $n$ drawn i.i.d. from distribution $P$, $\mathbb{E}[K(x)] = nH(P)$ up to an $O(\log n)$ term, but individual strings can have $K(x)$ far from $nH(P)$.
MDL is not just penalized likelihood
While MDL can be viewed as penalized likelihood (since code lengths correspond to log-probabilities), the MDL philosophy is different. Penalized likelihood assumes a true model exists and tries to find it. MDL makes no such assumption: it seeks the best compression regardless of whether any model is "true." In the misspecified setting, MDL selects the model that compresses best, which may not be the most "likely" in any generative sense.
Canonical Examples
MDL for polynomial regression
Suppose you fit polynomials of degree 1, 2, 5, and 20 to 50 data points. The degree-20 polynomial fits the training data almost perfectly (near-zero residuals, so $L(D \mid M) \approx 0$), but encoding 21 coefficients to high precision costs many bits ($L(M)$ is large). A degree-2 polynomial has small $L(M)$ (3 coefficients) but nonzero residuals that require some bits to encode. MDL selects the degree that minimizes the total: if the true relationship is quadratic, degree 2 wins for large enough $n$.
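This example can be sketched numerically. The data-generating quadratic, the 32-bit-per-coefficient quantization cost, and the Gaussian residual code are all assumptions for illustration, not a canonical MDL construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a quadratic truth (assumed for illustration).
n = 50
x = np.linspace(-1, 1, n)
y = 2 * x**2 - x + 1 + 0.1 * rng.standard_normal(n)

BITS_PER_COEFF = 32  # crude quantization cost per real-valued parameter

def description_length(degree: int) -> float:
    """Two-part MDL score L(M) + L(D|M) in bits, up to constants.
    L(M): bits for the coefficients; L(D|M): Gaussian code for residuals,
    which is ~(n/2) log2(RSS/n) up to additive terms shared by all models."""
    p = np.polynomial.Polynomial.fit(x, y, degree)
    rss = float(np.sum((y - p(x)) ** 2))
    l_model = BITS_PER_COEFF * (degree + 1)
    l_data = 0.5 * n * np.log2(rss / n)
    return l_model + l_data

scores = {d: description_length(d) for d in (1, 2, 5, 20)}
best = min(scores, key=scores.get)
print(scores)
print("MDL selects degree", best)  # expect 2 for this quadratic truth
```

Degree 1 pays heavily in residual bits, degree 20 pays heavily in coefficient bits, and degree 2 minimizes the sum, matching the argument above.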
Exercises
Problem
The string $x = (01)^{500}$ (the pattern "01" repeated 500 times, total length 1000 bits) has $K(x) \ll 1000$. Give an upper bound on $K(x)$ and explain why. A random 1000-bit string has $K(x) \approx 1000$ with high probability. Why?
Problem
In two-part MDL, you encode a polynomial regression model by specifying the degree $d$ and the $d + 1$ coefficients. Suppose each coefficient is quantized to $b$ bits. Write the total description length as a function of $d$, $b$, $n$ (sample size), and the residual sum of squares $\mathrm{RSS}$. Under what conditions does MDL prefer degree 2 over degree 10?
References
Canonical:
- Li & Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications (2008), Chapters 1-2
- Grunwald, The Minimum Description Length Principle (2007), Chapters 1-3, 17
Current:
- Rissanen, "Modeling by Shortest Data Description" (1978)
- Barron, Rissanen, Yu, "The Minimum Description Length Principle in Coding and Modeling" (1998)
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 2-6
- Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapters 2-4
Next Topics
- AIC and BIC: alternative model selection criteria with connections to MDL
- Algorithmic stability: another route from simplicity to generalization
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- No direct prerequisites are declared; this is treated as an entry point.
Derived topics
- AIC and BIC (layer 2 · tier 1)
- Algorithmic Stability (layer 3 · tier 1)