Foundations
Expectation, Variance, Covariance, and Moments
Expectation as average value, variance as squared spread, covariance as joint movement, and the moment identities used throughout statistics and ML.
Why This Matters
Expectation is the average value of a random variable over repeated draws. Variance measures how far that variable tends to move away from its mean. Covariance measures whether two variables move together.
Those three ideas show up constantly in ML. Expected loss is population risk. Variance of stochastic gradients controls how noisy SGD updates are. Covariance matrices describe the geometry of data clouds and appear in PCA, Gaussian processes, whitening, natural gradients, and Kalman filters.
Quick Version
| Object | Plain meaning | What it answers |
|---|---|---|
| $\mathbb{E}[X]$ | long-run average value | Where is the center? |
| $\operatorname{Var}(X)$ | average squared distance from the mean | How spread out is $X$? |
| $\operatorname{Cov}(X, Y)$ | average product of centered deviations | Do $X$ and $Y$ move together? |
| $\operatorname{Corr}(X, Y)$ | scale-free covariance | How strong is the linear relationship? |
| $\mathbb{E}[X^k]$ | $k$-th moment | What shape information survives averaging? |
Do not treat these as formulas to memorize separately. They are all averages of transformed random variables. Once expectation is understood, variance and covariance are just expectation applied to centered products.
A tiny distribution shows all three ideas
Let $X$ take values $a$ and $b$ with probabilities $p$ and $1-p$. Its mean is the probability-weighted average:
$$\mathbb{E}[X] = p\,a + (1-p)\,b.$$
Variance averages squared deviations from that mean:
$$\operatorname{Var}(X) = p\,(a - \mathbb{E}[X])^2 + (1-p)\,(b - \mathbb{E}[X])^2.$$
The standard deviation is $\sqrt{\operatorname{Var}(X)}$. The mean tells you where the distribution balances; the variance tells you how far the mass sits from that balance point.
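A quick numerical check of these two formulas, using a made-up two-point distribution (the specific values and probabilities below are illustrative, not from the text):

```python
import numpy as np

# Illustrative two-point distribution: X takes values a, b with probabilities p, 1-p.
a, b, p = 0.0, 10.0, 0.8
values = np.array([a, b])
probs = np.array([p, 1.0 - p])

mean = np.sum(probs * values)               # E[X] = p*a + (1-p)*b = 2.0
var = np.sum(probs * (values - mean) ** 2)  # Var(X) = E[(X - E[X])^2] = 16.0
std = np.sqrt(var)                          # 4.0

# A Monte Carlo estimate from repeated draws should agree.
rng = np.random.default_rng(0)
samples = rng.choice(values, size=1_000_000, p=probs)
print(mean, var, std)
print(samples.mean(), samples.var())
```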
Visual: Covariance Geometry
Core Definitions
Expectation
For a discrete random variable: $\mathbb{E}[X] = \sum_x x\,\Pr(X = x)$. For a continuous random variable with density $f$: $\mathbb{E}[X] = \int_{-\infty}^{\infty} x\,f(x)\,dx$. The expectation exists when the sum or integral is absolutely convergent.
More generally, for a measurable function $g$: $\mathbb{E}[g(X)] = \sum_x g(x)\,\Pr(X = x)$ in the discrete case and $\mathbb{E}[g(X)] = \int g(x)\,f(x)\,dx$ in the continuous case.
Variance
The variance measures spread around the mean:
$$\operatorname{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2.$$
The second form (computational formula) is often easier to use. The standard deviation is $\sigma = \sqrt{\operatorname{Var}(X)}$.
Covariance
The covariance measures linear association between two random variables:
$$\operatorname{Cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y].$$
For a random vector $X = (X_1, \dots, X_d)$, the covariance matrix $\Sigma$ has entries $\Sigma_{ij} = \operatorname{Cov}(X_i, X_j)$.
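As a sketch of how these entries are estimated from data, `np.cov` computes exactly the sample version of $\Sigma_{ij} = \operatorname{Cov}(X_i, X_j)$; the synthetic data below is an illustrative choice, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data cloud: y is a noisy linear function of x.
n = 100_000
x = rng.normal(0.0, 1.0, n)
y = 0.6 * x + rng.normal(0.0, 0.8, n)
data = np.stack([x, y])      # shape (2, n): one variable per row

sigma = np.cov(data)         # sample covariance matrix
print(sigma)
# Diagonal ~ Var(x) = 1.0 and Var(y) = 0.6**2 + 0.8**2 = 1.0;
# off-diagonal ~ Cov(x, y) = 0.6.
```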
Correlation
The Pearson correlation normalizes covariance to $[-1, 1]$:
$$\operatorname{Corr}(X, Y) = \rho_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}}.$$
$|\rho_{XY}| = 1$ iff $Y$ is an affine function of $X$ almost surely. $\rho_{XY} = 0$ means uncorrelated (not necessarily independent).
Moments
The $k$-th moment of $X$ is $\mathbb{E}[X^k]$. The $k$-th central moment is $\mathbb{E}[(X - \mu)^k]$, where $\mu = \mathbb{E}[X]$. The third central moment (normalized) is skewness. The fourth (normalized) is kurtosis. Heavy-tailed distributions have large kurtosis.
Covariance vs Correlation
Covariance and correlation both measure linear association, but they serve different purposes. Covariance is a bilinear form that participates in algebraic computations: the variance of a sum formula, the covariance matrix in PCA, the Kalman gain equation. Its magnitude depends on the scale of the variables, so a raw covariance value is meaningless without knowing the units.
Correlation normalizes away scale: $\operatorname{Corr}(aX, bY) = \operatorname{Corr}(X, Y)$ for any $a, b > 0$, regardless of units. Use correlation for interpretation (how strong is the linear relationship?) and covariance for computation (what is the variance of a portfolio return?).
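A small sketch of that division of labor (the height/weight numbers are invented, purely illustrative): changing units rescales the covariance but leaves the correlation untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
height_m = rng.normal(1.7, 0.1, 50_000)                       # meters
weight_kg = 40 + 25 * height_m + rng.normal(0, 5, 50_000)     # kilograms

print(np.cov(height_m, weight_kg)[0, 1])                      # covariance in m*kg
print(np.cov(100 * height_m, 1000 * weight_kg)[0, 1])         # cm*g: 100_000x larger

print(np.corrcoef(height_m, weight_kg)[0, 1])                 # unchanged
print(np.corrcoef(100 * height_m, 1000 * weight_kg)[0, 1])    # unchanged
```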
Two critical points. First, $\rho = 0$ does not imply independence. It only rules out linear dependence. Second, correlation measures linear association only. Variables with strong nonlinear dependence can have $\rho = 0$. For broader dependence measures, see mutual information or rank correlations (Spearman, Kendall).
Key Properties
Linearity of Expectation
Statement
$$\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]$$
This extends to any finite sum: $\mathbb{E}\!\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathbb{E}[X_i]$.
Intuition
No independence or uncorrelatedness is required. This holds for arbitrary dependence structure. It is the single most-used property in probabilistic analysis.
Proof Sketch
For continuous random variables with joint density $f(x, y)$:
$$\mathbb{E}[aX + bY] = \iint (ax + by)\,f(x, y)\,dx\,dy.$$
Split the integral: $a\iint x\,f(x, y)\,dx\,dy + b\iint y\,f(x, y)\,dx\,dy$. The first integral is $\mathbb{E}[X]$ (integrate out $y$ to get the marginal of $X$), the second is $\mathbb{E}[Y]$.
Why It Matters
Linearity makes expected value tractable even for complex random variables. To compute $\mathbb{E}[N]$ where $N$ counts something complicated, decompose $N$ into indicator random variables, $N = \sum_i \mathbf{1}_{A_i}$, and sum: $\mathbb{E}[N] = \sum_i \Pr(A_i)$. This trick solves problems in combinatorics, algorithm analysis, and randomized methods where computing the joint distribution would be intractable.
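One classic instance of the indicator trick (my example, not from the text): the expected number of fixed points of a uniformly random permutation. Writing the count as a sum of indicators gives $\mathbb{E}[N] = n \cdot \frac{1}{n} = 1$ by linearity, with no joint distribution needed; a short simulation agrees.

```python
import numpy as np

# N = number of fixed points of a random permutation of {0, ..., n-1}.
# N = sum_i 1{perm[i] == i}; each indicator has expectation 1/n, so E[N] = 1.
rng = np.random.default_rng(0)
n, trials = 50, 200_000

counts = np.empty(trials)
for t in range(trials):
    perm = rng.permutation(n)
    counts[t] = np.count_nonzero(perm == np.arange(n))

print(counts.mean())   # close to 1, the exact answer by linearity
```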
Failure Mode
Linearity does not hold for variance, entropy, or other nonlinear functionals of distributions. $\operatorname{Var}(X + Y) \neq \operatorname{Var}(X) + \operatorname{Var}(Y)$ in general.
Variance scaling: $\operatorname{Var}(aX + b) = a^2\,\operatorname{Var}(X)$. Adding a constant shifts the mean but does not change spread. Scaling by $a$ scales variance by $a^2$.
Covariance bilinearity: $\operatorname{Cov}(aX + bY, Z) = a\,\operatorname{Cov}(X, Z) + b\,\operatorname{Cov}(Y, Z)$. Covariance is bilinear, making it an inner product on the space of zero-mean, finite-variance random variables.
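Both identities are easy to sanity-check numerically; a minimal sketch with arbitrary constants (sample variances and covariances satisfy the same algebra, so the two sides agree up to floating-point error):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100_000)
y = rng.normal(size=100_000)
z = 0.5 * x + rng.normal(size=100_000)
a, b = 3.0, -2.0

# Var(aX + b) = a^2 Var(X): the constant shift b drops out.
print(np.var(a * x + b), a**2 * np.var(x))

# Cov(aX + bY, Z) = a Cov(X, Z) + b Cov(Y, Z).
lhs = np.cov(a * x + b * y, z)[0, 1]
rhs = a * np.cov(x, z)[0, 1] + b * np.cov(y, z)[0, 1]
print(lhs, rhs)
```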
Law of Total Variance (Eve's Law)
Statement
$$\operatorname{Var}(X) = \mathbb{E}\big[\operatorname{Var}(X \mid Y)\big] + \operatorname{Var}\big(\mathbb{E}[X \mid Y]\big)$$
Intuition
Total variance decomposes into two sources. $\mathbb{E}[\operatorname{Var}(X \mid Y)]$ is the average variance within each level of $Y$ (unexplained variance). $\operatorname{Var}(\mathbb{E}[X \mid Y])$ is the variance of the conditional mean across levels of $Y$ (explained variance). If knowing $Y$ perfectly predicts $X$, the first term is zero. If $Y$ is useless, the second term is zero.
Proof Sketch
Start from $\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$. Apply the law of total expectation to $\mathbb{E}[X^2]$: $\mathbb{E}[X^2] = \mathbb{E}\big[\mathbb{E}[X^2 \mid Y]\big]$. Write $\mathbb{E}[X^2 \mid Y] = \operatorname{Var}(X \mid Y) + (\mathbb{E}[X \mid Y])^2$. Substitute and regroup to get $\operatorname{Var}(X) = \mathbb{E}[\operatorname{Var}(X \mid Y)] + \mathbb{E}\big[(\mathbb{E}[X \mid Y])^2\big] - \big(\mathbb{E}[\mathbb{E}[X \mid Y]]\big)^2$. The last two terms equal $\operatorname{Var}(\mathbb{E}[X \mid Y])$.
Why It Matters
This decomposition is the theoretical basis for ANOVA, the bias-variance decomposition, and hierarchical models. In random effects models, it separates within-group and between-group variation.
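A simulation sketch of the decomposition, using a two-group mixture whose parameters are my own illustrative choices: the total variance of $X$ matches the within-group term plus the between-group term.

```python
import numpy as np

rng = np.random.default_rng(3)
p = np.array([0.3, 0.7])     # P(Y = 0), P(Y = 1)
mu = np.array([0.0, 5.0])    # E[X | Y = g]
sd = np.array([1.0, 2.0])    # Std(X | Y = g)

n = 1_000_000
y = rng.choice([0, 1], size=n, p=p)
x = rng.normal(mu[y], sd[y])

within = np.sum(p * sd**2)                          # E[Var(X | Y)]
between = np.sum(p * (mu - np.sum(p * mu))**2)      # Var(E[X | Y])
print(x.var(), within + between)                    # both approximately 8.35
```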
Failure Mode
Requires $\operatorname{Var}(X) < \infty$. The conditional variance $\operatorname{Var}(X \mid Y)$ is itself a random variable (a function of $Y$), not a number.
Chebyshev's Inequality
Statement
For $X$ with mean $\mu$, variance $\sigma^2$, and any $t > 0$:
$$\Pr(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2}$$
Equivalently, $\Pr(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$ for any $k > 0$.
Intuition
A random variable with small variance cannot deviate far from its mean with high probability. The bound is distribution-free: it holds for any distribution with finite variance.
Proof Sketch
Apply Markov's inequality to the nonnegative random variable $(X - \mu)^2$:
$$\Pr\big((X - \mu)^2 \geq t^2\big) \leq \frac{\mathbb{E}[(X - \mu)^2]}{t^2} = \frac{\sigma^2}{t^2}.$$
Since $|X - \mu| \geq t$ iff $(X - \mu)^2 \geq t^2$, the result follows.
Why It Matters
Chebyshev is the simplest concentration inequality. It proves the weak law of large numbers in two lines: apply Chebyshev to $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ with $\operatorname{Var}(\bar{X}_n) = \sigma^2/n$, getting $\Pr(|\bar{X}_n - \mu| \geq \epsilon) \leq \sigma^2/(n\epsilon^2) \to 0$.
Failure Mode
The bound is loose for specific distributions. For Gaussians, the true tail probability $\Pr(|X - \mu| \geq t)$ decays like $e^{-t^2/(2\sigma^2)}$, while Chebyshev only gives $\sigma^2/t^2$. Tighter bounds require distributional assumptions; see concentration inequalities for Hoeffding, Bernstein, and sub-Gaussian bounds.
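A quick comparison for the standard normal (a simulation with my own choice of thresholds) shows how conservative the bound is:

```python
import numpy as np

# Chebyshev bound sigma^2 / t^2 vs the empirical Gaussian tail P(|X| >= t), sigma = 1.
rng = np.random.default_rng(4)
x = rng.normal(size=2_000_000)

for t in [1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(x) >= t)
    bound = min(1.0, 1.0 / t**2)
    print(f"t={t}: empirical {empirical:.4f}   Chebyshev bound {bound:.3f}")
# The bound always holds but is far from tight, e.g. ~0.0027 vs 0.111 at t = 3.
```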
Higher Moments and Moment Generating Functions
The $k$-th moment $\mathbb{E}[X^k]$ and the $k$-th central moment $\mathbb{E}[(X - \mu)^k]$ capture progressively finer distributional information.
Skewness (third standardized central moment): $\gamma_1 = \mathbb{E}[(X - \mu)^3]/\sigma^3$. Positive skewness indicates a right tail heavier than the left. Zero for any symmetric distribution.
Kurtosis (fourth standardized central moment): $\kappa = \mathbb{E}[(X - \mu)^4]/\sigma^4$. The Gaussian has $\kappa = 3$. Excess kurtosis ($\kappa - 3$) measures tail heaviness relative to the Gaussian. Heavy-tailed distributions (relevant to financial returns, gradient noise) have large excess kurtosis.
Moment generating function (MGF): $M_X(t) = \mathbb{E}[e^{tX}]$, defined for $t$ in a neighborhood of zero. When it exists, the MGF uniquely determines the distribution. Its utility: $\mathbb{E}[X^k] = M_X^{(k)}(0)$, so all moments are encoded in one function. The MGF of a sum of independent random variables is the product of their MGFs, which is the standard tool for proving the central limit theorem. For distributions where the MGF does not exist (e.g., Cauchy, log-normal), use the characteristic function instead, which always exists.
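A symbolic sketch of the moment-extraction identity $\mathbb{E}[X^k] = M_X^{(k)}(0)$, using the Gaussian MGF $M_X(t) = \exp(\mu t + \sigma^2 t^2/2)$ (a standard closed form, used here only as an illustration):

```python
import sympy as sp

t, mu = sp.symbols("t mu", real=True)
sigma = sp.symbols("sigma", positive=True)

# MGF of X ~ N(mu, sigma^2).
M = sp.exp(mu * t + sigma**2 * t**2 / 2)

# E[X^k] is the k-th derivative of the MGF evaluated at t = 0.
for k in range(1, 5):
    print(k, sp.expand(sp.diff(M, t, k).subs(t, 0)))
# 1: mu
# 2: mu**2 + sigma**2
# 3: mu**3 + 3*mu*sigma**2
# 4: mu**4 + 6*mu**2*sigma**2 + 3*sigma**4
```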
Main Theorems
Variance of a Sum
Statement
$$\operatorname{Var}\!\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \operatorname{Var}(X_i) + 2\sum_{i < j}\operatorname{Cov}(X_i, X_j)$$
If $X_1, \dots, X_n$ are pairwise uncorrelated, this reduces to $\operatorname{Var}\!\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \operatorname{Var}(X_i)$.
Intuition
Variance of a sum depends on both individual variances and how the variables co-vary. Positive correlations inflate the total variance; negative correlations reduce it.
Proof Sketch
Let $S = \sum_{i=1}^{n} X_i$ and $\mu_i = \mathbb{E}[X_i]$. Then $\operatorname{Var}(S) = \mathbb{E}\big[\big(\sum_i (X_i - \mu_i)\big)^2\big]$. Expanding the square gives $\sum_i \mathbb{E}[(X_i - \mu_i)^2] + 2\sum_{i<j}\mathbb{E}[(X_i - \mu_i)(X_j - \mu_j)]$ by linearity of expectation.
Why It Matters
For i.i.d. random variables, $\operatorname{Var}(\bar{X}_n) = \sigma^2/n$. This is why averaging reduces noise and is the basis for the $1/\sqrt{n}$ convergence rate in the central limit theorem. In SGD, minibatch averaging reduces gradient variance by a factor of the batch size.
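A minimal sketch of that variance reduction (scalar stand-ins for per-example gradients; the heavy-tailed noise distribution is just an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(5)
# Rows are "steps", columns are per-example gradient noise terms with finite variance.
per_example = rng.standard_t(df=5, size=(100_000, 64))

for batch in [1, 4, 16, 64]:
    minibatch_grad = per_example[:, :batch].mean(axis=1)
    print(batch, minibatch_grad.var())
# The printed variance falls roughly as 1 / batch size.
```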
Failure Mode
Requires finite second moments. For heavy-tailed distributions (e.g., Cauchy; see common probability distributions), variance is infinite and this formula is meaningless. Pairwise uncorrelated does not imply independent: the simplification holds under the weaker pairwise uncorrelated condition, but other properties (e.g., concentration inequalities) may need full independence.
Common Confusions
Uncorrelated does not imply independent
Let $X$ be symmetric about zero (say $X \sim \mathcal{N}(0, 1)$) and $Y = X^2$. Then $\operatorname{Cov}(X, Y) = \mathbb{E}[X^3] - \mathbb{E}[X]\,\mathbb{E}[X^2] = 0$, so $X$ and $Y$ are uncorrelated. But $Y$ is a deterministic function of $X$, so they are maximally dependent.
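The same example runs in a few lines (standard normal chosen for concreteness): the sample correlation of $X$ and $X^2$ is essentially zero even though one determines the other.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=1_000_000)
y = x**2

print(np.corrcoef(x, y)[0, 1])           # ~ 0: uncorrelated
print(np.corrcoef(np.abs(x), y)[0, 1])   # far from 0: the dependence is nonlinear
```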
E[XY] = E[X]E[Y] requires independence (or uncorrelatedness)
The factorization $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$ holds when $X$ and $Y$ are uncorrelated (equivalently, $\operatorname{Cov}(X, Y) = 0$). Independence implies uncorrelatedness, but not vice versa. For nonlinear functions: $\mathbb{E}[g(X)h(Y)] = \mathbb{E}[g(X)]\,\mathbb{E}[h(Y)]$ requires independence, not just uncorrelatedness.
Variance is not linear
$\operatorname{Var}(X + Y) \neq \operatorname{Var}(X) + \operatorname{Var}(Y)$ unless $X$ and $Y$ are uncorrelated. The cross-term $2\operatorname{Cov}(X, Y)$ is often forgotten.
Exercises
Problem
Let $X_1, \dots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. Compute $\mathbb{E}[\bar{X}_n]$ and $\operatorname{Var}(\bar{X}_n)$, where $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$.
Problem
Let $X$ and $Y$ have finite second moments. Prove that $|\operatorname{Cov}(X, Y)| \leq \sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}$, i.e., $|\operatorname{Corr}(X, Y)| \leq 1$.
References
Canonical:
- Grimmett & Stirzaker, Probability and Random Processes (2020), Chapter 3
- Casella & Berger, Statistical Inference (2002), Chapter 2
- Billingsley, Probability and Measure (1995), Chapters 5 and 21 (expectation, moments, MGFs)
- Feller, An Introduction to Probability Theory and Its Applications, Vol. 2 (1971), Chapter XV (moments, characteristic functions)
For ML context:
- Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Chapter 6
- Blitzstein & Hwang, Introduction to Probability (2019), Chapters 4 and 7 (expectation, joint distributions, covariance)