
Moment Generating Functions

Moment generating functions encode moments, control light-tailed behavior, and power Chernoff bounds, sub-Gaussian estimates, and exponential-family theory.


Why This Matters

The moment generating function is the main workhorse connecting light-tailed probability distributions to concentration inequalities. When the MGF exists in a neighborhood of zero (sub-Gaussian, sub-exponential, bounded), it powers the Chernoff method, exponential tilting, and the sub-Gaussian / sub-exponential machinery. When it does not exist (Cauchy, heavy-tailed power laws), the characteristic function is the strictly more general tool, and concentration results require different machinery (Chebyshev, truncation, Nemirovski-style norms).

Overview infographic (five panels): the definition of the MGF $M_X(t) = \mathbb{E}[e^{tX}]$, why it generates all moments via differentiation at zero, examples for the Gaussian, exponential, and Poisson distributions, the uniqueness theorem connecting the MGF to the distribution, and why Chernoff bounds use the MGF (exponential Markov yields exponential tail decay).
The moment generating function is a single function whose derivatives at zero recover every moment of a distribution. Its log is the cumulant generating function.

The Chernoff method is: apply Markov's inequality to $e^{tX}$ and optimize over $t$. This is an MGF computation.

A random variable $X$ is sub-Gaussian with parameter $\sigma$ if and only if the MGF of the centered variable satisfies $\mathbb{E}\!\left[e^{t(X - \mathbb{E}[X])}\right] \leq e^{\sigma^2 t^2 / 2}$ for all $t \in \mathbb{R}$. The simpler form $M_X(t) \leq e^{\sigma^2 t^2 / 2}$ holds only when $\mathbb{E}[X] = 0$. Within the sub-Gaussian world, the entire concentration story reduces to bounding this centered MGF.
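As a quick sanity check of this condition (an illustrative sketch, not part of the formal development; it assumes NumPy), a Rademacher variable is mean-zero and bounded, its exact MGF is $\cosh(t)$, and that MGF stays below the sub-Gaussian envelope $e^{t^2/2}$ with $\sigma = 1$:

```python
# Illustrative check: the Rademacher MGF cosh(t) never exceeds the
# sub-Gaussian envelope exp(t^2 / 2) (sigma = 1).
import numpy as np

t = np.linspace(-5, 5, 201)
mgf = np.cosh(t)               # exact MGF of X = +1/-1 with probability 1/2 each
envelope = np.exp(t**2 / 2)    # sub-Gaussian bound with sigma = 1

assert np.all(mgf <= envelope + 1e-12)
print("max ratio cosh(t) / exp(t^2/2):", np.max(mgf / envelope))  # 1.0, attained at t = 0
```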

Exponential families are distributions whose density has the form $\exp(\theta^\top T(x) - A(\theta))$, so they inherit the algebraic properties of the exponential function. The function $A(\theta)$ is the log-MGF of the sufficient statistic.

Want the moving picture first? The Chernoff / MGF Tilt Lab lets you drag the threshold and tilt parameter yourself, so you can see when the MGF gives a real certificate and when heavy tails break the method.

Quick Version

Object | Plain meaning | Why it matters
--- | --- | ---
$M_X(t) = \mathbb{E}[e^{tX}]$ | Exponential average of $X$ | Encodes moments and tail sensitivity
$K_X(t) = \log M_X(t)$ | Log-MGF / cumulant generator | Adds cleanly under independent sums
Chernoff method | Apply Markov to $e^{tX}$ | Turns MGF control into tail bounds
Failure mode | MGF infinite for $t > 0$ | Heavy tails need different tools

The MGF is not just "a fancy way to store moments." It is the object that lets light-tailed distributions talk to concentration inequalities.

Visual Intuition

Interactive figure: Chernoff Tilt. The upper-tail bound becomes tight when you choose the slope that best tilts the log-MGF; the best exponential upper-tail bound comes from the tilt that minimizes the tilted intercept. Blue: log-MGF; amber: tilted line; green: optimizer and gap.

Read the picture this way:

  • the blue curve is the log-MGF / cumulant generator
  • the amber line is the exponential tilt you chose
  • the vertical gap is the rate that powers the upper-tail bound

This is the geometric core of the Chernoff method. If the log-MGF is finite and well behaved, you can tilt and optimize. If the MGF blows up, the whole certificate disappears.
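For a concrete instance of tilt-and-optimize, here is a short numerical sketch (assumptions: a standard normal, so $K(t) = t^2/2$, and an arbitrary threshold $a = 1.4$). The maximized gap $\sup_{t > 0}(ta - K(t))$ should come out to $a^2/2$ with optimizer $t^\ast = a$.

```python
# A small sketch of the tilt-and-optimize step for a standard normal, where
# the log-MGF is K(t) = t^2 / 2. The Chernoff rate is sup_t [t*a - K(t)],
# which should equal a^2 / 2 with optimizer t* = a. Illustrative only.
import numpy as np

def K(t):
    return t**2 / 2.0            # log-MGF of N(0, 1)

a = 1.4                          # tail threshold
t_grid = np.linspace(0, 10, 100001)
gap = t_grid * a - K(t_grid)     # vertical gap between the tilted line and K

t_star = t_grid[np.argmax(gap)]
print("numeric optimizer t* =", t_star, "(analytic: a =", a, ")")
print("numeric rate         =", gap.max(), "(analytic: a^2/2 =", a**2 / 2, ")")
print("Chernoff bound P(X >= a) <= exp(-rate) =", np.exp(-gap.max()))
```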

Core Definitions

Definition

Moment Generating Function

The moment generating function of a random variable $X$ is:

$$M_X(t) = \mathbb{E}[e^{tX}]$$

defined for all $t \in \mathbb{R}$ where this expectation is finite. The MGF may not exist for all $t$; when it exists in an open interval around 0, it determines the distribution uniquely.

Extracting Moments

The $k$-th moment of $X$ is the $k$-th derivative of $M_X$ at zero:

$$\mathbb{E}[X^k] = M_X^{(k)}(0)$$

This follows from differentiating under the expectation:

$$M_X^{(k)}(t) = \mathbb{E}[X^k e^{tX}]$$

and evaluating at $t = 0$. The interchange of differentiation and expectation is valid when $M_X$ exists in a neighborhood of $t$.

In particular: $\mathbb{E}[X] = M_X'(0)$ and $\mathbb{E}[X^2] = M_X''(0)$.
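A small symbolic sketch of this recipe (using SymPy; the Exponential($\lambda$) example is only an illustration): differentiate the MGF at zero and read off the moments $k!/\lambda^k$.

```python
# Hedged sketch: recover moments of an Exponential(rate = lam) variable by
# differentiating its MGF lam / (lam - t) at t = 0, using SymPy.
import sympy as sp

t, lam = sp.symbols("t lambda", positive=True)
M = lam / (lam - t)     # MGF of Exponential(lam), valid for t < lam

moments = [sp.simplify(sp.diff(M, t, k).subs(t, 0)) for k in range(1, 5)]
print(moments)          # [1/lambda, 2/lambda**2, 6/lambda**3, 24/lambda**4]
# Matches E[X^k] = k! / lambda^k for the exponential distribution.
```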

Main Theorems

Theorem

MGF Uniqueness Theorem

Statement

If two random variables $X$ and $Y$ have moment generating functions $M_X(t)$ and $M_Y(t)$ that are finite and equal for all $t$ in some open interval $(-\delta, \delta)$ with $\delta > 0$, then $X$ and $Y$ have the same distribution.

Intuition

The MGF encodes the entire distribution, not just the moments. If two distributions agree on their MGFs in a neighborhood of zero, they must be the same distribution. This is stronger than moment matching: there exist distinct distributions with identical moments of all orders, but they cannot have identical MGFs in a neighborhood of zero.

Proof Sketch

The MGF is related to the characteristic function $\varphi_X(t) = \mathbb{E}[e^{itX}]$ by analytic continuation. If $M_X(t)$ is finite on $(-\delta, \delta)$, the characteristic function extends analytically to a strip in the complex plane. By the uniqueness theorem for characteristic functions (Lévy inversion), the distribution is determined.

Why It Matters

This theorem justifies the "MGF technique" for identifying distributions. If you compute the MGF of a sum of independent Gaussians and recognize it as the MGF of another Gaussian, you can conclude the sum is Gaussian. This approach is cleaner than convolution arguments.
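As a sketch of this technique (the symbols and parameterization below are purely illustrative), one can check on the log scale that adding the CGFs of two independent Gaussians gives exactly the CGF of $N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$:

```python
# Sketch of the "MGF technique" for sums: for independent Gaussians the MGFs
# multiply, so their log-MGFs (CGFs) add. Checking on the log scale reduces
# the claim to a polynomial identity. Symbols are illustrative.
import sympy as sp

t, m1, m2 = sp.symbols("t mu1 mu2", real=True)
s1, s2 = sp.symbols("sigma1 sigma2", positive=True)

def gaussian_cgf(mu, var):
    # log-MGF of N(mu, var): mu*t + var*t^2/2
    return mu * t + var * t**2 / 2

sum_cgf = gaussian_cgf(m1, s1**2) + gaussian_cgf(m2, s2**2)   # CGFs add for independent sums
target = gaussian_cgf(m1 + m2, s1**2 + s2**2)                 # CGF of N(mu1+mu2, sigma1^2+sigma2^2)
print(sp.simplify(sum_cgf - target))                          # 0, so the sum is Gaussian
```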

Failure Mode

The MGF must exist in a neighborhood of 0, not just at 0 (where it always equals 1). Heavy-tailed distributions like the Cauchy distribution have no MGF. For those distributions, use the characteristic function instead, which always exists.

Proposition

MGF of Independent Sum

Statement

If $X$ and $Y$ are independent random variables whose MGFs exist, then:

$$M_{X+Y}(t) = M_X(t) \cdot M_Y(t)$$

Intuition

Independence means $\mathbb{E}[g(X)h(Y)] = \mathbb{E}[g(X)]\,\mathbb{E}[h(Y)]$. Apply this with $g(X) = e^{tX}$ and $h(Y) = e^{tY}$.

Proof Sketch

$M_{X+Y}(t) = \mathbb{E}[e^{t(X+Y)}] = \mathbb{E}[e^{tX} e^{tY}] = \mathbb{E}[e^{tX}]\,\mathbb{E}[e^{tY}] = M_X(t) M_Y(t)$, where the third equality uses independence.

Why It Matters

This is why MGFs are the natural tool for sums of independent variables. Addition of random variables corresponds to multiplication of MGFs. Taking logs: the cumulant generating function $\log M_X(t)$ is additive for independent sums.

Failure Mode

Fails without independence. For dependent variables, $\mathbb{E}[e^{tX}e^{tY}] \neq \mathbb{E}[e^{tX}]\,\mathbb{E}[e^{tY}]$ in general.
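A rough Monte Carlo check of the product rule (the distributions, parameters, and sample size below are arbitrary illustrations): simulate independent Poisson and Exponential variables and compare the empirical MGF of the sum with the product of the two closed forms.

```python
# Rough Monte Carlo check of M_{X+Y}(t) = M_X(t) M_Y(t) for independent
# X ~ Poisson(2) and Y ~ Exponential(rate 1), at a t where both MGFs are finite.
import numpy as np

rng = np.random.default_rng(0)
n, t = 1_000_000, 0.3

X = rng.poisson(lam=2.0, size=n)
Y = rng.exponential(scale=1.0, size=n)     # rate 1 corresponds to scale 1

empirical = np.mean(np.exp(t * (X + Y)))
closed_form = np.exp(2.0 * (np.exp(t) - 1.0)) * (1.0 / (1.0 - t))  # Poisson MGF * Exponential MGF

print(empirical, closed_form)              # both ~ 2.88
```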

Canonical Examples

Example

MGF of a Gaussian

Let $X \sim N(\mu, \sigma^2)$. Then:

$$M_X(t) = \mathbb{E}[e^{tX}] = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right)$$

This exists for all $t \in \mathbb{R}$. Setting $\mu = 0$: $M_X(t) = \exp(\sigma^2 t^2 / 2)$, which is the defining condition for sub-Gaussian random variables with parameter $\sigma$.
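A quick numerical confirmation of this closed form (illustrative parameters; assumes SciPy): integrate $e^{tx}$ against the Gaussian density and compare.

```python
# Illustrative check of the Gaussian MGF: integrate e^{tx} against the
# N(mu, sigma^2) density and compare with exp(mu*t + sigma^2 t^2 / 2).
import numpy as np
from scipy import integrate, stats

mu, sigma, t = 1.0, 2.0, 0.7
integrand = lambda x: np.exp(t * x) * stats.norm.pdf(x, loc=mu, scale=sigma)

numeric, _ = integrate.quad(integrand, -np.inf, np.inf)
closed = np.exp(mu * t + sigma**2 * t**2 / 2)
print(numeric, closed)    # both ~ 5.37
```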

Example

The Chernoff method in one line

For any $t > 0$: $P(X \geq a) = P(e^{tX} \geq e^{ta}) \leq e^{-ta} M_X(t)$. The first step is monotonicity of $\exp$; the second is Markov's inequality. Optimize over $t > 0$ to get the tightest bound. This is the entire Chernoff method.
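The same one-liner in code, as a sketch for $X \sim \mathrm{Exponential}(1)$ with $M_X(t) = 1/(1-t)$ on $t < 1$ (the threshold $a = 5$ is arbitrary). The optimizer should sit at $t^\ast = 1 - 1/a$, and the resulting bound is valid though not tight.

```python
# Sketch of the Chernoff method in code, for X ~ Exponential(1) with MGF
# M_X(t) = 1 / (1 - t) on t < 1. Minimize e^{-ta} M_X(t) over t in (0, 1)
# and compare with the exact tail P(X >= a) = e^{-a}. Numbers are illustrative.
import numpy as np
from scipy.optimize import minimize_scalar

a = 5.0
log_bound = lambda t: -t * a - np.log(1.0 - t)      # log of e^{-ta} M_X(t)

res = minimize_scalar(log_bound, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("optimal t        :", res.x)                  # analytic optimum is 1 - 1/a = 0.8
print("Chernoff bound   :", np.exp(res.fun))        # a * e^{-(a-1)} ~ 0.092
print("exact tail e^{-a}:", np.exp(-a))             # ~ 0.0067; the bound holds but is loose
```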

MGF Reference Table

The MGF of standard distributions, with domain of finiteness. Here $t \in \mathbb{R}$ unless restricted.

Distribution | $M_X(t)$ | Domain (finite)
--- | --- | ---
Bernoulli($p$) | $1 - p + p e^t$ | all $t$
Binomial($n, p$) | $(1 - p + p e^t)^n$ | all $t$
Poisson($\lambda$) | $\exp(\lambda (e^t - 1))$ | all $t$
Geometric($p$), support $\{1, 2, \ldots\}$ | $p e^t / (1 - (1-p) e^t)$ | $t < -\log(1-p)$
Exponential($\lambda$), rate $\lambda$ | $\lambda / (\lambda - t)$ | $t < \lambda$
Gamma($\alpha, \beta$), rate $\beta$ | $(\beta / (\beta - t))^\alpha$ | $t < \beta$
Chi-squared($k$) | $(1 - 2t)^{-k/2}$ | $t < 1/2$
Normal($\mu, \sigma^2$) | $\exp(\mu t + \sigma^2 t^2 / 2)$ | all $t$
Uniform($a, b$) | $(e^{tb} - e^{ta}) / (t(b - a))$ for $t \neq 0$, $1$ at $t = 0$ | all $t$

Bernoulli, Binomial, Poisson, and Normal have MGFs finite on all of $\mathbb{R}$. Geometric, Exponential, Gamma, and Chi-squared are only finite on a left half-line. Heavy-tailed distributions (Cauchy, Pareto with small shape, lognormal) have $M_X(t) = \infty$ for all $t > 0$.
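Two of the table rows can be spot-checked by summing the defining series directly (a sketch; the parameters are arbitrary and the series are truncated):

```python
# Spot-check of two table rows: sum the defining series for E[e^{tX}]
# and compare with the closed forms above. Illustrative parameters.
import numpy as np
from scipy import stats

# Poisson(lambda = 3), any t
lam, t = 3.0, 0.5
k = np.arange(0, 200)
series = np.sum(np.exp(t * k) * stats.poisson.pmf(k, lam))
print(series, np.exp(lam * (np.exp(t) - 1.0)))                 # both ~ 7.00

# Geometric(p = 0.4) on {1, 2, ...}, requires t < -log(1 - p) ~ 0.51
p, t = 0.4, 0.2
k = np.arange(1, 2000)
series = np.sum(np.exp(t * k) * p * (1 - p) ** (k - 1))
print(series, p * np.exp(t) / (1.0 - (1.0 - p) * np.exp(t)))   # both ~ 1.83
```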

Cumulants and Cumulant Generating Function

Definition

Cumulant Generating Function

The cumulant generating function is $K_X(t) = \log M_X(t)$, defined on the open set where $M_X$ is finite. The $n$-th cumulant is $\kappa_n = K_X^{(n)}(0)$.

The low-order cumulants recover familiar summaries:

$$\kappa_1 = \mathbb{E}[X], \quad \kappa_2 = \mathrm{Var}(X), \quad \kappa_3 = \mathbb{E}[(X - \mu)^3], \quad \kappa_4 = \mathbb{E}[(X - \mu)^4] - 3\sigma^4.$$

Skewness is $\kappa_3 / \kappa_2^{3/2}$ and excess kurtosis is $\kappa_4 / \kappa_2^2$. A Gaussian has $\kappa_n = 0$ for all $n \geq 3$, so nonzero higher cumulants measure non-Gaussianity.

The defining property: cumulants add under independent sums. If $X$ and $Y$ are independent, then $K_{X+Y}(t) = K_X(t) + K_Y(t)$, hence $\kappa_n(X + Y) = \kappa_n(X) + \kappa_n(Y)$ for every $n$. This is cleaner than the moment rule, which requires binomial sums. Cumulant additivity is exploited in independent component analysis (ICA), where sources are recovered by maximizing the absolute fourth cumulant of projections (an explicit non-Gaussianity measure) of the observed mixture.
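A small symbolic sketch (SymPy; the Poisson example is illustrative): differentiate the CGF at zero and read off the cumulants. For Poisson($\lambda$), every cumulant equals $\lambda$.

```python
# Illustrative sketch: read cumulants off derivatives of the CGF K(t) = log M(t)
# at t = 0. For Poisson(lambda), K(t) = lambda * (e^t - 1), so every cumulant
# equals lambda.
import sympy as sp

t, lam = sp.symbols("t lambda", positive=True)
K = lam * (sp.exp(t) - 1)     # CGF of Poisson(lam), i.e. log of exp(lambda (e^t - 1))

cumulants = [sp.simplify(sp.diff(K, t, n).subs(t, 0)) for n in range(1, 5)]
print(cumulants)              # [lambda, lambda, lambda, lambda]
```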

Multivariate MGF

Definition

Multivariate MGF

For a random vector $X \in \mathbb{R}^d$, the multivariate moment generating function is:

$$M_X(t) = \mathbb{E}[\exp(t^\top X)], \quad t \in \mathbb{R}^d,$$

on the set where the expectation is finite.

Key facts parallel the scalar case. If $M_X(t)$ is finite in an open neighborhood of the origin in $\mathbb{R}^d$, it determines the joint distribution of $X$ uniquely. Mixed moments are recovered by partial derivatives: $\mathbb{E}[X_1^{k_1} \cdots X_d^{k_d}] = \partial^{k_1 + \cdots + k_d} M_X / \partial t_1^{k_1} \cdots \partial t_d^{k_d}$ evaluated at $t = 0$. For a multivariate Gaussian $X \sim N(\mu, \Sigma)$: $M_X(t) = \exp(t^\top \mu + t^\top \Sigma t / 2)$, valid for all $t \in \mathbb{R}^d$. If all linear combinations $a^\top X$ are Gaussian, then $X$ is jointly Gaussian; this criterion is often easiest to check via the scalar MGF of $a^\top X$.

When the multivariate MGF fails to exist, the multivariate characteristic function $\varphi_X(t) = \mathbb{E}[\exp(i t^\top X)]$ always does and plays the same role.
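A rough Monte Carlo check of the multivariate Gaussian formula in $d = 2$ (the mean, covariance, and $t$ below are arbitrary illustrations):

```python
# Rough Monte Carlo check of the multivariate Gaussian MGF
# M_X(t) = exp(t' mu + t' Sigma t / 2) in d = 2 dimensions.
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.5, -1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
t = np.array([0.4, 0.2])

X = rng.multivariate_normal(mu, Sigma, size=500_000)
empirical = np.mean(np.exp(X @ t))
closed = np.exp(t @ mu + 0.5 * t @ Sigma @ t)
print(empirical, closed)      # both ~ 1.15, agreeing to a couple of decimals
```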

Connection to Concentration

MGFs power two adjacent machines. The Chernoff bound for $P(X \geq a)$ is obtained by applying Markov's inequality to $e^{tX}$ and optimizing over $t > 0$, so every Chernoff bound is an MGF computation. Sub-Gaussian random variables are defined by the MGF inequality $M_{X - \mathbb{E}X}(t) \leq \exp(\sigma^2 t^2 / 2)$ for all $t$, and sub-exponential variables by a similar bound on a strip around 0. Concentration for bounded, Lipschitz, and light-tailed functionals reduces to bounding the CGF $K_X$ near 0.
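To see the pipeline end to end, the sketch below (illustrative parameters) compares the simulated upper tail of a Bernoulli(1/2) sample mean with the Hoeffding bound $\exp(-2n\varepsilon^2)$ that the sub-Gaussian MGF argument produces for variables bounded in $[0, 1]$.

```python
# Small illustration of the Chernoff/Hoeffding pipeline: for the mean of n
# Bernoulli(1/2) variables, the sub-Gaussian MGF bound gives
# P(mean - 1/2 >= eps) <= exp(-2 n eps^2). Compare against simulation.
import numpy as np

rng = np.random.default_rng(2)
n, eps, reps = 200, 0.1, 200_000

means = rng.binomial(n, 0.5, size=reps) / n
empirical = np.mean(means - 0.5 >= eps)
hoeffding = np.exp(-2 * n * eps**2)

print("empirical tail :", empirical)     # ~ 0.003 in this setup
print("Hoeffding bound:", hoeffding)     # exp(-4) ~ 0.018, a valid upper bound
```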

Common Confusions

Watch Out

Moments existing does not imply MGF exists

A distribution can have all moments finite yet have no MGF. The lognormal distribution has $\mathbb{E}[X^k] < \infty$ for all $k$ but $M_X(t) = \infty$ for all $t > 0$. The MGF is a stronger condition than having all moments.
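The lognormal case can be seen numerically (a sketch using SciPy): every moment matches the finite closed form $e^{k^2/2}$, while the integrand of $\mathbb{E}[e^{tX}]$ grows without bound for any $t > 0$, so that integral is infinite.

```python
# Illustration: for the standard lognormal X = e^Z with Z ~ N(0, 1), every
# moment E[X^k] = exp(k^2 / 2) is finite, yet the integrand e^{tx} f(x) of
# E[e^{tX}] blows up for t > 0, so the MGF is infinite there.
import numpy as np
from scipy import stats

X = stats.lognorm(s=1.0)                  # standard lognormal

for k in range(1, 5):                     # moments: all finite
    print(k, X.moment(k), np.exp(k**2 / 2))

t = 0.5
for x in (10.0, 50.0, 100.0, 200.0):      # integrand of E[e^{tX}] grows without bound
    print(x, np.exp(t * x) * X.pdf(x))
```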

Watch Out

MGF vs characteristic function vs cumulant generating function

The MGF is $M_X(t) = \mathbb{E}[e^{tX}]$. The characteristic function is $\varphi_X(t) = \mathbb{E}[e^{itX}]$, which always exists. The cumulant generating function (CGF) is $K_X(t) = \log M_X(t)$. The CGF is additive for independent sums. In concentration inequality proofs, you work with the CGF.

Exercises

Exercise (Core)

Problem

Compute the MGF of a Bernoulli($p$) random variable. Use it to find $\mathbb{E}[X]$ and $\mathbb{E}[X^2]$.

Exercise (Advanced)

Problem

Let $X_1, \ldots, X_n$ be i.i.d. $N(0, 1)$. Use MGFs to prove that $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ is distributed as $N(0, 1/n)$.

References

Canonical:

  • Casella & Berger, Statistical Inference (2002), Chapter 2.3
  • Billingsley, Probability and Measure (1995), Section 21

Current:

  • Wainwright, High-Dimensional Statistics (2019), Chapters 2-3 (MGFs in concentration)
  • Vershynin, High-Dimensional Probability (2018), Chapter 2 (sub-Gaussian MGF condition)

Last reviewed: April 26, 2026
