
Moment Generating Functions

Moment generating functions encode moments, control light-tailed behavior, and power Chernoff bounds, sub-Gaussian estimates, and exponential-family theory.


Why This Matters

The moment generating function is the main workhorse connecting light-tailed probability distributions to concentration inequalities. When the MGF exists in a neighborhood of zero (sub-Gaussian, sub-exponential, bounded), it powers the Chernoff method, exponential tilting, and the sub-Gaussian / sub-exponential machinery. When it does not exist (Cauchy, heavy-tailed power laws), the characteristic function is the strictly more general tool, and concentration results require different machinery (Chebyshev, truncation, Nemirovski-style norms).

Overview infographic (five panels): the definition of the MGF $M_X(t) = \mathbb{E}[e^{tX}]$, why it generates all moments via differentiation at zero, examples for the Gaussian, exponential, and Poisson distributions, the uniqueness theorem connecting the MGF to the distribution, and why Chernoff bounds use the MGF (exponential Markov yields exponential tail decay).
The moment generating function is a single function whose derivatives at zero recover every moment of a distribution. Its log is the cumulant generating function.

The Chernoff method is: apply Markov's inequality to $e^{tX}$ and optimize over $t$. This is an MGF computation.

A random variable $X$ is sub-Gaussian with parameter $\sigma$ if and only if the MGF of the centered variable satisfies $\mathbb{E}\!\left[e^{t(X - \mathbb{E}[X])}\right] \leq e^{\sigma^2 t^2 / 2}$ for all $t \in \mathbb{R}$. The simpler form $M_X(t) \leq e^{\sigma^2 t^2 / 2}$ holds only when $\mathbb{E}[X] = 0$. Within the sub-Gaussian world, the entire concentration story reduces to bounding this centered MGF.
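As a quick sanity check of this condition (an illustrative sketch, not part of the formal development; it assumes NumPy), a Rademacher variable is mean-zero and bounded, its exact MGF is $\cosh(t)$, and that MGF stays below the sub-Gaussian envelope $e^{t^2/2}$ with $\sigma = 1$:

```python
# Illustrative check: the Rademacher MGF cosh(t) never exceeds the
# sub-Gaussian envelope exp(t^2 / 2) (sigma = 1).
import numpy as np

t = np.linspace(-5, 5, 201)
mgf = np.cosh(t)               # exact MGF of X = +1/-1 with probability 1/2 each
envelope = np.exp(t**2 / 2)    # sub-Gaussian bound with sigma = 1

assert np.all(mgf <= envelope + 1e-12)
print("max ratio cosh(t) / exp(t^2/2):", np.max(mgf / envelope))  # 1.0, attained at t = 0
```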

Exponential families are distributions whose density has the form $\exp(\theta^\top T(x) - A(\theta))$, so they inherit the algebraic properties of the exponential function. The function $A(\theta)$ is the log-MGF of the sufficient statistic.

Want the moving picture first? The Chernoff / MGF Tilt Lab lets you drag the threshold and tilt parameter yourself, so you can see when the MGF gives a real certificate and when heavy tails break the method.

Quick Version

Object | Plain meaning | Why it matters
--- | --- | ---
$M_X(t) = \mathbb{E}[e^{tX}]$ | Exponential average of $X$ | Encodes moments and tail sensitivity
$K_X(t) = \log M_X(t)$ | Log-MGF / cumulant generator | Adds cleanly under independent sums
Chernoff method | Apply Markov to $e^{tX}$ | Turns MGF control into tail bounds
Failure mode | MGF infinite for $t > 0$ | Heavy tails need different tools

The MGF is not just "a fancy way to store moments." It is the object that lets light-tailed distributions talk to concentration inequalities.

Visual Intuition

Interactive figure: Chernoff Tilt. The upper-tail bound becomes tight when you choose the slope that best tilts the log-MGF; the best exponential upper-tail bound comes from the tilt that minimizes the tilted intercept. Blue: log-MGF; amber: tilted line; green: optimizer and gap.

Read the picture this way:

  • the blue curve is the log-MGF / cumulant generator
  • the amber line is the exponential tilt you chose
  • the vertical gap is the rate that powers the upper-tail bound

This is the geometric core of the Chernoff method. If the log-MGF is finite and well behaved, you can tilt and optimize. If the MGF blows up, the whole certificate disappears.
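For a concrete instance of tilt-and-optimize, here is a short numerical sketch (assumptions: a standard normal, so $K(t) = t^2/2$, and an arbitrary threshold $a = 1.4$). The maximized gap $\sup_{t > 0}(ta - K(t))$ should come out to $a^2/2$ with optimizer $t^\ast = a$.

```python
# A small sketch of the tilt-and-optimize step for a standard normal, where
# the log-MGF is K(t) = t^2 / 2. The Chernoff rate is sup_t [t*a - K(t)],
# which should equal a^2 / 2 with optimizer t* = a. Illustrative only.
import numpy as np

def K(t):
    return t**2 / 2.0            # log-MGF of N(0, 1)

a = 1.4                          # tail threshold
t_grid = np.linspace(0, 10, 100001)
gap = t_grid * a - K(t_grid)     # vertical gap between the tilted line and K

t_star = t_grid[np.argmax(gap)]
print("numeric optimizer t* =", t_star, "(analytic: a =", a, ")")
print("numeric rate         =", gap.max(), "(analytic: a^2/2 =", a**2 / 2, ")")
print("Chernoff bound P(X >= a) <= exp(-rate) =", np.exp(-gap.max()))
```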

Core Definitions

Definition

Moment Generating Function

The moment generating function of a random variable $X$ is:

$$M_X(t) = \mathbb{E}[e^{tX}]$$

defined for all $t \in \mathbb{R}$ where this expectation is finite. The MGF may not exist for all $t$; when it exists in an open interval around 0, it determines the distribution uniquely.

Extracting Moments

The $k$-th moment of $X$ is the $k$-th derivative of $M_X$ at zero:

$$\mathbb{E}[X^k] = M_X^{(k)}(0)$$

This follows from differentiating under the expectation:

$$M_X^{(k)}(t) = \mathbb{E}[X^k e^{tX}]$$

and evaluating at $t = 0$. The interchange of differentiation and expectation is valid when $M_X$ exists in a neighborhood of $t$.

In particular: $\mathbb{E}[X] = M_X'(0)$ and $\mathbb{E}[X^2] = M_X''(0)$.
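A small symbolic sketch of this recipe (using SymPy; the Exponential($\lambda$) example is only an illustration): differentiate the MGF at zero and read off the moments $k!/\lambda^k$.

```python
# Hedged sketch: recover moments of an Exponential(rate = lam) variable by
# differentiating its MGF lam / (lam - t) at t = 0, using SymPy.
import sympy as sp

t, lam = sp.symbols("t lambda", positive=True)
M = lam / (lam - t)     # MGF of Exponential(lam), valid for t < lam

moments = [sp.simplify(sp.diff(M, t, k).subs(t, 0)) for k in range(1, 5)]
print(moments)          # [1/lambda, 2/lambda**2, 6/lambda**3, 24/lambda**4]
# Matches E[X^k] = k! / lambda^k for the exponential distribution.
```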

Main Theorems

Theorem

MGF Uniqueness Theorem

Statement

If two random variables $X$ and $Y$ have moment generating functions $M_X(t)$ and $M_Y(t)$ that are finite and equal for all $t$ in some open interval $(-\delta, \delta)$ with $\delta > 0$, then $X$ and $Y$ have the same distribution.

Intuition

The MGF encodes the entire distribution, not just the moments. If two distributions agree on their MGFs in a neighborhood of zero, they must be the same distribution. This is stronger than moment matching: there exist distinct distributions with identical moments of all orders, but they cannot have identical MGFs in a neighborhood of zero.

Proof Sketch

The MGF is related to the characteristic function $\varphi_X(t) = \mathbb{E}[e^{itX}]$ by analytic continuation. If $M_X(t)$ is finite on $(-\delta, \delta)$, the characteristic function extends analytically to a strip in the complex plane. By the uniqueness theorem for characteristic functions (Lévy inversion), the distribution is determined.

Why It Matters

This theorem justifies the "MGF technique" for identifying distributions. If you compute the MGF of a sum of independent Gaussians and recognize it as the MGF of another Gaussian, you can conclude the sum is Gaussian. This approach is cleaner than convolution arguments.
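As a sketch of this technique (the symbols and parameterization below are purely illustrative), one can check on the log scale that adding the CGFs of two independent Gaussians gives exactly the CGF of $N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$:

```python
# Sketch of the "MGF technique" for sums: for independent Gaussians the MGFs
# multiply, so their log-MGFs (CGFs) add. Checking on the log scale reduces
# the claim to a polynomial identity. Symbols are illustrative.
import sympy as sp

t, m1, m2 = sp.symbols("t mu1 mu2", real=True)
s1, s2 = sp.symbols("sigma1 sigma2", positive=True)

def gaussian_cgf(mu, var):
    # log-MGF of N(mu, var): mu*t + var*t^2/2
    return mu * t + var * t**2 / 2

sum_cgf = gaussian_cgf(m1, s1**2) + gaussian_cgf(m2, s2**2)   # CGFs add for independent sums
target = gaussian_cgf(m1 + m2, s1**2 + s2**2)                 # CGF of N(mu1+mu2, sigma1^2+sigma2^2)
print(sp.simplify(sum_cgf - target))                          # 0, so the sum is Gaussian
```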

Failure Mode

The MGF must exist in a neighborhood of 0, not just at 0 (where it always equals 1). Heavy-tailed distributions like the Cauchy distribution have no MGF. For those distributions, use the characteristic function instead, which always exists.

Proposition

MGF of Independent Sum

Statement

If $X$ and $Y$ are independent random variables whose MGFs exist, then:

$$M_{X+Y}(t) = M_X(t) \cdot M_Y(t)$$

Intuition

Independence means $\mathbb{E}[g(X)h(Y)] = \mathbb{E}[g(X)]\,\mathbb{E}[h(Y)]$. Apply this with $g(X) = e^{tX}$ and $h(Y) = e^{tY}$.

Proof Sketch

$M_{X+Y}(t) = \mathbb{E}[e^{t(X+Y)}] = \mathbb{E}[e^{tX} e^{tY}] = \mathbb{E}[e^{tX}]\,\mathbb{E}[e^{tY}] = M_X(t) M_Y(t)$, where the third equality uses independence.

Why It Matters

This is why MGFs are the natural tool for sums of independent variables. Addition of random variables corresponds to multiplication of MGFs. Taking logs: the cumulant generating function $\log M_X(t)$ is additive for independent sums.

Failure Mode

Fails without independence. For dependent variables, $\mathbb{E}[e^{tX}e^{tY}] \neq \mathbb{E}[e^{tX}]\,\mathbb{E}[e^{tY}]$ in general.
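A rough Monte Carlo check of the product rule (the distributions, parameters, and sample size below are arbitrary illustrations): simulate independent Poisson and Exponential variables and compare the empirical MGF of the sum with the product of the two closed forms.

```python
# Rough Monte Carlo check of M_{X+Y}(t) = M_X(t) M_Y(t) for independent
# X ~ Poisson(2) and Y ~ Exponential(rate 1), at a t where both MGFs are finite.
import numpy as np

rng = np.random.default_rng(0)
n, t = 1_000_000, 0.3

X = rng.poisson(lam=2.0, size=n)
Y = rng.exponential(scale=1.0, size=n)     # rate 1 corresponds to scale 1

empirical = np.mean(np.exp(t * (X + Y)))
closed_form = np.exp(2.0 * (np.exp(t) - 1.0)) * (1.0 / (1.0 - t))  # Poisson MGF * Exponential MGF

print(empirical, closed_form)              # both ~ 2.88
```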

Canonical Examples

Example

MGF of a Gaussian

Let $X \sim N(\mu, \sigma^2)$. Then:

$$M_X(t) = \mathbb{E}[e^{tX}] = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right)$$

This exists for all $t \in \mathbb{R}$. Setting $\mu = 0$: $M_X(t) = \exp(\sigma^2 t^2 / 2)$, which is the defining condition for sub-Gaussian random variables with parameter $\sigma$.
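A quick numerical confirmation of this closed form (illustrative parameters; assumes SciPy): integrate $e^{tx}$ against the Gaussian density and compare.

```python
# Illustrative check of the Gaussian MGF: integrate e^{tx} against the
# N(mu, sigma^2) density and compare with exp(mu*t + sigma^2 t^2 / 2).
import numpy as np
from scipy import integrate, stats

mu, sigma, t = 1.0, 2.0, 0.7
integrand = lambda x: np.exp(t * x) * stats.norm.pdf(x, loc=mu, scale=sigma)

numeric, _ = integrate.quad(integrand, -np.inf, np.inf)
closed = np.exp(mu * t + sigma**2 * t**2 / 2)
print(numeric, closed)    # both ~ 5.37
```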

Example

The Chernoff method in one line

For any $t > 0$: $P(X \geq a) = P(e^{tX} \geq e^{ta}) \leq e^{-ta} M_X(t)$. The first step is monotonicity of $\exp$; the second is Markov's inequality. Optimize over $t > 0$ to get the tightest bound. This is the entire Chernoff method.
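The same one-liner in code, as a sketch for $X \sim \mathrm{Exponential}(1)$ with $M_X(t) = 1/(1-t)$ on $t < 1$ (the threshold $a = 5$ is arbitrary). The optimizer should sit at $t^\ast = 1 - 1/a$, and the resulting bound is valid though not tight.

```python
# Sketch of the Chernoff method in code, for X ~ Exponential(1) with MGF
# M_X(t) = 1 / (1 - t) on t < 1. Minimize e^{-ta} M_X(t) over t in (0, 1)
# and compare with the exact tail P(X >= a) = e^{-a}. Numbers are illustrative.
import numpy as np
from scipy.optimize import minimize_scalar

a = 5.0
log_bound = lambda t: -t * a - np.log(1.0 - t)      # log of e^{-ta} M_X(t)

res = minimize_scalar(log_bound, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("optimal t        :", res.x)                  # analytic optimum is 1 - 1/a = 0.8
print("Chernoff bound   :", np.exp(res.fun))        # a * e^{-(a-1)} ~ 0.092
print("exact tail e^{-a}:", np.exp(-a))             # ~ 0.0067; the bound holds but is loose
```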

MGF Reference Table

The MGF of standard distributions, with domain of finiteness. Here $t \in \mathbb{R}$ unless restricted.

Distribution | $M_X(t)$ | Domain (finite)
--- | --- | ---
Bernoulli($p$) | $1 - p + p e^t$ | all $t$
Binomial($n, p$) | $(1 - p + p e^t)^n$ | all $t$
Poisson($\lambda$) | $\exp(\lambda (e^t - 1))$ | all $t$
Geometric($p$), support $\{1, 2, \ldots\}$ | $p e^t / (1 - (1-p) e^t)$ | $t < -\log(1-p)$
Exponential($\lambda$), rate $\lambda$ | $\lambda / (\lambda - t)$ | $t < \lambda$
Gamma($\alpha, \beta$), rate $\beta$ | $(\beta / (\beta - t))^\alpha$ | $t < \beta$
Chi-squared($k$) | $(1 - 2t)^{-k/2}$ | $t < 1/2$
Normal($\mu, \sigma^2$) | $\exp(\mu t + \sigma^2 t^2 / 2)$ | all $t$
Uniform($a, b$) | $(e^{tb} - e^{ta}) / (t(b - a))$ for $t \neq 0$, $1$ at $t = 0$ | all $t$

Bernoulli, Binomial, Poisson, and Normal have MGFs finite on all of $\mathbb{R}$. Geometric, Exponential, Gamma, and Chi-squared are only finite on a left half-line. Heavy-tailed distributions (Cauchy, Pareto with small shape, lognormal) have $M_X(t) = \infty$ for all $t > 0$.
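Two of the table rows can be spot-checked by summing the defining series directly (a sketch; the parameters are arbitrary and the series are truncated):

```python
# Spot-check of two table rows: sum the defining series for E[e^{tX}]
# and compare with the closed forms above. Illustrative parameters.
import numpy as np
from scipy import stats

# Poisson(lambda = 3), any t
lam, t = 3.0, 0.5
k = np.arange(0, 200)
series = np.sum(np.exp(t * k) * stats.poisson.pmf(k, lam))
print(series, np.exp(lam * (np.exp(t) - 1.0)))                 # both ~ 7.00

# Geometric(p = 0.4) on {1, 2, ...}, requires t < -log(1 - p) ~ 0.51
p, t = 0.4, 0.2
k = np.arange(1, 2000)
series = np.sum(np.exp(t * k) * p * (1 - p) ** (k - 1))
print(series, p * np.exp(t) / (1.0 - (1.0 - p) * np.exp(t)))   # both ~ 1.83
```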

Cumulants and Cumulant Generating Function

Definition

Cumulant Generating Function

The cumulant generating function is $K_X(t) = \log M_X(t)$, defined on the open set where $M_X$ is finite. The $n$-th cumulant is $\kappa_n = K_X^{(n)}(0)$.

The low-order cumulants recover familiar summaries:

$$\kappa_1 = \mathbb{E}[X], \quad \kappa_2 = \mathrm{Var}(X), \quad \kappa_3 = \mathbb{E}[(X - \mu)^3], \quad \kappa_4 = \mathbb{E}[(X - \mu)^4] - 3\sigma^4.$$

Skewness is $\kappa_3 / \kappa_2^{3/2}$ and excess kurtosis is $\kappa_4 / \kappa_2^2$. A Gaussian has $\kappa_n = 0$ for all $n \geq 3$, so nonzero higher cumulants measure non-Gaussianity.

The defining property: cumulants add under independent sums. If $X$ and $Y$ are independent, then $K_{X+Y}(t) = K_X(t) + K_Y(t)$, hence $\kappa_n(X + Y) = \kappa_n(X) + \kappa_n(Y)$ for every $n$. This is cleaner than the moment rule, which requires binomial sums. Cumulant additivity is exploited in independent component analysis (ICA), where sources are recovered by maximizing the absolute fourth cumulant of projections (an explicit non-Gaussianity measure) of the observed mixture.
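A small symbolic sketch (SymPy; the Poisson example is illustrative): differentiate the CGF at zero and read off the cumulants. For Poisson($\lambda$), every cumulant equals $\lambda$.

```python
# Illustrative sketch: read cumulants off derivatives of the CGF K(t) = log M(t)
# at t = 0. For Poisson(lambda), K(t) = lambda * (e^t - 1), so every cumulant
# equals lambda.
import sympy as sp

t, lam = sp.symbols("t lambda", positive=True)
K = lam * (sp.exp(t) - 1)     # CGF of Poisson(lam), i.e. log of exp(lambda (e^t - 1))

cumulants = [sp.simplify(sp.diff(K, t, n).subs(t, 0)) for n in range(1, 5)]
print(cumulants)              # [lambda, lambda, lambda, lambda]
```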

Multivariate MGF

Definition

Multivariate MGF

For a random vector $X \in \mathbb{R}^d$, the multivariate moment generating function is:

$$M_X(t) = \mathbb{E}[\exp(t^\top X)], \quad t \in \mathbb{R}^d,$$

on the set where the expectation is finite.

Key facts parallel the scalar case. If $M_X(t)$ is finite in an open neighborhood of the origin in $\mathbb{R}^d$, it determines the joint distribution of $X$ uniquely. Mixed moments are recovered by partial derivatives: $\mathbb{E}[X_1^{k_1} \cdots X_d^{k_d}] = \partial^{k_1 + \cdots + k_d} M_X / \partial t_1^{k_1} \cdots \partial t_d^{k_d}$ evaluated at $t = 0$. For a multivariate Gaussian $X \sim N(\mu, \Sigma)$: $M_X(t) = \exp(t^\top \mu + t^\top \Sigma t / 2)$, valid for all $t \in \mathbb{R}^d$. If all linear combinations $a^\top X$ are Gaussian, then $X$ is jointly Gaussian; this criterion is often easiest to check via the scalar MGF of $a^\top X$.

When the multivariate MGF fails to exist, the multivariate characteristic function $\varphi_X(t) = \mathbb{E}[\exp(i t^\top X)]$ always does and plays the same role.
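A rough Monte Carlo check of the multivariate Gaussian formula in $d = 2$ (the mean, covariance, and $t$ below are arbitrary illustrations):

```python
# Rough Monte Carlo check of the multivariate Gaussian MGF
# M_X(t) = exp(t' mu + t' Sigma t / 2) in d = 2 dimensions.
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.5, -1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
t = np.array([0.4, 0.2])

X = rng.multivariate_normal(mu, Sigma, size=500_000)
empirical = np.mean(np.exp(X @ t))
closed = np.exp(t @ mu + 0.5 * t @ Sigma @ t)
print(empirical, closed)      # both ~ 1.15, agreeing to a couple of decimals
```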

Connection to Concentration

MGFs power two adjacent machines. The Chernoff bound for $P(X \geq a)$ is obtained by applying Markov's inequality to $e^{tX}$ and optimizing over $t > 0$, so every Chernoff bound is an MGF computation. Sub-Gaussian random variables are defined by the MGF inequality $M_{X - \mathbb{E}X}(t) \leq \exp(\sigma^2 t^2 / 2)$ for all $t$, and sub-exponential variables by a similar bound on a strip around 0. Concentration for bounded, Lipschitz, and light-tailed functionals reduces to bounding the CGF $K_X$ near 0.
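To see the pipeline end to end, the sketch below (illustrative parameters) compares the simulated upper tail of a Bernoulli(1/2) sample mean with the Hoeffding bound $\exp(-2n\varepsilon^2)$ that the sub-Gaussian MGF argument produces for variables bounded in $[0, 1]$.

```python
# Small illustration of the Chernoff/Hoeffding pipeline: for the mean of n
# Bernoulli(1/2) variables, the sub-Gaussian MGF bound gives
# P(mean - 1/2 >= eps) <= exp(-2 n eps^2). Compare against simulation.
import numpy as np

rng = np.random.default_rng(2)
n, eps, reps = 200, 0.1, 200_000

means = rng.binomial(n, 0.5, size=reps) / n
empirical = np.mean(means - 0.5 >= eps)
hoeffding = np.exp(-2 * n * eps**2)

print("empirical tail :", empirical)     # ~ 0.003 in this setup
print("Hoeffding bound:", hoeffding)     # exp(-4) ~ 0.018, a valid upper bound
```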

Common Confusions

Watch Out

Moments existing does not imply MGF exists

A distribution can have all moments finite yet have no MGF. The lognormal distribution has $\mathbb{E}[X^k] < \infty$ for all $k$ but $M_X(t) = \infty$ for all $t > 0$. The MGF is a stronger condition than having all moments.
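The lognormal case can be seen numerically (a sketch using SciPy): every moment matches the finite closed form $e^{k^2/2}$, while the integrand of $\mathbb{E}[e^{tX}]$ grows without bound for any $t > 0$, so that integral is infinite.

```python
# Illustration: for the standard lognormal X = e^Z with Z ~ N(0, 1), every
# moment E[X^k] = exp(k^2 / 2) is finite, yet the integrand e^{tx} f(x) of
# E[e^{tX}] blows up for t > 0, so the MGF is infinite there.
import numpy as np
from scipy import stats

X = stats.lognorm(s=1.0)                  # standard lognormal

for k in range(1, 5):                     # moments: all finite
    print(k, X.moment(k), np.exp(k**2 / 2))

t = 0.5
for x in (10.0, 50.0, 100.0, 200.0):      # integrand of E[e^{tX}] grows without bound
    print(x, np.exp(t * x) * X.pdf(x))
```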

Watch Out

MGF vs characteristic function vs cumulant generating function

The MGF is $M_X(t) = \mathbb{E}[e^{tX}]$. The characteristic function is $\varphi_X(t) = \mathbb{E}[e^{itX}]$, which always exists. The cumulant generating function (CGF) is $K_X(t) = \log M_X(t)$. The CGF is additive for independent sums. In concentration inequality proofs, you work with the CGF.

Exercises

Exercise (Core)

Problem

Compute the MGF of a Bernoulli($p$) random variable. Use it to find $\mathbb{E}[X]$ and $\mathbb{E}[X^2]$.

Exercise (Advanced)

Problem

Let $X_1, \ldots, X_n$ be i.i.d. $N(0, 1)$. Use MGFs to prove that $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ is distributed as $N(0, 1/n)$.

References

Canonical:

  • Casella & Berger, Statistical Inference (2002), Chapter 2.3
  • Billingsley, Probability and Measure (1995), Section 21

Current:

  • Wainwright, High-Dimensional Statistics (2019), Chapters 2-3 (MGFs in concentration)
  • Vershynin, High-Dimensional Probability (2018), Chapter 2 (sub-Gaussian MGF condition)

Last reviewed: April 26, 2026
