
Foundations

Joint, Marginal, and Conditional Distributions

Joint distributions, marginalization, conditional distributions, independence, Bayes theorem, and the chain rule of probability.

Core · Tier 1 · Stable · Supporting · ~40 min

Why This Matters

Probability gets interesting the moment one variable is no longer enough. You do not just care about a label Y or a feature vector X by itself. You care about how they move together, what remains if you forget one variable, and what changes once one variable is revealed.

That is the whole joint / marginal / conditional story:

  • the joint tells you the full relationship
  • the marginal forgets one coordinate
  • the conditional freezes one coordinate and looks at the other

In ML, these are everywhere. A classifier models P(Y \mid X). A generative model learns P(X) or P(X, Y). Bayesian inference starts from a joint model and turns it into a posterior with Bayes theorem.

Quick Version

Object                       | Plain meaning                              | Typical ML question
Joint distribution p(x, y)   | The full map of how two variables co-occur | What feature-label pairs appear together?
Marginal p_X(x)              | What remains after forgetting Y            | How often do these inputs appear overall?
Conditional p(y \mid x)      | Distribution of Y once X is fixed          | Given this input, what label is likely?
Independence                 | Knowing X changes nothing about Y          | Is one variable informative about the other at all?

Want to build those mechanics by hand before reading the formal rules? Try the Probability Mechanics Lab, then come back here and read the definitions as compressed versions of the same picture.

Visual Map

Joint to conditional

One table contains the full probabilistic story

Start with the full co-occurrence table. Then forget one coordinate to get a marginal, or freeze one coordinate to get a conditional slice.

Example joint table p(x, y), with the column x = 1 highlighted:

             x = -1   x = 0   x = 1   x = 2   Marginal over y
  y = -1      0.14     0.06    0.02    0.00        0.22
  y = 0       0.08     0.12    0.06    0.02        0.28
  y = 1       0.03     0.09    0.14    0.08        0.34
  y = 2       0.00     0.03    0.07    0.06        0.16
  Marginal    0.25     0.30    0.29    0.16        1.00
  over x

Conditional slice p(y | x = 1):

  y = -1: 7%    y = 0: 21%    y = 1: 48%    y = 2: 24%
Freeze the highlighted column, then divide by its column total. That turns raw pair mass into a conditional distribution.
Joint

The full table is the joint law: p(x, y) = P(X = x, Y = y).

Marginal

Add over the forgotten variable to get p_X(x) = \sum_y p(x, y).

Conditional

Keep one column and renormalize it to get p(y \mid x) = p(x, y) / p_X(x).

The visual captures the whole story: the joint object holds all the information, marginals are what remains after summing out a variable, and the conditional slice is what remains after fixing one coordinate and renormalizing.
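The same mechanics are easy to verify numerically. Below is a minimal NumPy sketch (the array layout and variable names are my own) that reproduces the table above: marginals come from summing out an axis, and the conditional slice comes from freezing a column and renormalizing.

```python
import numpy as np

# Joint table p(x, y) from the visual above.
# Rows index y in (-1, 0, 1, 2); columns index x in (-1, 0, 1, 2).
joint = np.array([
    [0.14, 0.06, 0.02, 0.00],   # y = -1
    [0.08, 0.12, 0.06, 0.02],   # y = 0
    [0.03, 0.09, 0.14, 0.08],   # y = 1
    [0.00, 0.03, 0.07, 0.06],   # y = 2
])
assert np.isclose(joint.sum(), 1.0)          # a joint PMF sums to 1

# Marginals: sum out the variable you want to forget.
p_x = joint.sum(axis=0)                      # [0.25, 0.30, 0.29, 0.16]
p_y = joint.sum(axis=1)                      # [0.22, 0.28, 0.34, 0.16]

# Conditional slice p(y | x = 1): freeze the column, then renormalize.
col = 2                                      # index of x = 1
p_y_given_x1 = joint[:, col] / p_x[col]      # ~[0.07, 0.21, 0.48, 0.24]
print(p_y_given_x1, p_y_given_x1.sum())      # sums to 1 by construction
```

Every identity in the rest of this page can be sanity-checked against a finite table in exactly this way.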

Core Definitions

Definition

Joint Distribution

For discrete random variables X, Y: the joint PMF is p(x, y) = P(X = x, Y = y). For continuous random variables (see common probability distributions for standard families): the joint PDF f(x, y) satisfies P((X, Y) \in A) = \iint_A f(x, y) \, dx \, dy. Both must be non-negative and sum/integrate to 1.

Definition

Marginal Distribution

The marginal of X is obtained by summing or integrating out Y:

Discrete: p_X(x) = \sum_y p(x, y)

Continuous: f_X(x) = \int_{-\infty}^{\infty} f(x, y) \, dy

Marginalization discards information about Y.

Definition

Conditional Distribution

The conditional distribution of Y given X = x is:

Discrete: p(y|x) = \frac{p(x, y)}{p_X(x)} for p_X(x) > 0

Continuous: f(y|x) = \frac{f(x, y)}{f_X(x)} for f_X(x) > 0

This defines a valid distribution over Y for each fixed x.

Definition

Independence

Random variables X and Y are independent (written X \perp Y) if and only if:

p(x, y) = p_X(x) \, p_Y(y) \quad \text{for all } x, y

Equivalently, p(y|x) = p_Y(y) for all x with p_X(x) > 0: knowing X tells you nothing about Y.
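On a finite table, independence can be tested directly by comparing the joint to the outer product of its marginals. A small self-contained check, using the joint table from the visual map (array layout is my own):

```python
import numpy as np

# Joint p(x, y) from the visual above (rows: y, columns: x).
joint = np.array([[0.14, 0.06, 0.02, 0.00],
                  [0.08, 0.12, 0.06, 0.02],
                  [0.03, 0.09, 0.14, 0.08],
                  [0.00, 0.03, 0.07, 0.06]])
p_x, p_y = joint.sum(axis=0), joint.sum(axis=1)

# X independent of Y would require p(x, y) = p_X(x) p_Y(y) in every cell.
print(np.allclose(joint, np.outer(p_y, p_x)))   # False: X and Y are dependent
```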

Definition

Conditional Independence

X and Y are conditionally independent given Z (written X \perp Y \mid Z) if and only if:

p(x, y \mid z) = p(x \mid z) \, p(y \mid z) \quad \text{for } P_Z\text{-almost every } z

In the discrete / strictly-positive density setting this reduces to "for all x, y, z with p(z) > 0." For general (continuous, mixed, or singular) joint laws, conditional distributions are only defined up to P_Z-null sets, so the factorization is an almost-sure statement in z: equivalently, \mathbb{E}[g(X) h(Y) \mid Z] = \mathbb{E}[g(X) \mid Z]\,\mathbb{E}[h(Y) \mid Z] a.s. for all bounded measurable g, h, or factorization through regular conditional probabilities. Conditional independence is the foundation of graphical models and the naive Bayes assumption. For modeling complex dependence structures beyond independence, copulas separate the marginal behavior from the joint dependence.

Main Theorems

Theorem

Bayes Theorem

Statement

P(Y = y \mid X = x) = \frac{P(X = x \mid Y = y) \, P(Y = y)}{P(X = x)}

In density form:

f(y|x) = \frac{f(x|y) \, f_Y(y)}{f_X(x)}

where f_X(x) = \int f(x|y) f_Y(y) \, dy is the marginal (the "evidence").

Intuition

Bayes theorem inverts a conditional: it converts P(X|Y) (the likelihood) into P(Y|X) (the posterior) by incorporating the prior P(Y). The denominator normalizes the result.

Proof Sketch

Start from the definition: P(Y|X) = P(X,Y)/P(X) and P(X|Y) = P(X,Y)/P(Y). From the second equation, P(X,Y) = P(X|Y)P(Y). Substituting into the first gives Bayes theorem.

Why It Matters

Bayes theorem is the foundation of Bayesian inference. The prior encodes beliefs before seeing data. The likelihood connects data to parameters. The posterior combines both. In ML: Bayesian neural networks, Gaussian processes, and MAP estimation all rest on this formula.

Failure Mode

The denominator P(X = x) (or f_X(x)) is often intractable to compute because it requires integrating over all possible values of Y. This is why approximate inference (MCMC, variational inference) is needed in practice.
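When Y is discrete the evidence is a finite sum and Bayes theorem is a two-line computation. A minimal sketch, with made-up prior and likelihood numbers for a hypothetical spam/ham problem:

```python
import numpy as np

# Hypothetical two-class problem: Y in {spam, ham}, X = "message contains the word 'offer'".
prior      = np.array([0.20, 0.80])   # P(Y = spam), P(Y = ham)          (illustrative numbers)
likelihood = np.array([0.70, 0.05])   # P(X = 1 | Y = spam), P(X = 1 | Y = ham)

evidence  = (likelihood * prior).sum()        # P(X = 1): law of total probability
posterior = likelihood * prior / evidence     # P(Y | X = 1): Bayes theorem
print(posterior)                              # ~[0.778, 0.222]
```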

Chain Rule and Law of Total Probability

Chain rule of probability for n random variables:

P(X_1, X_2, \ldots, X_n) = P(X_1) \prod_{i=2}^{n} P(X_i \mid X_1, \ldots, X_{i-1})

This is not an assumption; it follows directly from the definition of conditional probability applied repeatedly.

Law of total probability (discrete version):

P(X = x) = \sum_y P(X = x \mid Y = y) \, P(Y = y)

This "averages out" the conditioning variable. It is used to compute marginals from conditionals and appears throughout mixture model derivations. These operations connect directly to computing expectations and moments.

Common Confusions

Watch Out

Marginal independence does not imply conditional independence

X and Y can be marginally independent (X \perp Y) but conditionally dependent (X \not\perp Y \mid Z). The classic example: two independent causes X, Y of a common effect Z. Once you observe Z, learning about X tells you about Y (explaining away).
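A quick simulation makes explaining away concrete. Here X and Y are independent coin flips and Z = X OR Y is their common effect (the setup and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.random(n) < 0.5          # two independent causes
y = rng.random(n) < 0.5
z = x | y                        # common effect

print(np.corrcoef(x, y)[0, 1])        # ~0: marginally independent
print(np.corrcoef(x[z], y[z])[0, 1])  # noticeably negative: dependent once Z = 1 is observed
```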

Watch Out

Conditional independence does not imply marginal independence

The converse also fails. X \perp Y \mid Z does not imply X \perp Y. Example: Z \sim \text{Bernoulli}(0.5), X = Z, Y = Z. Then X \perp Y \mid Z (both are deterministic given Z), but X and Y are perfectly dependent marginally.
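The same counterexample can be checked by enumeration rather than simulation. A minimal sketch for Z ~ Bernoulli(0.5), X = Z, Y = Z:

```python
# Given Z = z, both X and Y are constant (= z), hence trivially independent given Z.
# Marginally, p(X = 1, Y = 1) = 0.5 while p_X(1) * p_Y(1) = 0.25, so the factorization fails.
p_joint_xy = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}
p_x1 = p_y1 = 0.5
print(p_joint_xy[(1, 1)], p_x1 * p_y1)   # 0.5 vs 0.25: X and Y are marginally dependent
```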

Watch Out

Conditioning on zero-probability events

For continuous random variables, P(X = x) = 0 for any specific x. The conditional density f(y|x) is defined as a ratio of densities, not as P(Y = y | X = x). Rigorous treatment uses the Radon-Nikodym derivative or disintegration of measures.

Canonical Examples

Example

Naive Bayes classifier

Naive Bayes assumes features are conditionally independent given the class: P(X_1, \ldots, X_d \mid Y) = \prod_{i=1}^d P(X_i \mid Y). By Bayes theorem, P(Y \mid X) \propto P(Y) \prod_i P(X_i \mid Y). This reduces the number of parameters from exponential in d to linear in d. The independence assumption is almost always false, but the classifier often works well because the decision boundary can still be correct even when the probability estimates are wrong.
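A minimal Bernoulli naive Bayes posterior, under made-up toy parameters (class priors and per-feature probabilities are illustrative; log-space is used for numerical stability):

```python
import numpy as np

# Toy parameters for 2 classes and 3 binary features (illustrative values only).
log_prior = np.log(np.array([0.6, 0.4]))              # log P(Y = c)
theta = np.array([[0.8, 0.1, 0.3],                    # P(X_i = 1 | Y = 0)
                  [0.2, 0.7, 0.9]])                   # P(X_i = 1 | Y = 1)

def posterior(x):
    """P(Y | X = x) under the naive Bayes factorization."""
    # log P(Y = c) + sum_i log P(X_i = x_i | Y = c)
    log_lik = (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
    log_post = log_prior + log_lik
    log_post -= np.logaddexp.reduce(log_post)          # normalize by the log-evidence
    return np.exp(log_post)

print(posterior(np.array([1, 0, 1])))   # posterior over the two classes, sums to 1
```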

Exercises

ExerciseCore

Problem

A medical test has sensitivity P(\text{positive} \mid \text{disease}) = 0.95 and specificity P(\text{negative} \mid \text{no disease}) = 0.99. The disease prevalence is P(\text{disease}) = 0.001. Compute P(\text{disease} \mid \text{positive}).

ExerciseAdvanced

Problem

Let (X, Y) have joint density f(x, y) = 2 for 0 \leq x \leq y \leq 1 and f(x, y) = 0 otherwise. Find the marginal densities f_X(x) and f_Y(y), and the conditional density f(y|x).

References

Canonical:

  • Blitzstein & Hwang, Introduction to Probability (2019), Chapter 7.
  • Billingsley, Probability and Measure (1995), Chapter 1.
  • Durrett, Probability: Theory and Examples (2019), Chapter 1.
  • Grimmett & Stirzaker, Probability and Random Processes (2020), Chapters 1-3.
  • Casella & Berger, Statistical Inference (2002), Chapters 1-2.

For ML context:

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 2, 8.
  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 2, 10.
  • Koller & Friedman, Probabilistic Graphical Models (2009), Chapters 2-4.

Last reviewed: April 26, 2026
