
World Model Evaluation

How to measure whether a learned world model is useful: prediction accuracy, controllability (sim-to-real transfer), planning quality, and why long-horizon evaluation is hard.

Advanced · Tier 3 · Frontier · Supporting · ~40 min

Why This Matters

A world model is only useful if policies trained inside it work in the real world. High prediction accuracy on single-step forecasts does not guarantee that multi-step plans will succeed. Evaluating world models requires metrics that capture what matters for downstream decision-making, not just perceptual fidelity.

Mental Model

Think of three levels of evaluation, from weakest to strongest:

  1. Prediction accuracy: does the model predict the next observation correctly?
  2. Controllability: can a policy trained entirely in the model succeed in the real environment?
  3. Planning quality: do plans optimized in the model lead to high real-world reward?

Each level is strictly harder than the previous. A model can score well on prediction but fail on controllability.

Formal Setup

Definition

World Model

A learned world model $\hat{p}$ approximates the true environment dynamics $p(s_{t+1}, r_t \mid s_t, a_t)$. The model may predict in observation space (pixel-level), latent space (learned representation), or a mixture.

Definition

Single-Step Prediction Error

The single-step prediction error measures how well the model predicts the immediate next state:

$$\epsilon_1 = \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ d\big(\hat{p}(\cdot \mid s, a),\, p(\cdot \mid s, a)\big) \right]$$

where $d$ is a divergence measure (KL, Wasserstein, MSE in observation space, etc.) and $\mathcal{D}$ is a distribution over state-action pairs.
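As a concrete (toy) instance of this definition, here is a sketch that estimates $\epsilon_1$ with MSE as the divergence $d$, assuming a deterministic model callable `model(s, a)` that returns a predicted next state; the dynamics and bias are hypothetical:

```python
import numpy as np

def single_step_error(model, transitions):
    """Estimate eps_1 with MSE as the divergence d, averaged over
    a dataset of (s, a, s_next) transitions."""
    errors = [np.mean((model(s, a) - s_next) ** 2)
              for s, a, s_next in transitions]
    return float(np.mean(errors))

# Toy deterministic dynamics s' = s + a, and a model with a constant bias.
true_step = lambda s, a: s + a
biased_model = lambda s, a: s + a + 0.1

rng = np.random.default_rng(0)
transitions = []
for _ in range(100):
    s, a = rng.normal(size=3), rng.normal(size=3)
    transitions.append((s, a, true_step(s, a)))

print(single_step_error(biased_model, transitions))  # ~0.01 (= bias^2)
```

With a stochastic model, the per-transition MSE would be replaced by a proper divergence between predicted and true next-state distributions, as in the definition above.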

Main Theorems

Proposition

Compounding Error in Model-Based RL

Statement

Let $\hat{p}$ be a learned model with single-step total variation error at most $\epsilon$ under any policy, using the unnormalized convention $\mathrm{TV}(P,Q) = \|P - Q\|_1$. Let $\pi$ be any policy. Then the true return $V^{\pi}$ and the model-estimated return $\hat{V}^{\pi}$ satisfy:

$$|V^{\pi} - \hat{V}^{\pi}| \leq \frac{2\gamma\,\epsilon}{(1 - \gamma)^2}$$

(Kearns & Singh 2002; Kakade 2003 thesis, Lemma 2.2.2). Under the alternative convention $\mathrm{TV}(P,Q) = \tfrac{1}{2}\|P - Q\|_1$ the factor of 2 disappears. The effective horizon is $1/(1-\gamma)$; writing the bound with an explicit finite horizon $T$ while also mixing in $1/(1-\gamma)$ double-counts the horizon, so we state the discounted form only.

Intuition

Each step, the model's state distribution drifts from reality by at most $\epsilon$ in TV distance. Over the effective horizon $1/(1-\gamma)$, these errors accumulate via the triangle inequality for TV distance. The $1/(1-\gamma)^2$ dependence arises because the effective horizon contributes one factor of $1/(1-\gamma)$ and the accumulated distributional shift contributes another.

Proof Sketch

Apply the simulation lemma: the difference in value functions between two MDPs is bounded by the expected sum of TV distances between their transition distributions along the trajectory, weighted by the discounted occupancy measure. With per-step TV error $\epsilon$, telescoping over the geometric horizon gives the $2\gamma\epsilon/(1-\gamma)^2$ bound.

Why It Matters

This bound quantifies the sim-to-real gap. It shows that even small per-step errors can compound into large value function errors over long horizons. Reducing $\epsilon$ by a factor of 2 reduces the planning error by a factor of 2, but increasing the effective horizon has a quadratic worst-case effect in the discounted bound. This is why model-based RL with learned models often works best with short rollouts plus bootstrapped value estimates.
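To make the quadratic horizon dependence concrete, a quick numerical sketch of the bound with a toy per-step error (the numbers are illustrative, not from the text):

```python
# Worst-case value-error bound 2*gamma*eps/(1-gamma)^2 for fixed eps,
# showing the blow-up as the effective horizon 1/(1-gamma) grows.
def value_error_bound(eps, gamma):
    return 2 * gamma * eps / (1 - gamma) ** 2

eps = 0.02
for gamma in (0.9, 0.99, 0.999):
    horizon = 1 / (1 - gamma)
    print(f"gamma={gamma}: horizon~{horizon:.0f}, bound={value_error_bound(eps, gamma):.1f}")
```

Going from $\gamma = 0.9$ to $\gamma = 0.99$ multiplies the effective horizon by 10 but the worst-case bound by roughly 100, matching the $1/(1-\gamma)^2$ dependence.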

Failure Mode

The bound is worst-case and often loose. In practice, errors may cancel across steps (the trajectory stays in a region where the model is accurate) or the policy may be robust to small perturbations. Conversely, the bound assumes uniform ϵ\epsilon, but models are typically worse in rarely-visited states, so on-policy error can be much larger than average error.

Evaluation Metrics in Practice

Prediction Metrics

  • MSE / PSNR: for pixel-level predictions. Easy to compute but does not capture perceptual quality.
  • SSIM: structural similarity, slightly better than MSE for images.
  • FVD (Fréchet Video Distance): compares distributions of generated and real video clips using a pretrained feature extractor. The standard metric for video world models.
  • LPIPS: learned perceptual image patch similarity. Correlates better with human judgment than MSE.
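A minimal sketch of the two simplest pixel metrics, MSE and PSNR, assuming frames are float arrays in [0, 1] (SSIM, FVD, and LPIPS all require learned or windowed feature computations and are omitted here):

```python
import numpy as np

def mse(pred, target):
    """Mean squared error between two frames; lower is better."""
    return float(np.mean((pred - target) ** 2))

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    m = mse(pred, target)
    return float("inf") if m == 0 else 10 * np.log10(max_val ** 2 / m)

rng = np.random.default_rng(0)
frame = rng.random((64, 64, 3))                               # toy "real" frame
noisy = np.clip(frame + rng.normal(0, 0.05, frame.shape), 0, 1)  # toy prediction
print(f"MSE={mse(noisy, frame):.4f}  PSNR={psnr(noisy, frame):.1f} dB")
```

Note how little these metrics know about content: a small uniform blur and a shifted object position can produce similar MSE, which is exactly why decision-relevant evaluation needs more than pixel error.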

Controllability Metrics

  • Sim-to-real transfer rate: train a policy in the model, deploy in reality, measure success rate.
  • Reward correlation: does the rank ordering of policies by imagined reward match their rank ordering by real reward?
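The reward-correlation check above can be sketched as a Spearman rank correlation between imagined and real returns for a set of candidate policies; the returns here are hypothetical (plain numpy, valid when there are no ties):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks
    (double argsort turns values into 0-based ranks; assumes no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

# Hypothetical returns for 5 candidate policies.
imagined = np.array([10.0, 8.0, 6.0, 4.0, 2.0])  # returns inside the model
real     = np.array([ 9.0, 8.5, 5.0, 4.5, 1.0])  # returns in the real env
print(spearman(imagined, real))  # 1.0: the ranking is perfectly preserved
```

A rank correlation near 1 means the model is useful for policy selection even if its absolute return estimates are biased; a low or negative value means choosing policies by imagined reward is unreliable.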

Planning Metrics

  • Closed-loop return: run model-predictive control (MPC) with the learned model as the simulator, measure real-world return.
  • Open-loop trajectory error: predict $T$ steps into the future and compare to reality. This degrades quickly for large $T$.
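The open-loop degradation is easy to demonstrate: roll the model forward from its own predictions and compare against the true trajectory at each step. A toy linear-dynamics sketch (the dynamics and the 1% mismatch are invented for illustration):

```python
import numpy as np

def open_loop_errors(true_step, model_step, s0, actions):
    """Roll both dynamics forward open-loop and record per-step state error."""
    s_true, s_model, errors = s0, s0, []
    for a in actions:
        s_true = true_step(s_true, a)
        s_model = model_step(s_model, a)  # model feeds on its own predictions
        errors.append(float(np.linalg.norm(s_true - s_model)))
    return errors

# Toy dynamics with a small multiplicative mismatch that compounds.
true_step  = lambda s, a: 1.05 * s + a
model_step = lambda s, a: 1.06 * s + a   # ~1% dynamics error

s0 = np.ones(2)
actions = [np.full(2, 0.1)] * 20
errs = open_loop_errors(true_step, model_step, s0, actions)
print(f"error at t=1: {errs[0]:.3f}, t=10: {errs[9]:.3f}, t=20: {errs[19]:.3f}")
```

Because the model consumes its own (slightly wrong) states, the error grows with the horizon, which is the open-loop analogue of the compounding-error bound above.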

Common Confusions

Watch Out

Good prediction does not imply good planning

A model can predict video frames with low FVD but still produce bad plans. This happens when the model's errors are concentrated on decision-relevant features (object positions, contact events) rather than visual details (textures, lighting). Evaluation must target the features that matter for the task.

Watch Out

Training loss is not evaluation

A world model's training loss (e.g., reconstruction loss, KL divergence in a VAE) measures in-distribution fit. Evaluation must test out-of-distribution generalization, especially for states the policy will visit that differ from the training data. A model can have low training loss and high planning error if the policy exploits model inaccuracies.

Watch Out

FVD evaluates distributions, not individual trajectories

FVD compares the distribution of generated videos to the distribution of real videos. A high FVD can result from the model generating plausible but wrong trajectories (mode mixing) or from missing some modes of the real distribution. It does not tell you whether a specific generated trajectory matches a specific real trajectory.
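The distribution-level nature of the metric can be illustrated with the quantity underlying FVD, the Fréchet distance between Gaussians fit to feature sets. This is a simplified diagonal-covariance sketch on random features; real FVD uses full covariances and features from a pretrained video network:

```python
import numpy as np

def frechet_diag(feats_a, feats_b):
    """Frechet distance between diagonal Gaussians fit to two feature sets:
    ||mu_a - mu_b||^2 + ||sd_a - sd_b||^2 (full FVD uses full covariances)."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    sd_a, sd_b = feats_a.std(0), feats_b.std(0)
    return float(np.sum((mu_a - mu_b) ** 2) + np.sum((sd_a - sd_b) ** 2))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(1000, 16))     # toy "real" features
# Per-trajectory errors that preserve the distribution barely move the metric:
shuffled = real[rng.permutation(1000)]           # same set, wrong pairing
shifted = real + 0.5                             # a genuine distribution shift
print(frechet_diag(real, shuffled))              # ~0: distributions identical
print(frechet_diag(real, shifted))               # ~4: 0.5 mean shift in 16 dims
```

The shuffled set scores near zero even though every generated trajectory is paired with the wrong ground truth, which is precisely the failure mode described above: FVD cannot certify per-trajectory correctness.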

Exercises

ExerciseCore

Problem

A learned model has single-step prediction error $\epsilon = 0.01$ in total variation (unnormalized convention). Using the compounding error bound $|V^{\pi} - \hat{V}^{\pi}| \leq 2\gamma\epsilon/(1-\gamma)^2$, estimate the value function error for $\gamma = 0.99$.

ExerciseAdvanced

Problem

A researcher reports that their world model achieves state-of-the-art FVD on a video prediction benchmark but the model-based RL agent trained in it performs worse than a model-free baseline. Give three possible explanations for this discrepancy.

References

Canonical:

  • Kearns & Singh, "Near-Optimal Reinforcement Learning in Polynomial Time," Machine Learning 49 (2002), original simulation lemma
  • Kakade, "On the Sample Complexity of Reinforcement Learning" (2003), PhD thesis, Ch. 2 for the unnormalized-TV form used here
  • Janner et al., "When to Trust Your Model: Model-Based Policy Optimization," NeurIPS (2019), arXiv:1906.08253
  • Luo et al., "A Survey on Model-Based Reinforcement Learning" (2022), arXiv:2206.09328

Current:

  • Unterthiner et al., "FVD: A New Metric for Video Generation," ICLR Workshop (2019)
  • Hafner et al., "Mastering Diverse Domains through World Models," arXiv:2301.04104 (2023)
  • Ross & Bagnell, "Agnostic System Identification for Model-Based RL," ICML (2012), tighter finite-horizon variants

Next Topics

  • Model-based RL: using world models for policy optimization

Last reviewed: April 22, 2026
