

World Models and Planning

Learning a model of the environment and planning inside it: Dreamer's latent dynamics, MuZero's learned model with MCTS, video world models, and why planning in imagination is the path to sample-efficient and safe AI.

Advanced · Tier 2 · Frontier · Supporting · ~60 min

Why This Matters

Model-free RL (Q-learning, PPO) treats the environment as a black box and learns purely from trial and error. This is sample-inefficient: DQN needs hundreds of millions of frames to learn Atari games that humans master in minutes. It is also unsafe: you cannot test an action before executing it.

World models invert this: learn a model of the environment, then plan inside it. Imagine trajectories, evaluate actions, and choose the best plan, all without touching the real environment. This is how humans navigate: we simulate outcomes mentally before acting. World models bring this capability to RL agents.

World models mix easy forward computation with hard inference over future actions

Encoding observations, rolling the latent dynamics forward, and predicting reward or value are forward passes. Planning is different: it still requires search or optimization over imagined futures to maximize the objective.

[Figure: factor graph linking observation oₜ and action aₜ through an encoder (enc) to latent zₜ, a dynamics block (dyn) to zₜ₊₁, reward (rew) and value (val) heads, and the objective factor J]

The D-shaped blocks are ordinary neural computations. The amber objective factor J is where planning still needs search or optimization over candidate futures and actions.

Diagram language

filled circle = observed variable or chosen action
hollow circle = latent variable the system must infer or roll forward
rectangle = objective term or local compatibility term
rounded-side box = forward-computable function
The symbol language follows classical factor-graph notation and matches Yann LeCun's April 2026 description of world models as energy-based factor graphs: filled circles for observed variables, hollow circles for latents, rectangles for factors, and rounded-side boxes for forward-computable functions.

What is easy here

Given an observation and an action, the encoder, dynamics, reward, and value heads are just forward passes. That is why latent world models can be trained end to end.

What is still hard

The planning objective still has to compare branches, which is why model error compounds with horizon. Forward computation is cheap; choosing the best imagined branch is not.

The key warning

The simulation-lemma pressure is still there: if one-step model error is ϵ\epsilon, open-loop planning error grows on the order of ϵH2\epsilon H^2 with horizon HH.


Mental Model

Think of a chess player analyzing a position. They do not need to play physical moves on a board. They simulate sequences of moves in their head, evaluate the resulting positions, and choose the best line. A world model is the learned "board" in the agent's head. Planning is the search over imagined move sequences.

The central tradeoff: a learned model is never perfect. Plans based on an imperfect model can be worse than model-free learning if the model errors compound over long horizons. The art of model-based RL is managing this tradeoff.

Formal Setup

Definition

Learned World Model

A learned world model consists of:

  • A representation function hθ:OZh_\theta: \mathcal{O} \to \mathcal{Z} mapping observations to latent states
  • A dynamics model fθ:Z×AZf_\theta: \mathcal{Z} \times \mathcal{A} \to \mathcal{Z} predicting the next latent state
  • A reward predictor rθ:Z×ARr_\theta: \mathcal{Z} \times \mathcal{A} \to \mathbb{R} predicting immediate reward
  • Optionally, a decoder dθ:ZOd_\theta: \mathcal{Z} \to \mathcal{O} reconstructing observations (used for training, not planning)

Given a current observation oto_t, the model can simulate forward: zt=hθ(ot)z_t = h_\theta(o_t), zt+1=fθ(zt,at)z_{t+1} = f_\theta(z_t, a_t), r^t=rθ(zt,at)\hat{r}_t = r_\theta(z_t, a_t), and so on for any sequence of actions.
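The definition above can be sketched in a few lines. This is a toy stand-in, not any real system: the encoder, dynamics, and reward head are random linear maps with tanh nonlinearities, and all names and dimensions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the learned components h_theta, f_theta, r_theta.
# Dimensions and weights are illustrative, not from any real system.
OBS_DIM, LATENT_DIM, ACT_DIM = 16, 8, 2
W_enc = rng.normal(size=(LATENT_DIM, OBS_DIM)) * 0.1               # encoder h_theta
W_dyn = rng.normal(size=(LATENT_DIM, LATENT_DIM + ACT_DIM)) * 0.1  # dynamics f_theta
w_rew = rng.normal(size=LATENT_DIM + ACT_DIM) * 0.1                # reward head r_theta

def encode(obs):
    return np.tanh(W_enc @ obs)

def dynamics(z, a):
    return np.tanh(W_dyn @ np.concatenate([z, a]))

def reward(z, a):
    return float(w_rew @ np.concatenate([z, a]))

def imagine(obs, actions):
    """Encode once, then roll the latent dynamics forward under a fixed action sequence."""
    z = encode(obs)
    rewards = []
    for a in actions:
        rewards.append(reward(z, a))
        z = dynamics(z, a)
    return z, rewards

obs = rng.normal(size=OBS_DIM)
actions = [rng.normal(size=ACT_DIM) for _ in range(5)]
z_final, rewards = imagine(obs, actions)
print(len(rewards), z_final.shape)
```

Note that the real observation is touched exactly once, at the encoder; everything after that lives in latent space.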

Definition

Planning

Planning uses the world model to select actions by searching over imagined trajectories. Given the current latent state ztz_t, planning evaluates candidate action sequences {at,at+1,,at+H}\{a_t, a_{t+1}, \ldots, a_{t+H}\} by simulating them through fθf_\theta and summing predicted rewards:

R^(at:t+H)=k=0Hγkrθ(zt+k,at+k),zt+k+1=fθ(zt+k,at+k)\hat{R}(a_{t:t+H}) = \sum_{k=0}^{H} \gamma^k r_\theta(z_{t+k}, a_{t+k}), \quad z_{t+k+1} = f_\theta(z_{t+k}, a_{t+k})

The agent executes the first action of the best plan and replans at the next step.
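The simplest instantiation of this planning loop is random shooting: sample many candidate action sequences, score each by the discounted imagined return above, and execute only the first action of the winner. A minimal sketch, using a made-up scalar latent model (the dynamics and reward here are illustrative assumptions, not from any paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy deterministic latent model (illustrative stand-ins for f_theta, r_theta).
# Reward is highest when the scalar latent sits near a "goal" value of 1.0.
def dynamics(z, a):
    return 0.9 * z + 0.5 * a

def reward(z, a):
    return -(z - 1.0) ** 2

def plan_random_shooting(z0, horizon=10, n_candidates=256, gamma=0.99):
    """Score random action sequences in the model; return the best first action."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=horizon)
        z, total = z0, 0.0
        for k, a in enumerate(actions):
            total += gamma**k * reward(z, a)
            z = dynamics(z, a)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action, best_return

# MPC-style loop: execute only the first action, then replan from the next state.
a0, ret = plan_random_shooting(z0=0.0)
print(f"first action {a0:.3f}, imagined return {ret:.3f}")
```

Replanning every step is what keeps model error from compounding over the full episode: only the first action of each imagined plan ever meets reality.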

Definition

Simulation Lemma

The simulation lemma quantifies how model errors affect planning quality. If the model has per-step error ϵ\epsilon (in transition prediction), then over a horizon HH, the value estimate error grows as O(ϵH2)O(\epsilon H^2) in the worst case. This quadratic growth in horizon is the fundamental limitation of model-based planning.

Main Theorems

Theorem

Model-Based RL Regret via Simulation Lemma

Statement

Let P^\hat{P} be a learned transition model with P^(s,a)P(s,a)1ϵ\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\|_1 \leq \epsilon for all (s,a)(s,a). Let π^\hat{\pi} be the policy obtained by planning optimally in P^\hat{P}. Then the performance gap between π^\hat{\pi} and the true optimal policy π\pi^* satisfies:

Vπ(s)Vπ^(s)2γϵRmax(1γ)2V^{\pi^*}(s) - V^{\hat{\pi}}(s) \leq \frac{2\gamma \epsilon R_{\max}}{(1-\gamma)^2}

For a finite horizon HH, the bound becomes O(ϵH2Rmax)O(\epsilon H^2 R_{\max}).

Intuition

Each step of planning introduces an error of order ϵ\epsilon (the model is wrong by ϵ\epsilon in TV distance). Over an effective horizon of 1/(1γ)1/(1-\gamma) steps, these errors accumulate. The (1γ)2(1-\gamma)^{-2} dependence means that long-horizon problems (small 1γ1-\gamma) amplify model errors quadratically. This is why model-based methods struggle with long-horizon planning unless the model is very accurate.

Proof Sketch

Decompose the value difference using a telescoping sum over time steps. At each step, the value under the true dynamics differs from the value under the model dynamics by at most γϵVγϵRmax/(1γ)\gamma \epsilon \|V^*\|_\infty \leq \gamma \epsilon R_{\max}/(1-\gamma). Summing over the effective horizon 1/(1γ)1/(1-\gamma) gives the result.
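For readers who want one more step of detail, the per-step gap in the sketch can be written out explicitly. This is still a sketch under the theorem's assumptions, using the same notation as the statement:

```latex
% Holder's inequality bounds the gap between Bellman backups under P and \hat{P}:
\Bigl| \gamma \sum_{s'} \bigl(\hat{P}(s' \mid s,a) - P(s' \mid s,a)\bigr) V^{*}(s') \Bigr|
\;\le\; \gamma \,\bigl\|\hat{P}(\cdot \mid s,a) - P(\cdot \mid s,a)\bigr\|_{1}\, \|V^{*}\|_{\infty}
\;\le\; \frac{\gamma \epsilon R_{\max}}{1-\gamma}.
% Summing this per-step gap over the effective horizon 1/(1-\gamma) bounds
% |V^{\pi} - \hat{V}^{\pi}| for any fixed policy \pi. Applying that once to \pi^*
% and once to \hat{\pi} (which is optimal in \hat{P}, so \hat{V}^{\hat{\pi}} \ge
% \hat{V}^{\pi^*}) gives the factor of 2 in 2\gamma\epsilon R_{\max}/(1-\gamma)^2.
```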

Why It Matters

This theorem explains both the promise and the limitation of world models. The promise: if ϵ\epsilon is small, model-based methods can find near-optimal policies without ever executing suboptimal actions in the real environment (sample efficiency). The limitation: the quadratic dependence on horizon means small model errors become large planning errors over long time scales. This motivates learning in latent space (where models can be more accurate) and short-horizon planning with replanning.

Failure Mode

The O(ϵH2)O(\epsilon H^2) bound is a worst-case guarantee and is often very loose: in practice, random noise in model errors partially cancels across rollouts. The bound is tightest when errors are correlated across states (systematic bias rather than random noise). A model that consistently predicts slower dynamics, for example, produces systematically biased plans, and the realized error approaches the worst-case rate.
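Plugging numbers into the bound makes the horizon sensitivity concrete: halving 1γ1-\gamma roughly quadruples the worst-case planning error.

```python
# Numeric illustration of the simulation-lemma bound 2*gamma*eps*Rmax/(1-gamma)^2.
def sim_lemma_bound(eps, gamma, r_max=1.0):
    return 2.0 * gamma * eps * r_max / (1.0 - gamma) ** 2

for gamma in (0.9, 0.95, 0.99):
    # gamma = 0.99 gives ~198 even for a 1% per-step model error.
    print(gamma, sim_lemma_bound(eps=0.01, gamma=gamma))
```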

Dreamer: Latent World Models

The Dreamer family (v1, v2, v3) learns a latent-space world model and trains a policy entirely on imagined trajectories.

Architecture:

  1. Encoder hθh_\theta: maps image observations to latent states
  2. Recurrent State Space Model (RSSM): combines deterministic recurrence with stochastic latent variables for dynamics prediction
  3. Reward predictor and continuation predictor (predicts episode termination)
  4. Decoder: reconstructs observations from latent states (for model training)

Training loop:

Proposition

Dreamer Imagination-Based Policy Optimization

Statement

Dreamer optimizes the policy πθ\pi_\theta by maximizing the expected imagined return:

J(θ)=Eπθ,fθ[t=0Hγtr^t]J(\theta) = \mathbb{E}_{\pi_\theta, f_\theta} \left[ \sum_{t=0}^{H} \gamma^t \hat{r}_t \right]

where the expectation is over trajectories generated by rolling out πθ\pi_\theta in the learned world model fθf_\theta. The policy gradient is computed by backpropagating through the differentiable world model (no REINFORCE needed).

The value function VψV_\psi is trained on imagined trajectories to compute λ\lambda-returns for the actor update, analogous to GAE in model-free actor-critic.

Intuition

Because the world model is a differentiable neural network, you can compute analytic gradients of the imagined return with respect to the policy parameters. This is structurally different from model-free policy gradients, which must estimate gradients from sampled rewards. Dreamer turns RL into supervised learning: the "data" is imagined trajectories, and the "labels" are the predicted rewards.

Why It Matters

Dreamer achieves leading sample efficiency on visual control tasks. Training the policy on imagined data means the agent can improve without additional real-world interactions. Dreamer v3 matches or exceeds model-free methods across diverse domains (Atari, DMC, Minecraft) while using 10-50x fewer environment steps.
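The backprop-through-model idea can be sketched with a one-parameter policy and a scalar latent. Real Dreamer computes analytic gradients through the RSSM with autodiff; here a finite-difference gradient stands in so the example needs no deep-learning framework, and the dynamics, reward, and policy are illustrative toys.

```python
# Minimal sketch of Dreamer-style imagination-based policy optimization:
# improve a policy using only rollouts through a (differentiable) model.
GAMMA, HORIZON = 0.99, 15

def dynamics(z, a):      # toy differentiable dynamics f_theta
    return 0.9 * z + 0.2 * a

def reward(z, a):        # reward peaks when the latent sits at 1.0
    return -(z - 1.0) ** 2

def imagined_return(k, z0=0.0):
    """Imagined return J(k) for a degenerate one-parameter 'policy' a = k."""
    z, total = z0, 0.0
    for t in range(HORIZON):
        a = k
        total += GAMMA**t * reward(z, a)
        z = dynamics(z, a)
    return total

# Gradient ascent on the imagined return; a central finite difference stands
# in for the autodiff gradient that Dreamer would backpropagate.
k = 0.0
for _ in range(200):
    grad = (imagined_return(k + 1e-4) - imagined_return(k - 1e-4)) / 2e-4
    k += 0.05 * grad
print(f"k = {k:.3f}, J(k) = {imagined_return(k):.3f}, J(0) = {imagined_return(0.0):.3f}")
```

No environment step is ever taken inside this loop; the policy improves purely against the model, which is exactly the source of Dreamer's sample efficiency and of its exposure to model error.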

MuZero: Learned Model + Tree Search

MuZero (DeepMind, 2020) combines a learned model with Monte Carlo Tree Search (MCTS), achieving superhuman performance on Go, chess, shogi, and Atari without knowing the rules of any game.

Key components:

  1. Representation function hh: maps observation to initial latent state
  2. Dynamics function gg: given latent state and action, predicts next latent state and immediate reward
  3. Prediction function ff: given latent state, predicts policy and value (as in AlphaZero)

Critical insight: MuZero's dynamics function does not predict observations (pixels). It predicts latent states that are useful for planning. The model is trained end-to-end to produce accurate value and policy predictions after multiple steps of model rollout, not to reconstruct the environment faithfully.

The MCTS planning procedure uses the learned model to simulate forward and backpropagate value estimates through the search tree, just as AlphaZero does with the true game rules.
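The "no observation reconstruction" point can be made concrete with a sketch of MuZero's K-step unrolled training loss. The three networks below are toy stand-ins (shapes, names, and dynamics are illustrative assumptions); what matters is that the loss touches only reward and value predictions, never pixels.

```python
import numpy as np

rng = np.random.default_rng(2)
LATENT, K = 4, 3

def h(obs):                       # representation: observation -> latent
    return np.tanh(obs[:LATENT])

def g(z, a):                      # dynamics: (latent, action) -> (next latent, reward)
    z_next = np.tanh(np.roll(z, 1) + 0.1 * a)
    return z_next, float(z.sum()) * 0.1

def f(z):                         # prediction: latent -> (policy logits, value)
    return z[:2], float(z.mean())

def unrolled_loss(obs, actions, target_values, target_rewards):
    """Accumulate value and reward error along a K-step latent rollout.
    (A real MuZero loss also includes a policy term against MCTS visit counts.)"""
    z, loss = h(obs), 0.0
    for k in range(K):
        _, value = f(z)
        loss += (value - target_values[k]) ** 2
        z, r_hat = g(z, actions[k])
        loss += (r_hat - target_rewards[k]) ** 2
    return loss

obs = rng.normal(size=8)
actions = rng.normal(size=K)
loss = unrolled_loss(obs, actions, target_values=np.zeros(K), target_rewards=np.zeros(K))
print(loss)
```

Because gradients flow back through g for K steps, the latent states are shaped to support accurate multi-step value prediction rather than faithful reconstruction of the environment.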

Video World Models

A recent frontier: using large pretrained video generation models as world simulators. The idea is that a model trained to predict future video frames has implicitly learned physics, object permanence, and dynamics.

Approach:

  1. Train (or use a pretrained) video diffusion model on large-scale video data
  2. Condition on the current frame and a proposed action (e.g., joystick input)
  3. Generate future frames as a simulation of the action's consequences
  4. Use the generated video for planning or policy training

Key challenges:

  • Controllability: standard video models predict what will happen, not what happens given a specific action. Action-conditioned generation requires architectural changes or fine-tuning
  • Consistency: generated videos can drift or hallucinate over long horizons
  • Speed: diffusion-based generation is slow, limiting the number of imagined trajectories that can be evaluated for planning

This approach has shown promising results in game environments and simple robotic settings, but the computational cost and consistency challenges remain significant barriers for real-time planning.

Google DeepMind's Genie family sharpened this story in stages. The 2024 Genie paper showed controllable environments from unlabeled internet video, while Genie 2 and Genie 3 extended the idea toward larger, more coherent, real-time 3D worlds in official system reports. The important caveat is evidence quality: some claims are backed by peer-reviewed papers, while the most recent frontier systems are still documented primarily through official research demos and technical blog posts.

Model-Free vs. Model-Based

| | Model-Free | Model-Based |
| --- | --- | --- |
| Sample efficiency | Low (millions of steps) | High (thousands of steps) |
| Computation per step | Low | High (model rollouts + planning) |
| Asymptotic performance | Can be optimal | Limited by model accuracy |
| Safety | Must try dangerous actions | Can simulate before acting |
| Long-horizon tasks | Robust (no compounding error) | Degrades as horizon grows |

In practice, the best systems combine both: use a model for short-horizon planning and value estimation, but ground decisions in real experience to correct model errors. Dreamer exemplifies this: the model generates training data, but the policy is evaluated in the real environment.

Common Confusions

Watch Out

World models do not need to predict pixels

Early world models (Ha & Schmidhuber, 2018) generated pixel-level predictions. Modern approaches (MuZero, Dreamer) learn latent dynamics that never produce pixels during planning. The decoder is a training aid, not a planning component. Predicting in latent space is faster, more compact, and avoids wasting model capacity on irrelevant visual details.

Watch Out

Planning does not require a perfect model

A common objection is that model errors make planning useless. In reality, even crude models enable useful planning when combined with (1) short planning horizons with frequent replanning, (2) uncertainty estimation to avoid relying on uncertain predictions, and (3) real-world experience to correct model-based decisions. MuZero demonstrates superhuman performance despite imperfect latent dynamics.

Watch Out

LLMs are not world models in the RL sense

Language models can predict consequences of actions described in text, but they do not learn dynamics in a way that supports systematic search and planning. An RL world model must support repeated forward simulation at arbitrary action sequences, which current LLMs cannot do efficiently or accurately for physical environments. The relationship between LLM "world knowledge" and formal world models is an open research question.

Summary

  • World models: learn fθ(zt,at)zt+1f_\theta(z_t, a_t) \to z_{t+1}, then plan by simulating imagined trajectories
  • Simulation lemma: model error ϵ\epsilon causes O(ϵ/(1γ)2)O(\epsilon/(1-\gamma)^2) planning error, quadratic in the effective horizon
  • Dreamer: latent RSSM world model, policy trained on imagined trajectories, backpropagation through differentiable model
  • MuZero: learned latent model + MCTS, does not predict observations, trained end-to-end for value/policy accuracy
  • Video world models: pretrained video generators as environment simulators
  • Model-based RL trades computation for sample efficiency

Exercises

ExerciseCore

Problem

If a learned model has per-step TV distance error ϵ=0.01\epsilon = 0.01 and γ=0.99\gamma = 0.99, what is the worst-case value estimation error according to the simulation lemma? Assume Rmax=1R_{\max} = 1.

ExerciseAdvanced

Problem

MuZero does not train its dynamics model to predict observations, only to produce accurate value and policy predictions after KK steps of model rollout. Why is this better than training the model to minimize observation prediction error? Give a concrete example where the two objectives disagree.

ExerciseResearch

Problem

The simulation lemma gives an O(ϵH2)O(\epsilon H^2) error bound for planning with an imperfect model. Can you design a planning algorithm that achieves O(ϵH)O(\epsilon H) error instead? Under what additional assumptions?


References

Canonical:

  • Ha & Schmidhuber, "World Models" (NeurIPS 2018)
  • Schrittwieser et al., "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" (Nature 2020). MuZero

Current:

  • Hafner et al., "Mastering Diverse Domains through World Models" (2023). Dreamer v3
  • Yang et al., "Learning Interactive Real-World Simulators" (UniSim, 2023). action-conditioned real-world simulation
  • Bruce et al., "Genie: Generative Interactive Environments" (2024). video world models
  • Google DeepMind, "Genie 2: A large-scale foundation world model" (December 4, 2024). official research report
  • Google DeepMind, "Genie 3: A new frontier for world models" (August 2025). official system preview
  • Yann LeCun, LinkedIn post "Oh yeah, world models are energy-based factor graphs" (April 2026). Short note on the factor-graph symbol conventions used for observed variables, latents, factors, and forward-computable functions.


Last reviewed: April 26, 2026
