

World Models and Planning

Learning a model of the environment and planning inside it: Dreamer's latent dynamics, MuZero's learned model with MCTS, video world models, and why planning in imagination is the path to sample-efficient and safe AI.

Advanced · Tier 2 · Frontier · Supporting · ~60 min

Why This Matters

Model-free RL (Q-learning, PPO) treats the environment as a black box and learns purely from trial and error. This is sample-inefficient: DQN needs hundreds of millions of frames to learn Atari games that humans master in minutes. It is also unsafe: you cannot test an action before executing it.

World models invert this: learn a model of the environment, then plan inside it. Imagine trajectories, evaluate actions, and choose the best plan, all without touching the real environment. This is how humans navigate: we simulate outcomes mentally before acting. World models bring this capability to RL agents.

World models mix easy forward computation with hard inference over future actions

Encoding observations, rolling the latent dynamics forward, and predicting reward or value are forward passes. Planning is different: it still requires search or optimization over imagined futures to maximize the objective.

[Figure: factor graph linking observation oₜ and action aₜ through an encoder (enc) to latent zₜ, a dynamics block (dyn) to zₜ₊₁, reward (rew) and value (val) heads, and the objective factor J]

The D-shaped blocks are ordinary neural computations. The amber objective factor J is where planning still needs search or optimization over candidate futures and actions.

Diagram language

filled circle = observed variable or chosen action
hollow circle = latent variable the system must infer or roll forward
rectangle = objective term or local compatibility term
rounded-side box = forward-computable function
The symbol language follows classical factor-graph notation and matches Yann LeCun's April 2026 description of world models as energy-based factor graphs: filled circles for observed variables, hollow circles for latents, rectangles for factors, and rounded-side boxes for forward-computable functions.

What is easy here

Given an observation and an action, the encoder, dynamics, reward, and value heads are just forward passes. That is why latent world models can be trained end to end.

What is still hard

The planning objective still has to compare branches, which is why model error compounds with horizon. Forward computation is cheap; choosing the best imagined branch is not.

The key warning

The simulation-lemma pressure is still there: if one-step model error is ϵ\epsilon, open-loop planning error grows on the order of ϵH2\epsilon H^2 with horizon HH.


Mental Model

Think of a chess player analyzing a position. They do not need to play physical moves on a board. They simulate sequences of moves in their head, evaluate the resulting positions, and choose the best line. A world model is the learned "board" in the agent's head. Planning is the search over imagined move sequences.

The central tradeoff: a learned model is never perfect. Plans based on an imperfect model can be worse than model-free learning if the model errors compound over long horizons. The art of model-based RL is managing this tradeoff.

Formal Setup

Definition

Learned World Model

A learned world model consists of:

  • A representation function hθ:OZh_\theta: \mathcal{O} \to \mathcal{Z} mapping observations to latent states
  • A dynamics model fθ:Z×AZf_\theta: \mathcal{Z} \times \mathcal{A} \to \mathcal{Z} predicting the next latent state
  • A reward predictor rθ:Z×ARr_\theta: \mathcal{Z} \times \mathcal{A} \to \mathbb{R} predicting immediate reward
  • Optionally, a decoder dθ:ZOd_\theta: \mathcal{Z} \to \mathcal{O} reconstructing observations (used for training, not planning)

Given a current observation oto_t, the model can simulate forward: zt=hθ(ot)z_t = h_\theta(o_t), zt+1=fθ(zt,at)z_{t+1} = f_\theta(z_t, a_t), r^t=rθ(zt,at)\hat{r}_t = r_\theta(z_t, a_t), and so on for any sequence of actions.
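The definition above can be sketched in a few lines. This is a toy stand-in, not any real system: the encoder, dynamics, and reward head are random linear maps with tanh nonlinearities, and all names and dimensions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the learned components h_theta, f_theta, r_theta.
# Dimensions and weights are illustrative, not from any real system.
OBS_DIM, LATENT_DIM, ACT_DIM = 16, 8, 2
W_enc = rng.normal(size=(LATENT_DIM, OBS_DIM)) * 0.1               # encoder h_theta
W_dyn = rng.normal(size=(LATENT_DIM, LATENT_DIM + ACT_DIM)) * 0.1  # dynamics f_theta
w_rew = rng.normal(size=LATENT_DIM + ACT_DIM) * 0.1                # reward head r_theta

def encode(obs):
    return np.tanh(W_enc @ obs)

def dynamics(z, a):
    return np.tanh(W_dyn @ np.concatenate([z, a]))

def reward(z, a):
    return float(w_rew @ np.concatenate([z, a]))

def imagine(obs, actions):
    """Encode once, then roll the latent dynamics forward under a fixed action sequence."""
    z = encode(obs)
    rewards = []
    for a in actions:
        rewards.append(reward(z, a))
        z = dynamics(z, a)
    return z, rewards

obs = rng.normal(size=OBS_DIM)
actions = [rng.normal(size=ACT_DIM) for _ in range(5)]
z_final, rewards = imagine(obs, actions)
print(len(rewards), z_final.shape)
```

Note that the real observation is touched exactly once, at the encoder; everything after that lives in latent space.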

Definition

Planning

Planning uses the world model to select actions by searching over imagined trajectories. Given the current latent state ztz_t, planning evaluates candidate action sequences {at,at+1,,at+H}\{a_t, a_{t+1}, \ldots, a_{t+H}\} by simulating them through fθf_\theta and summing predicted rewards:

R^(at:t+H)=k=0Hγkrθ(zt+k,at+k),zt+k+1=fθ(zt+k,at+k)\hat{R}(a_{t:t+H}) = \sum_{k=0}^{H} \gamma^k r_\theta(z_{t+k}, a_{t+k}), \quad z_{t+k+1} = f_\theta(z_{t+k}, a_{t+k})

The agent executes the first action of the best plan and replans at the next step.
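The simplest instantiation of this planning loop is random shooting: sample many candidate action sequences, score each by the discounted imagined return above, and execute only the first action of the winner. A minimal sketch, using a made-up scalar latent model (the dynamics and reward here are illustrative assumptions, not from any paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy deterministic latent model (illustrative stand-ins for f_theta, r_theta).
# Reward is highest when the scalar latent sits near a "goal" value of 1.0.
def dynamics(z, a):
    return 0.9 * z + 0.5 * a

def reward(z, a):
    return -(z - 1.0) ** 2

def plan_random_shooting(z0, horizon=10, n_candidates=256, gamma=0.99):
    """Score random action sequences in the model; return the best first action."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=horizon)
        z, total = z0, 0.0
        for k, a in enumerate(actions):
            total += gamma**k * reward(z, a)
            z = dynamics(z, a)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action, best_return

# MPC-style loop: execute only the first action, then replan from the next state.
a0, ret = plan_random_shooting(z0=0.0)
print(f"first action {a0:.3f}, imagined return {ret:.3f}")
```

Replanning every step is what keeps model error from compounding over the full episode: only the first action of each imagined plan ever meets reality.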

Definition

Simulation Lemma

The simulation lemma quantifies how model errors affect planning quality. If the model has per-step error ϵ\epsilon (in transition prediction), then over a horizon HH, the value estimate error grows as O(ϵH2)O(\epsilon H^2) in the worst case. This quadratic growth in horizon is the fundamental limitation of model-based planning.

Main Theorems

Theorem

Model-Based RL Regret via Simulation Lemma

Statement

Let P^\hat{P} be a learned transition model with P^(s,a)P(s,a)1ϵ\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\|_1 \leq \epsilon for all (s,a)(s,a). Let π^\hat{\pi} be the policy obtained by planning optimally in P^\hat{P}. Then the performance gap between π^\hat{\pi} and the true optimal policy π\pi^* satisfies:

Vπ(s)Vπ^(s)2γϵRmax(1γ)2V^{\pi^*}(s) - V^{\hat{\pi}}(s) \leq \frac{2\gamma \epsilon R_{\max}}{(1-\gamma)^2}

For a finite horizon HH, the bound becomes O(ϵH2Rmax)O(\epsilon H^2 R_{\max}).

Intuition

Each step of planning introduces an error of order ϵ\epsilon (the model is wrong by ϵ\epsilon in TV distance). Over an effective horizon of 1/(1γ)1/(1-\gamma) steps, these errors accumulate. The (1γ)2(1-\gamma)^{-2} dependence means that long-horizon problems (small 1γ1-\gamma) amplify model errors quadratically. This is why model-based methods struggle with long-horizon planning unless the model is very accurate.

Proof Sketch

Decompose the value difference using a telescoping sum over time steps. At each step, the value under the true dynamics differs from the value under the model dynamics by at most γϵVγϵRmax/(1γ)\gamma \epsilon \|V^*\|_\infty \leq \gamma \epsilon R_{\max}/(1-\gamma). Summing over the effective horizon 1/(1γ)1/(1-\gamma) gives the result.
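For readers who want one more step of detail, the per-step gap in the sketch can be written out explicitly. This is still a sketch under the theorem's assumptions, using the same notation as the statement:

```latex
% Holder's inequality bounds the gap between Bellman backups under P and \hat{P}:
\Bigl| \gamma \sum_{s'} \bigl(\hat{P}(s' \mid s,a) - P(s' \mid s,a)\bigr) V^{*}(s') \Bigr|
\;\le\; \gamma \,\bigl\|\hat{P}(\cdot \mid s,a) - P(\cdot \mid s,a)\bigr\|_{1}\, \|V^{*}\|_{\infty}
\;\le\; \frac{\gamma \epsilon R_{\max}}{1-\gamma}.
% Summing this per-step gap over the effective horizon 1/(1-\gamma) bounds
% |V^{\pi} - \hat{V}^{\pi}| for any fixed policy \pi. Applying that once to \pi^*
% and once to \hat{\pi} (which is optimal in \hat{P}, so \hat{V}^{\hat{\pi}} \ge
% \hat{V}^{\pi^*}) gives the factor of 2 in 2\gamma\epsilon R_{\max}/(1-\gamma)^2.
```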

Why It Matters

This theorem explains both the promise and the limitation of world models. The promise: if ϵ\epsilon is small, model-based methods can find near-optimal policies without ever executing suboptimal actions in the real environment (sample efficiency). The limitation: the quadratic dependence on horizon means small model errors become large planning errors over long time scales. This motivates learning in latent space (where models can be more accurate) and short-horizon planning with replanning.

Failure Mode

The O(ϵH2)O(\epsilon H^2) bound is a worst-case guarantee and is often very loose: in practice, random noise in model errors partially cancels across rollouts. The bound is tightest when errors are correlated across states (systematic bias rather than random noise). A model that consistently predicts slower dynamics, for example, produces systematically biased plans, and the realized error approaches the worst-case rate.
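Plugging numbers into the bound makes the horizon sensitivity concrete: halving 1γ1-\gamma roughly quadruples the worst-case planning error.

```python
# Numeric illustration of the simulation-lemma bound 2*gamma*eps*Rmax/(1-gamma)^2.
def sim_lemma_bound(eps, gamma, r_max=1.0):
    return 2.0 * gamma * eps * r_max / (1.0 - gamma) ** 2

for gamma in (0.9, 0.95, 0.99):
    # gamma = 0.99 gives ~198 even for a 1% per-step model error.
    print(gamma, sim_lemma_bound(eps=0.01, gamma=gamma))
```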

Dreamer: Latent World Models

The Dreamer family (v1, v2, v3) learns a latent-space world model and trains a policy entirely on imagined trajectories.

Architecture:

  1. Encoder hθh_\theta: maps image observations to latent states
  2. Recurrent State Space Model (RSSM): combines deterministic recurrence with stochastic latent variables for dynamics prediction
  3. Reward predictor and continuation predictor (predicts episode termination)
  4. Decoder: reconstructs observations from latent states (for model training)

Training loop:

Proposition

Dreamer Imagination-Based Policy Optimization

Statement

Dreamer optimizes the policy πθ\pi_\theta by maximizing the expected imagined return:

J(θ)=Eπθ,fθ[t=0Hγtr^t]J(\theta) = \mathbb{E}_{\pi_\theta, f_\theta} \left[ \sum_{t=0}^{H} \gamma^t \hat{r}_t \right]

where the expectation is over trajectories generated by rolling out πθ\pi_\theta in the learned world model fθf_\theta. The policy gradient is computed by backpropagating through the differentiable world model (no REINFORCE needed).

The value function VψV_\psi is trained on imagined trajectories to compute λ\lambda-returns for the actor update, analogous to GAE in model-free actor-critic.

Intuition

Because the world model is a differentiable neural network, you can compute analytic gradients of the imagined return with respect to the policy parameters. This is structurally different from model-free policy gradients, which must estimate gradients from sampled rewards. Dreamer turns RL into supervised learning: the "data" is imagined trajectories, and the "labels" are the predicted rewards.

Why It Matters

Dreamer achieves leading sample efficiency on visual control tasks. Training the policy on imagined data means the agent can improve without additional real-world interactions. Dreamer v3 matches or exceeds model-free methods across diverse domains (Atari, DMC, Minecraft) while using 10-50x fewer environment steps.
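The backprop-through-model idea can be sketched with a one-parameter policy and a scalar latent. Real Dreamer computes analytic gradients through the RSSM with autodiff; here a finite-difference gradient stands in so the example needs no deep-learning framework, and the dynamics, reward, and policy are illustrative toys.

```python
# Minimal sketch of Dreamer-style imagination-based policy optimization:
# improve a policy using only rollouts through a (differentiable) model.
GAMMA, HORIZON = 0.99, 15

def dynamics(z, a):      # toy differentiable dynamics f_theta
    return 0.9 * z + 0.2 * a

def reward(z, a):        # reward peaks when the latent sits at 1.0
    return -(z - 1.0) ** 2

def imagined_return(k, z0=0.0):
    """Imagined return J(k) for a degenerate one-parameter 'policy' a = k."""
    z, total = z0, 0.0
    for t in range(HORIZON):
        a = k
        total += GAMMA**t * reward(z, a)
        z = dynamics(z, a)
    return total

# Gradient ascent on the imagined return; a central finite difference stands
# in for the autodiff gradient that Dreamer would backpropagate.
k = 0.0
for _ in range(200):
    grad = (imagined_return(k + 1e-4) - imagined_return(k - 1e-4)) / 2e-4
    k += 0.05 * grad
print(f"k = {k:.3f}, J(k) = {imagined_return(k):.3f}, J(0) = {imagined_return(0.0):.3f}")
```

No environment step is ever taken inside this loop; the policy improves purely against the model, which is exactly the source of Dreamer's sample efficiency and of its exposure to model error.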

MuZero: Learned Model + Tree Search

MuZero (DeepMind, 2020) combines a learned model with Monte Carlo Tree Search (MCTS), achieving superhuman performance on Go, chess, shogi, and Atari without knowing the rules of any game.

Key components:

  1. Representation function hh: maps observation to initial latent state
  2. Dynamics function gg: given latent state and action, predicts next latent state and immediate reward
  3. Prediction function ff: given latent state, predicts policy and value (as in AlphaZero)

Critical insight: MuZero's dynamics function does not predict observations (pixels). It predicts latent states that are useful for planning. The model is trained end-to-end to produce accurate value and policy predictions after multiple steps of model rollout, not to reconstruct the environment faithfully.

The MCTS planning procedure uses the learned model to simulate forward and backpropagate value estimates through the search tree, just as AlphaZero does with the true game rules.
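The "no observation reconstruction" point can be made concrete with a sketch of MuZero's K-step unrolled training loss. The three networks below are toy stand-ins (shapes, names, and dynamics are illustrative assumptions); what matters is that the loss touches only reward and value predictions, never pixels.

```python
import numpy as np

rng = np.random.default_rng(2)
LATENT, K = 4, 3

def h(obs):                       # representation: observation -> latent
    return np.tanh(obs[:LATENT])

def g(z, a):                      # dynamics: (latent, action) -> (next latent, reward)
    z_next = np.tanh(np.roll(z, 1) + 0.1 * a)
    return z_next, float(z.sum()) * 0.1

def f(z):                         # prediction: latent -> (policy logits, value)
    return z[:2], float(z.mean())

def unrolled_loss(obs, actions, target_values, target_rewards):
    """Accumulate value and reward error along a K-step latent rollout.
    (A real MuZero loss also includes a policy term against MCTS visit counts.)"""
    z, loss = h(obs), 0.0
    for k in range(K):
        _, value = f(z)
        loss += (value - target_values[k]) ** 2
        z, r_hat = g(z, actions[k])
        loss += (r_hat - target_rewards[k]) ** 2
    return loss

obs = rng.normal(size=8)
actions = rng.normal(size=K)
loss = unrolled_loss(obs, actions, target_values=np.zeros(K), target_rewards=np.zeros(K))
print(loss)
```

Because gradients flow back through g for K steps, the latent states are shaped to support accurate multi-step value prediction rather than faithful reconstruction of the environment.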

Video World Models

A recent frontier: using large pretrained video generation models as world simulators. The idea is that a model trained to predict future video frames has implicitly learned physics, object permanence, and dynamics.

Approach:

  1. Train (or use a pretrained) video diffusion model on large-scale video data
  2. Condition on the current frame and a proposed action (e.g., joystick input)
  3. Generate future frames as a simulation of the action's consequences
  4. Use the generated video for planning or policy training

Key challenges:

  • Controllability: standard video models predict what will happen, not what happens given a specific action. Action-conditioned generation requires architectural changes or fine-tuning
  • Consistency: generated videos can drift or hallucinate over long horizons
  • Speed: diffusion-based generation is slow, limiting the number of imagined trajectories that can be evaluated for planning

This approach has shown promising results in game environments and simple robotic settings, but the computational cost and consistency challenges remain significant barriers for real-time planning.

Google DeepMind's Genie family sharpened this story in stages. The 2024 Genie paper showed controllable environments from unlabeled internet video, while Genie 2 and Genie 3 extended the idea toward larger, more coherent, real-time 3D worlds in official system reports. The important caveat is evidence quality: some claims are backed by peer-reviewed papers, while the most recent frontier systems are still documented primarily through official research demos and technical blog posts.

Model-Free vs. Model-Based

| | Model-Free | Model-Based |
| --- | --- | --- |
| Sample efficiency | Low (millions of steps) | High (thousands of steps) |
| Computation per step | Low | High (model rollouts + planning) |
| Asymptotic performance | Can be optimal | Limited by model accuracy |
| Safety | Must try dangerous actions | Can simulate before acting |
| Long-horizon tasks | Robust (no compounding error) | Degrades as horizon grows |

In practice, the best systems combine both: use a model for short-horizon planning and value estimation, but ground decisions in real experience to correct model errors. Dreamer exemplifies this: the model generates training data, but the policy is evaluated in the real environment.

Common Confusions

Watch Out

World models do not need to predict pixels

Early world models (Ha & Schmidhuber, 2018) generated pixel-level predictions. Modern approaches (MuZero, Dreamer) learn latent dynamics that never produce pixels during planning. The decoder is a training aid, not a planning component. Predicting in latent space is faster, more compact, and avoids wasting model capacity on irrelevant visual details.

Watch Out

Planning does not require a perfect model

A common objection is that model errors make planning useless. In reality, even crude models enable useful planning when combined with (1) short planning horizons with frequent replanning, (2) uncertainty estimation to avoid relying on uncertain predictions, and (3) real-world experience to correct model-based decisions. MuZero demonstrates superhuman performance despite imperfect latent dynamics.

Watch Out

LLMs are not world models in the RL sense

Language models can predict consequences of actions described in text, but they do not learn dynamics in a way that supports systematic search and planning. An RL world model must support repeated forward simulation at arbitrary action sequences, which current LLMs cannot do efficiently or accurately for physical environments. The relationship between LLM "world knowledge" and formal world models is an open research question.

Summary

  • World models: learn fθ(zt,at)zt+1f_\theta(z_t, a_t) \to z_{t+1}, then plan by simulating imagined trajectories
  • Simulation lemma: model error ϵ\epsilon causes O(ϵ/(1γ)2)O(\epsilon/(1-\gamma)^2) planning error, quadratic in the effective horizon
  • Dreamer: latent RSSM world model, policy trained on imagined trajectories, backpropagation through differentiable model
  • MuZero: learned latent model + MCTS, does not predict observations, trained end-to-end for value/policy accuracy
  • Video world models: pretrained video generators as environment simulators
  • Model-based RL trades computation for sample efficiency

Exercises

ExerciseCore

Problem

If a learned model has per-step TV distance error ϵ=0.01\epsilon = 0.01 and γ=0.99\gamma = 0.99, what is the worst-case value estimation error according to the simulation lemma? Assume Rmax=1R_{\max} = 1.

ExerciseAdvanced

Problem

MuZero does not train its dynamics model to predict observations, only to produce accurate value and policy predictions after KK steps of model rollout. Why is this better than training the model to minimize observation prediction error? Give a concrete example where the two objectives disagree.

ExerciseResearch

Problem

The simulation lemma gives an O(ϵH2)O(\epsilon H^2) error bound for planning with an imperfect model. Can you design a planning algorithm that achieves O(ϵH)O(\epsilon H) error instead? Under what additional assumptions?


References

Canonical:

  • Ha & Schmidhuber, "World Models" (NeurIPS 2018)
  • Schrittwieser et al., "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" (Nature 2020). MuZero

Current:

  • Hafner et al., "Mastering Diverse Domains through World Models" (2023). Dreamer v3
  • Yang et al., "Learning Interactive Real-World Simulators" (UniSim, 2023). action-conditioned real-world simulation
  • Bruce et al., "Genie: Generative Interactive Environments" (2024). video world models
  • Google DeepMind, "Genie 2: A large-scale foundation world model" (December 4, 2024). official research report
  • Google DeepMind, "Genie 3: A new frontier for world models" (August 2025). official system preview
  • Yann LeCun, LinkedIn post "Oh yeah, world models are energy-based factor graphs" (April 2026). Short note on the factor-graph symbol conventions used for observed variables, latents, factors, and forward-computable functions.


Last reviewed: April 26, 2026
