Beyond LLMs
Video World Models
Turning pretrained video diffusion models into interactive world simulators: condition on actions to generate future frames, enabling RL agent training, robot planning, and game AI without physical environments.
Why This Matters
Training RL agents (formalized as Markov decision processes) and robots in the real world is slow, expensive, and dangerous. Simulators are fast and safe but require hand-engineered physics and graphics. Video world models offer a third option: learn a simulator directly from video data, then train agents inside it.
The core idea is simple. A video generation model (often built on diffusion models) already knows how the visual world evolves over time. If you condition that model on actions (joystick inputs, motor commands), it becomes an interactive simulator that predicts what happens next given what you do.
A video world model is only useful when action control survives the rollout
Tokenization, latent rollout, and decoding are forward-computable. The hard part is keeping the future both temporally consistent and genuinely action-conditioned after several generated steps.
The amber controllability factor is what separates a usable simulator from a passive video prior. The violet temporal factor is what keeps the rollout coherent over multiple generated steps.
Diagram language
What is computable directly
The tokenizer, rollout model, and decoder are ordinary neural functions. Once the current context and action are fixed, they produce latent futures and decoded frames by forward propagation.
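A minimal sketch of this forward path, assuming three pretrained components with hypothetical interfaces (`tokenizer`, `rollout`, and `decoder` are placeholder names, not any specific system's API):

```python
import torch

def world_model_step(tokenizer, rollout, decoder, frames, action):
    """One forward-computable step: encode context, predict next latent, decode.

    Placeholder interfaces: tokenizer(frames) -> latent history,
    rollout(latents, action) -> next latent, decoder(latent) -> frame.
    """
    with torch.no_grad():
        z_context = tokenizer(frames)        # latent history, e.g. (B, T, D)
        z_next = rollout(z_context, action)  # predicted next latent, (B, D)
        frame_next = decoder(z_next)         # decoded frame, (B, C, H, W)
    return frame_next, z_next
```

Once `frames` and `action` are fixed, nothing here requires search or environment interaction; it is plain forward propagation.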
What still fails in practice
Pretty frames do not imply controllable simulation. The action edge can weaken long before the model stops looking photorealistic.
Why horizon is still fragile
Useful rollout depth is bounded by compounding transition error, not by visual quality alone. Long horizons demand both temporal coherence and persistent action-conditioning.
This diagram language follows classical factor-graph notation and Yann LeCun's April 2026 description of world models as energy-based factor graphs: filled circles for observed quantities, hollow circles for latent states, rectangles for factors, and rounded-side boxes for forward-computable functions.
Mental Model
A video world model is a learned transition function:

$$\hat{x}_{t+1} \sim p_\theta\left(x_{t+1} \mid x_{\le t},\, a_t\right)$$

where $x_{\le t}$ is the history of observed frames and $a_t$ is the action taken at time $t$. The model generates the next frame (or a latent representation of it) conditioned on past observations and the action. By chaining predictions autoregressively, you can roll out entire trajectories without interacting with the real environment.
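A minimal sketch of the autoregressive rollout loop, assuming a hypothetical `model.predict(history, action)` interface (names are placeholders, not any published API):

```python
def imagine_trajectory(model, initial_frames, policy, horizon):
    """Roll out a trajectory entirely inside the world model.

    Placeholder interfaces: model.predict(history, action) -> next frame;
    policy(history) -> action. The real environment is never touched.
    """
    history = list(initial_frames)
    trajectory = []
    for _ in range(horizon):
        action = policy(history)
        next_frame = model.predict(history, action)  # x_{t+1} ~ p(. | x_<=t, a_t)
        trajectory.append((action, next_frame))
        history.append(next_frame)  # feed the prediction back as context
    return trajectory
```

Each predicted frame becomes context for the next step; this feedback loop is exactly where the compounding-error issues discussed later originate.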
Formal Setup and Notation
Video World Model
A video world model is a generative model $p_\theta(x_{t+1} \mid x_{\le t}, a_t)$ that predicts future visual observations conditioned on past observations and actions. The model can be:
- Pixel-space: directly generates RGB frames
- Latent-space: generates compressed representations via a learned encoder/decoder pair
Training data consists of trajectories $(x_1, a_1, x_2, a_2, \ldots)$ collected from real environments or existing video datasets.
Imagination-Based Planning
Given a world model $p_\theta$, a policy $\pi$, and a reward model $\hat{r}$, imagination-based planning evaluates the policy by:

$$\hat{V}^\pi = \mathbb{E}\left[\sum_{t=0}^{H-1} \gamma^t\, \hat{r}(\hat{x}_t, a_t)\right]$$

where $a_t \sim \pi(\cdot \mid \hat{x}_{\le t})$ and $\hat{x}_{t+1} \sim p_\theta(\cdot \mid \hat{x}_{\le t}, a_t)$. All rollouts happen inside the model. No real environment interaction is needed.
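A sketch of this estimator under the same placeholder interfaces (a practical implementation would average many sampled rollouts rather than use a single one):

```python
def imagined_return(world_model, policy, reward_model, x0, horizon, gamma=0.99):
    """Single-rollout Monte Carlo estimate of the imagined return.

    Placeholder interfaces: world_model.sample(history, action) -> next obs,
    policy(history) -> action, reward_model(obs, action) -> float.
    """
    history, total, discount = [x0], 0.0, 1.0
    for _ in range(horizon):
        action = policy(history)
        total += discount * reward_model(history[-1], action)  # r(x_t, a_t)
        history.append(world_model.sample(history, action))    # x_{t+1}
        discount *= gamma
    return total
```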
Core Definitions
The fidelity of a video world model is how closely its generated trajectories match real environment dynamics. High fidelity means agents trained in the model transfer well to the real world.
The controllability of a video world model measures how reliably actions produce the intended effects in generated frames. A model with low controllability may ignore action inputs and generate plausible but uncontrollable videos.
The horizon is how many steps the model can predict before error accumulates and predictions become unreliable. Compounding errors are the primary failure mode of autoregressive rollouts.
Main Theorems
Simulation Lemma for Learned World Models
Statement
Let $M$ be the true environment and $\hat{M}$ a learned world model, with rewards bounded by $R_{\max}$. Measure model error in unnormalized total variation $\|\cdot\|_1$ (range $[0,2]$, no $\tfrac{1}{2}$ factor). If $\|P(\cdot \mid s,a) - \hat{P}(\cdot \mid s,a)\|_1 \le \epsilon$ for all $(s,a)$, then for any policy $\pi$:

$$\left|V^\pi_M - V^\pi_{\hat{M}}\right| \;\le\; \frac{\gamma\, \epsilon\, R_{\max}}{(1-\gamma)^2}$$

where $V^\pi_M$ and $V^\pi_{\hat{M}}$ are the expected returns in the true and model environments. Under the other common convention (normalized TV, range $[0,1]$), a factor of 2 appears and the bound reads $\frac{2\gamma \epsilon R_{\max}}{(1-\gamma)^2}$. The $R_{\max}$ factor is essential for dimensional correctness: without it the bound compares a value (reward units) to a probability (unitless). Kearns & Singh 2002 "Near-Optimal Reinforcement Learning in Polynomial Time" state the original form; Kakade 2003 thesis uses the unnormalized-TV variant above.
Intuition
Small per-step prediction errors compound geometrically over the planning horizon. The $\frac{1}{(1-\gamma)^2}$ factor reflects this compounding: one factor of $\frac{1}{1-\gamma}$ (the effective horizon) comes from summing discounted rewards, and another from the accumulation of distributional shift across steps. In finite-horizon notation the scaling is $O(H^2 \epsilon R_{\max})$ cumulative value error. This bound motivates keeping rollout horizons short and model accuracy high.
Proof Sketch
Telescope the difference in value functions across time steps. At each step, the one-step error contributes at most $\epsilon$ in TV distance, which translates to a reward difference of at most $\epsilon \cdot \frac{R_{\max}}{1-\gamma}$ (applying the TV-to-expectation bound with $\|V\|_\infty \le \frac{R_{\max}}{1-\gamma}$). Summing the discounted series over the geometric horizon contributes the remaining factor of $\frac{\gamma}{1-\gamma}$, giving $\frac{\gamma\, \epsilon\, R_{\max}}{(1-\gamma)^2}$ in total.
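To see why practitioners keep horizons short, it helps to plug numbers into the bound; a quick check in plain Python (values chosen for illustration):

```python
def value_gap_bound(eps, gamma, r_max=1.0):
    """Simulation-lemma bound: gamma * eps * r_max / (1 - gamma)^2."""
    return gamma * eps * r_max / (1.0 - gamma) ** 2

for gamma in (0.9, 0.99, 0.999):
    print(f"gamma={gamma}: bound={value_gap_bound(1e-3, gamma):.3f}")
# gamma=0.9:   0.090   (~1% of the max value 1/(1-gamma) = 10)
# gamma=0.99:  9.900   (~10% of the max value 100)
# gamma=0.999: 999.000 (~100% of the max value 1000: vacuous)
```

Even a per-step TV error of $10^{-3}$ makes the guarantee nearly vacuous at long effective horizons.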
Why It Matters
This bound quantifies the fundamental tradeoff in model-based RL: longer imagination rollouts give more data but accumulate more error. Practical systems like Dreamer use short rollout horizons (15-50 steps) precisely because of this compounding.
Failure Mode
The bound assumes uniform accuracy across all states and actions. In practice, the world model may be accurate in regions the agent has visited but poor in novel regions. Off-distribution states can produce arbitrarily bad predictions, making the uniform assumption unrealistic for exploration.
Key Approaches
GAIA-1: Action-Conditioned Video Diffusion for Driving
GAIA-1 (Hu et al., Wayve, 2023) conditions a large video generation model on actions and text for autonomous driving simulation. The architecture combines a discrete token world model with a video diffusion decoder, and accepts ego-vehicle actions (steering, speed) as conditioning. Key insight: a pretrained video backbone already captures visual dynamics, and fine-tuning with action-labeled driving data teaches it to respond to control inputs. Related systems include UniSim (Yang et al., 2023), which learns a universal action-conditioned simulator across robotics and driving domains, and DIAMOND (Alonso et al., 2024), which uses diffusion world models to train agents on Atari.
Genie family: from unlabeled video to playable worlds
Genie (Bruce et al., 2024) learns a world model from unlabeled internet video rather than action-labeled trajectories. It infers a latent action space from observed transitions: if the camera consistently moves left, that becomes part of the learned controllable basis. At inference time, users provide actions in this learned latent space to interact with the model.
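A schematic of the latent-action idea (an illustration of the mechanism, not Genie's actual architecture; all names and dimensions here are made up): compress each frame-to-frame transition through a small discrete bottleneck, so the codebook index plays the role of an action.

```python
import torch
import torch.nn as nn

class LatentActionBottleneck(nn.Module):
    """Infer a discrete 'action' from consecutive frame embeddings by
    vector quantization. With no action labels, the codebook index that
    best explains the transition becomes the learned latent action."""

    def __init__(self, embed_dim=256, num_actions=8):
        super().__init__()
        self.encode = nn.Linear(2 * embed_dim, embed_dim)      # transition encoder
        self.codebook = nn.Embedding(num_actions, embed_dim)   # K candidate actions

    def forward(self, z_t, z_next):
        h = self.encode(torch.cat([z_t, z_next], dim=-1))      # (B, D)
        # Nearest codebook entry; its index is the inferred latent action.
        dists = torch.cdist(h.unsqueeze(0), self.codebook.weight.unsqueeze(0))
        action_idx = dists.squeeze(0).argmin(dim=-1)           # (B,)
        return action_idx, self.codebook(action_idx)
```

Trained jointly with a dynamics model that must reconstruct $z_{t+1}$ from $(z_t, \text{action})$, the bottleneck is pushed to encode whatever small set of interventions best explains the data, which is why consistent camera motion can emerge as a learned "action".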
Genie 2 and Genie 3 extend the same ambition toward larger-scale playable 3D environments. Those later systems matter, but the current public evidence comes primarily from official research reports and demos rather than the same style of paper-backed evaluation available for Genie 1. That distinction matters when judging transfer, consistency, and planning reliability.
Dreamer: Latent-Space World Models for RL
Dreamer (V1 through V3) learns a world model in a compact latent space rather than pixel space. The Recurrent State-Space Model (RSSM) maintains a latent state $z_t$ and predicts $z_{t+1}$ given $z_t$ and $a_t$. A decoder maps latent states back to pixels for visualization. Training the policy and value function happens entirely in latent space, which is computationally cheaper than pixel-space rollouts.
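A minimal sketch of an RSSM-style transition (structure heavily simplified relative to Dreamer's actual RSSM; dimensions are illustrative):

```python
import torch
import torch.nn as nn

class RSSMStep(nn.Module):
    """One latent transition: a deterministic recurrent path plus a
    stochastic latent sampled from a learned prior over the next state."""

    def __init__(self, latent_dim=32, hidden_dim=256, action_dim=4):
        super().__init__()
        self.gru = nn.GRUCell(latent_dim + action_dim, hidden_dim)
        self.prior = nn.Linear(hidden_dim, 2 * latent_dim)  # mean, log-std

    def forward(self, z_t, a_t, h_t):
        h_next = self.gru(torch.cat([z_t, a_t], dim=-1), h_t)   # deterministic path
        mean, log_std = self.prior(h_next).chunk(2, dim=-1)
        z_next = mean + log_std.exp() * torch.randn_like(mean)  # stochastic sample
        return z_next, h_next
```

Because $z_t$ is small (tens of dimensions rather than millions of pixels), imagining thousands of trajectories through this step is cheap.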
Canonical Examples
Training an Atari agent inside a world model
Dreamer-V3 achieves human-level performance on Atari by: (1) collecting a small amount of real environment data, (2) training an RSSM world model on this data, (3) rolling out thousands of imagined trajectories inside the model, (4) training a policy using these imagined trajectories. The agent requires 10-100x fewer real environment steps than model-free methods because most training happens in imagination.
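The loop's overall shape, as a schematic skeleton (the four callables are placeholders standing in for the numbered steps above, not DreamerV3's actual API):

```python
def dreamer_style_loop(collect_real, fit_world_model, imagine, update_policy,
                       num_iterations, horizon=15):
    """Skeleton of the four-step loop: a little real data, a lot of imagination."""
    replay = []
    for _ in range(num_iterations):
        replay.extend(collect_real())        # (1) small batch of real episodes
        fit_world_model(replay)              # (2) train the world model on replay
        imagined = imagine(replay, horizon)  # (3) thousands of imagined rollouts
        update_policy(imagined)              # (4) actor-critic update in imagination
```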
Common Confusions
Video generation is not the same as a world model
Sora (OpenAI 2024) and similar video generation models produce visually realistic videos but do not accept fine-grained action inputs. Without action conditioning, they cannot serve as interactive simulators. A world model must respond to actions; a video generator only needs to produce plausible continuations. NVIDIA's Cosmos (2025) platform sits between these, packaging pretrained video foundation models for downstream action-conditioned fine-tuning.
High visual fidelity does not imply good dynamics
A video world model can produce photorealistic frames that violate physics. A ball might pass through a wall or an object might teleport. Visual quality and dynamical accuracy are separate properties. For RL training, dynamical accuracy matters far more than visual quality.
Compounding error is not just noise accumulation
Each prediction step shifts the state distribution. After $k$ steps, the model may be in a region of state space it has never seen during training. The error at step $k$ is not $k$ times the one-step error; it can be much worse because the model is extrapolating. This distributional shift is qualitatively different from additive noise.
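A toy illustration of why compounding is superlinear: even a 1% one-step dynamics error, with no noise at all, opens a gap that grows multiplicatively (plain Python; numbers are illustrative):

```python
true_a, model_a = 1.00, 1.01   # true dynamics x' = x; model has 1% one-step error

for k in (10, 50, 100):
    gap = abs(model_a**k - true_a**k)  # deviation after k autoregressive steps
    print(f"after {k} steps: gap = {gap:.2f}")
# after 10 steps:  gap = 0.10  (roughly k times the one-step error)
# after 50 steps:  gap = 0.64
# after 100 steps: gap = 1.70  (far worse than 100 * 0.01 = 1.0)
```

In a real world model the situation is worse still: the drifted states are also off-distribution, so the one-step error itself grows as the rollout leaves the training manifold.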
Exercises
Problem
The simulation lemma gives a bound of $\frac{\gamma \epsilon R_{\max}}{(1-\gamma)^2}$ for the value difference between true and model environments. Take $\gamma = 0.99$. If $R_{\max} = 1$ and you want the value difference to be at most 0.1, what per-step TV accuracy $\epsilon$ do you need?
Problem
Genie learns a latent action space from unlabeled video. What assumptions about the video data are needed for the inferred actions to be meaningful? Can you construct a dataset where the inferred latent actions are completely unrelated to any physically meaningful control?
References
Canonical:
- Ha & Schmidhuber, World Models (2018), arXiv:1803.10122
- Hafner et al., Mastering Diverse Domains through World Models (DreamerV3, 2023), arXiv:2301.04104
- Schrittwieser et al., Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero, 2020), Nature, arXiv:1911.08265
Current action-conditioned video models:
- Hu et al., GAIA-1: A Generative World Model for Autonomous Driving (Wayve, 2023), arXiv:2309.17080
- Yang et al., Learning Interactive Real-World Simulators (UniSim, 2023), arXiv:2310.06114
- Bruce et al., Genie: Generative Interactive Environments (2024), arXiv:2402.15391
- Google DeepMind, Genie 2: A large-scale foundation world model (December 4, 2024), official research report
- Google DeepMind, Genie 3: A new frontier for world models (2025), official system preview
- Alonso et al., Diffusion for World Modeling: Visual Details Matter in Atari (DIAMOND, 2024), NeurIPS, arXiv:2405.12399
- OpenAI, Video generation models as world simulators (Sora technical report, 2024)
- NVIDIA, Cosmos World Foundation Model Platform for Physical AI (2025), arXiv:2501.03575
- Yann LeCun, LinkedIn post "Oh yeah, world models are energy-based factor graphs" (April 2026). Short note on using filled circles for observed variables, hollow circles for latents, rectangles for factors, and rounded-side boxes for forward-computable functions.
Theory:
- Kearns & Singh, Near-Optimal Reinforcement Learning in Polynomial Time (2002), Machine Learning 49, original simulation lemma with explicit constants
- Kakade, On the Sample Complexity of Reinforcement Learning (2003), PhD thesis, Ch. 2 for the unnormalized-TV form of the simulation lemma used above
- Janner et al., When to Trust Your Model: Model-Based Policy Optimization (MBPO, 2019), arXiv:1906.08253, quantifies the rollout-horizon vs model-error tradeoff
Last reviewed: April 25, 2026
Canonical graph
Required before and derived from this topic
Required prerequisites
- Diffusion Models (layer 4 · tier 1)
- World Models and Planning (layer 4 · tier 2)
Derived topics
- Agentic RL and Tool Use (layer 5 · tier 2)