LLM Construction
Plan-then-Generate
Beyond strict left-to-right token generation: plan what to say before producing tokens. Outline-then-fill, hierarchical generation, and multi-token prediction. Planning reduces incoherence in long outputs and enables structural editing.
Prerequisites
- Transformer Architecture
Why This Matters
Standard autoregressive language models generate text one token at a time, left to right, with no lookahead. Each token is conditioned on everything before it and nothing after it. This is a strong architectural constraint that creates a fundamental problem: the model must commit to early tokens before knowing what will come later.
For short outputs, this is fine. For long outputs (essays, code, proofs, stories), the lack of planning leads to incoherence, contradictions, and structural problems that cannot be fixed by making the model larger. Humans do not write this way. Humans plan outlines, draft sections, revise, and restructure. Plan-then-generate methods attempt to give language models similar capabilities.
Mental Model
Think of autoregressive generation as writing a novel by starting at the first word and never looking back. You might produce locally fluent text, but the plot will wander, characters will be forgotten, and the ending will not connect to the beginning. Plan-then-generate is more like writing with an outline: first decide the structure (sections, key points, argument flow), then fill in each section with awareness of the whole plan.
The fundamental tension: autoregressive models are trained to maximize $\sum_t \log p_\theta(x_t \mid x_{<t})$, which is a local objective. Coherence is a global property. Planning bridges this gap by introducing an intermediate representation that captures global structure.
Formal Setup and Notation
Let $x = (x_1, \dots, x_T)$ be a token sequence. Standard autoregressive generation factors as $p(x \mid c) = \prod_{t=1}^{T} p(x_t \mid x_{<t}, c)$ given context $c$. Plan-then-generate introduces a latent plan $z$ and marginalizes:

$$p(x \mid c) = \sum_{z} p(z \mid c)\, p(x \mid z, c)$$

This identity is just marginalization and by itself implies nothing about generation quality. It becomes useful only when two conditions hold. First, the plan acts as a near-sufficient statistic for global structure, so $p_\theta(x \mid z, c)$ is closer to the true conditional than $p_\theta(x \mid c)$ under a fixed model class. Second, the plan space is small enough that a trained model approximates $p(z \mid c)$ well. Both conditions are empirical claims, not consequences of the factorization.
The plan $z$ can be an outline, a set of key points, a tree structure, or a sequence of future token predictions.
Planning as Latent Variable
A plan is a latent variable that captures the high-level structure of the output before token-level generation begins. The generative process is:
- Sample or construct a plan: $z \sim p_\theta(z \mid c)$
- Generate tokens conditioned on the plan: $x \sim p_\theta(x \mid z, c)$
The plan can be discrete (an outline with bullet points), continuous (a sequence of embeddings), or hierarchical (a tree of increasingly detailed specifications).
Key Approaches
Outline-then-Fill
Outline-then-fill generates a structured outline first, then fills in each section independently (or with cross-attention to other sections). The outline acts as a discrete plan that constrains the global structure.
Example: for a 5-paragraph essay, first generate 5 topic sentences, then expand each into a full paragraph conditioned on all topic sentences. This ensures the essay has coherent global structure even though each paragraph is generated locally.
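A minimal sketch of the two-stage procedure in Python. The `llm` helper below is a hypothetical stand-in for any text-completion call, not a specific API:

```python
def llm(prompt: str) -> str:
    """Hypothetical stand-in for a text-completion call; swap in a real model API."""
    raise NotImplementedError

def outline_then_fill(topic: str, n_sections: int = 5) -> str:
    # Stage 1: generate the discrete plan, one topic sentence per section.
    outline = llm(
        f"Write {n_sections} topic sentences, one per line, "
        f"outlining an essay on: {topic}"
    ).strip().splitlines()
    outline_text = "\n".join(outline)

    # Stage 2: expand each section conditioned on the *entire* outline,
    # so every paragraph is written with awareness of the global plan.
    sections = []
    for i, sentence in enumerate(outline):
        paragraph = llm(
            f"Full outline:\n{outline_text}\n\n"
            f"Expand point {i + 1} ({sentence!r}) into one full paragraph."
        )
        sections.append(paragraph.strip())
    return "\n\n".join(sections)
```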
Hierarchical Generation
Hierarchical generation produces output at multiple levels of granularity. Level 0: generate a high-level summary or skeleton. Level 1: expand each element of the skeleton into a detailed outline. Level 2: expand each outline item into full text. Each level conditions on the entire output of the previous level.
This mirrors how humans write complex documents: thesis statement, then section headers, then paragraphs, then sentences.
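The same idea extends recursively. A sketch reusing the hypothetical `llm` stand-in from the previous block, where each level rewrites the previous level's complete output at finer granularity:

```python
# Assumes the hypothetical llm(prompt) -> str stand-in defined above.

def hierarchical_generate(task: str, levels: int = 3) -> str:
    """Level 0 is a skeleton; each later level expands every element of the
    previous level while conditioning on that level's entire output."""
    current = llm(f"Write a one-paragraph skeleton (thesis and key points) for: {task}")
    for level in range(1, levels):
        current = llm(
            f"Here is a level-{level - 1} draft:\n{current}\n\n"
            "Expand every element one level of detail further, "
            "keeping the overall structure fixed."
        )
    return current
```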
Multi-Token Prediction
Multi-token prediction trains the model to predict not just the next token but the next $n$ tokens simultaneously. The training loss is:

$$\mathcal{L}_n = -\sum_{t} \sum_{k=1}^{n} \log p_{\theta_k}(x_{t+k} \mid x_{\le t})$$

where $p_{\theta_k}$ is the $k$-th prediction head. This forces the model to build internal representations that capture future context, acting as an implicit planning mechanism.
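A sketch of this loss in PyTorch, assuming a shared backbone that yields one hidden state per position and $n$ independent linear heads. The indexing follows the formula above (head $k$ predicts the token $k$ steps ahead); it is not modeled on any particular released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHeads(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, n_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_heads)]
        )

    def loss(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        """hidden: (batch, seq, d_model) from the shared backbone.
        tokens: (batch, seq) token ids; head k is supervised on x_{t+k}."""
        total = 0.0
        seq_len = tokens.size(1)
        for k, head in enumerate(self.heads, start=1):
            # Position t can supervise head k only while t + k is in range.
            logits = head(hidden[:, : seq_len - k])   # (B, T-k, V)
            targets = tokens[:, k:]                   # (B, T-k)
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return total / len(self.heads)
```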
Formal Result: CoT Expands Computational Expressivity
The strongest formal statement in this area is due to Merrill and Sabharwal (2024), who treat chain-of-thought (a visible plan) as a computational resource, not as a Bayesian smoothing trick.
Chain of Thought Expands Transformer Expressivity
Statement
A decoder-only transformer with no intermediate tokens recognizes a class contained in $\mathsf{TC}^0$ (constant-depth threshold circuits). Allowing the model to emit intermediate chain-of-thought tokens before its final answer strictly increases this class. With a linear number of intermediate tokens the recognizable class contains problems outside $\mathsf{TC}^0$ (under the standard conjecture $\mathsf{TC}^0 \neq \mathsf{NC}^1$), and with a polynomial number of intermediate tokens the class is contained in $\mathsf{P}$.
Intuition
Attention without intermediate tokens computes a fixed-depth, uniform function of the input. Intermediate tokens let the model reuse its own previous outputs as scratch space, which is a form of serial computation. The number of scratch tokens caps how much serial computation the model can perform.
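To make the serial-computation point concrete, consider the word problem for the permutation group S5, which is NC^1-complete (Barrington's theorem) and therefore conjecturally outside $\mathsf{TC}^0$. The loop below is exactly the serial scan that scratch tokens buy back:

```python
# Word problem for S5: does a sequence of permutations compose to the identity?
def compose(p, q):
    """Apply p after q; permutations are tuples mapping index -> image."""
    return tuple(p[q[i]] for i in range(5))

def s5_word_problem(seq):
    acc = tuple(range(5))  # identity permutation
    for perm in seq:
        # One serial step per input element. With CoT, the model can emit the
        # running composition as scratch tokens after each step; without
        # scratch, a constant-depth forward pass must resolve it all at once.
        acc = compose(perm, acc)
    return acc == tuple(range(5))

# A transposition applied twice is the identity:
assert s5_word_problem([(1, 0, 2, 3, 4), (1, 0, 2, 3, 4)])
```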
Proof Sketch
Upper bound: a single forward pass of a fixed-precision transformer can be simulated by a $\mathsf{TC}^0$ circuit (Merrill-Sabharwal 2023). Lower bound: with $t(n)$ intermediate tokens the model can simulate the step-by-step execution of a Turing machine for $t(n)$ steps, because each token lets the next forward pass condition on prior scratch. Together these place the CoT-augmented class strictly above $\mathsf{TC}^0$ (under the same conjecture) and inside $\mathsf{P}$ when the scratch budget is polynomial.
Why It Matters
This is the first formal statement that "thinking out loud" is not cosmetic. It makes intermediate tokens a circuit-depth resource. It gives a principled reason why reasoning models such as o1 and R1, which spend many tokens on scratch before answering, can solve problems that a single forward pass cannot.
Failure Mode
The result says nothing about whether training finds the right program. A transformer may be capable of computing something with scratch tokens without actually learning to do so from next-token data. Capacity is not reachability.
Heuristic: Plans as Global Scratchpad
The coherence argument often made for planning is not a theorem. It is a modeling heuristic that follows from attention dilution over long contexts, together with empirical observations about long-form generation quality.
Planning does not reduce coherence error by the DPI
An earlier version of this page invoked the data processing inequality to argue that conditioning on a plan lowers generation entropy. That argument does not go through. The DPI states that for a Markov chain $X \to Y \to Z$, $I(X; Z) \le I(X; Y)$. In plan-then-generate the plan is not a deterministic function of the context, nor a sufficient statistic for the target, so the Markov condition does not apply. The correct framing is that an explicit plan reduces the effective distance over which global constraints must propagate through hidden states, which is an assumption about model inductive bias, not an information-theoretic bound.
A cleaner way to state the heuristic: let $d$ be the token distance over which a coherence constraint must hold and let $r$ be the effective attention range over which the model reliably uses context. Empirically, generation quality degrades when $d \gg r$. An explicit plan tokenized near the start of the context shortens the effective distance from $d$ to a small constant, because the plan stays inside the attention window throughout generation. This is a statement about model behavior, not a provable bound.
Multi-Token Prediction: What Can Be Said
The claim "multi-token prediction improves representation quality by DPI" is not valid either, for the same reason: the hidden state is not a sufficient statistic for the future, and training optima are not jointly comparable across loss functions. What can be said is much weaker and is consistent with the empirical results in Gloeckle et al. (2024).
The $n$-token loss is

$$\mathcal{L}_n = -\sum_{t} \sum_{k=1}^{n} \log p_{\theta_k}(x_{t+k} \mid x_{\le t})$$

where each $p_{\theta_k}$ is a separate head sharing the backbone. Because the $k = 1$ head is trained jointly with heads that must predict farther-ahead targets, the shared representation receives gradient signal about $x_{t+k}$ for $k > 1$. This is a regularization effect on the backbone, not a DPI-style bound. Whether the $k = 1$ head at the joint optimum improves on the 1-token optimum is an empirical question about the optimization landscape. Gloeckle et al. (2024) report gains on code, where strict long-range dependencies (matching braces, variable references) make far-ahead targets informative. For natural language the gains are smaller and not consistent.
The right mental model is: multi-token prediction is an auxiliary task that shapes the backbone toward future-aware features. It is not a sufficiency argument, and it does not reduce coherence error by any provable amount.
Current Research Directions
Planning in language models is an active research area with several threads:
Explicit planning via search. Systems like Tree-of-Thoughts (Yao et al. 2023, arXiv:2305.10601) and Graph-of-Thoughts (Besta et al. 2023, arXiv:2308.09687) use the model itself to generate and evaluate multiple candidate plans before committing to generation. This is expensive but effective for reasoning tasks.
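Stripped to its core, this is a propose-score-select loop. A sketch under the same hypothetical `llm` stand-in used earlier (real Tree-of-Thoughts searches over partial plans with backtracking; this keeps only the skeleton):

```python
# Assumes the hypothetical llm(prompt) -> str stand-in defined above.

def best_of_k_plan(task: str, k: int = 5) -> str:
    """Propose k candidate plans, ask the model to score each, keep the best."""
    candidates = [llm(f"Propose a step-by-step plan for: {task}") for _ in range(k)]

    def score(plan: str) -> float:
        reply = llm(
            f"Task: {task}\nPlan:\n{plan}\n"
            "Rate this plan from 0 to 10. Answer with a single number."
        )
        try:
            return float(reply.strip().split()[0])
        except ValueError:
            return 0.0  # unparseable rating counts as worst

    return max(candidates, key=score)
```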
Learned planning tokens. Some approaches train the model to produce special "planning tokens" that are not part of the output but influence the hidden state. These tokens act as a learned, continuous plan that the model constructs before generating visible output.
Diffusion-based text generation. Instead of generating left-to-right, diffusion models generate all tokens simultaneously and refine them iteratively. This naturally allows global planning because all positions are updated together.
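A loose sketch of the all-positions-at-once refinement loop, in the MaskGIT style of confidence-scheduled commits; `model(ids)` returning per-position logits is a hypothetical interface, and discrete text diffusion differs in its details:

```python
import torch

@torch.no_grad()
def iterative_refine(model, length: int, steps: int = 8, mask_id: int = 0):
    """Start fully masked; each step commits the most confident predictions."""
    ids = torch.full((length,), mask_id)
    committed = torch.zeros(length, dtype=torch.bool)
    for step in range(1, steps + 1):
        logits = model(ids)                      # (length, vocab), all positions
        conf, preds = logits.softmax(-1).max(-1)
        conf[committed] = -1.0                   # never re-rank fixed positions
        n_new = int(length * step / steps) - int(committed.sum())
        if n_new > 0:
            newly = conf.topk(n_new).indices
            ids[newly] = preds[newly]
            committed[newly] = True
    return ids
```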
Revision and editing. Rather than getting the output right in one pass, allow the model to generate a draft and then revise. This decomposes planning into initial generation (fast, possibly incoherent) and revision (fixing global structure).
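A minimal draft-revise loop, again under the `llm` assumption; the critique step is where global structure gets checked:

```python
# Assumes the hypothetical llm(prompt) -> str stand-in defined above.

def draft_and_revise(task: str, rounds: int = 2) -> str:
    draft = llm(f"Write a first draft for: {task}")
    for _ in range(rounds):
        critique = llm(
            f"Task: {task}\nDraft:\n{draft}\n"
            "List global-structure problems: contradictions, missing "
            "transitions, points that do not support the thesis."
        )
        draft = llm(
            f"Task: {task}\nDraft:\n{draft}\nCritique:\n{critique}\n"
            "Rewrite the draft, fixing every issue in the critique."
        )
    return draft
```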
Common Confusions
Chain-of-thought is not the same as planning
Chain-of-thought prompting produces intermediate reasoning steps that are part of the output. Planning produces a structure that guides generation but may not appear in the final output. Chain-of-thought is a special case where the plan is exposed, but true planning can use latent representations that are never shown to the user.
Multi-token prediction is not speculative decoding
Multi-token prediction trains multiple prediction heads during training to improve representation quality. Speculative decoding uses a draft model at inference time to speed up generation. They are complementary: a model trained with multi-token prediction can also use speculative decoding for faster inference.
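To illustrate the complementarity: the sketch below uses the extra heads from the training sketch above as a built-in draft model, with the $k = 1$ head verifying the drafted tokens in one extra forward pass (self-speculative, greedy decoding, batch size 1; the `model(ids)` interface returning one logit tensor per head is hypothetical):

```python
import torch

@torch.no_grad()
def self_speculative_step(model, tokens: torch.Tensor) -> torch.Tensor:
    """One draft-and-verify step. `model(ids)` is assumed to return a list of
    n logit tensors of shape (B, T, V), head k proposing the token k ahead."""
    head_logits = model(tokens)
    draft = torch.stack(                        # (B, n): greedy proposal per head
        [lg[:, -1].argmax(-1) for lg in head_logits], dim=1
    )
    # Single verification pass: append all drafts, then check head 1's greedy
    # prediction at each new position against the next drafted token.
    extended = torch.cat([tokens, draft], dim=1)
    verify = model(extended)[0].argmax(-1)      # (B, T+n) next-token predictions
    T, n_accept = tokens.size(1), 1             # draft[:, 0] is head 1's own output
    for k in range(1, draft.size(1)):
        if not torch.equal(verify[:, T + k - 1], draft[:, k]):
            break                               # first mismatch ends acceptance
        n_accept += 1
    return draft[:, :n_accept]                  # between 1 and n tokens per pass
```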
Summary
- Autoregressive generation has no lookahead, which causes global incoherence for long outputs
- Planning introduces a latent or visible structure that guides token-level generation
- Outline-then-fill: explicit discrete plans, then conditioned generation
- Multi-token prediction: implicit planning via future-aware representations, an auxiliary task rather than a provable sufficiency argument
- Merrill and Sabharwal (2024) give the one real expressivity theorem: chain-of-thought tokens strictly expand what a fixed-precision transformer can recognize, from $\mathsf{TC}^0$ upward into $\mathsf{P}$
- Reasoning-trained models (o1, DeepSeek R1) turn CoT from a prompting trick into a trained capability
Exercises
Problem
Explain why a standard autoregressive model can produce locally fluent but globally incoherent text. Give a concrete example where planning would help.
Problem
Multi-token prediction with $n = 4$ heads produces 4 logit distributions at each position. At inference time, you can only emit tokens one at a time. How would you use the extra heads during inference, and what advantage does this provide over a model trained with $n = 1$?
Problem
Design a plan-then-generate training procedure for code generation. What should the plan contain? How would you obtain plan-code pairs for training? How would you evaluate whether planning improves over standard autoregressive generation?
References
Formal expressivity:
- Merrill and Sabharwal, "The Expressive Power of Transformers with Chain of Thought," ICLR 2024. Shows CoT tokens move decoder-only transformers from toward .
- Merrill and Sabharwal, "The Parallelism Tradeoff: Limitations of Log-Precision Transformers," TACL 2023. The upper bound used by the above.
Chain-of-thought and planning prompts:
- Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," NeurIPS 2022. Foundational CoT paper.
- Wang et al., "Self-Consistency Improves Chain of Thought Reasoning in Language Models," ICLR 2023. Sampling multiple CoT paths and majority-voting.
- Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models," NeurIPS 2023. Explicit search over partial plans.
- Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning," NeurIPS 2023. Self-revision as a planning mechanism.
Training for reasoning:
- Zelikman et al., "STaR: Bootstrapping Reasoning With Reasoning," NeurIPS 2022. Rationale-augmented fine-tuning.
- Lightman et al., "Let's Verify Step by Step," ICLR 2024. Process reward models for step-level supervision.
- OpenAI, "OpenAI o1 System Card," September 2024. First deployed reasoning model trained for long CoT.
- DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning," arXiv:2501.12948, 2025.
Multi-token prediction:
- Gloeckle et al., "Better and Faster Large Language Models via Multi-Token Prediction," ICML 2024. Auxiliary multi-head loss, strongest gains on code.
Next Topics
Natural extensions from plan-then-generate:
- Inference systems overview: how planning fits into broader LLM deployment
- Context engineering: managing context windows for effective planning
Last reviewed: April 18, 2026