RL Theory
Options and Temporal Abstraction
The options framework for hierarchical RL: temporally extended actions with initiation sets, internal policies, and termination conditions. Semi-MDPs and learning options end-to-end.
Why This Matters
Primitive actions in most MDPs operate at a single time scale: one step, one action. Complex tasks naturally decompose into subtasks that span many steps: "navigate to the door", "pick up the object", "stir the pot". The options framework (Sutton, Precup, Singh, 1999) formalizes temporally extended actions. An option is a policy that runs for multiple steps until a termination condition is met. Reasoning over options rather than primitive actions reduces the effective planning horizon and enables transfer of subtask solutions across different tasks.
Mental Model
Think of options as subroutines. A primitive action is a single instruction. An option is a function call: it takes over control, executes a sequence of primitive actions according to its internal policy, and returns control when its termination condition triggers. A policy over options is like a high-level program that calls subroutines. The agent plans at two levels: which option to invoke, and within each option, which primitive actions to take.
Formal Setup
Option
An option consists of three components:
- Initiation set $\mathcal{I} \subseteq \mathcal{S}$: the set of states where the option can be started
- Internal policy $\pi_o : \mathcal{S} \times \mathcal{A} \to [0, 1]$: a policy over primitive actions that the option follows while active
- Termination function $\beta_o : \mathcal{S} \to [0, 1]$: $\beta_o(s)$ is the probability of terminating the option upon entering state $s$
A primitive action $a$ is a special case: $\mathcal{I} = \{s : a \in \mathcal{A}(s)\}$, the policy always selects $a$, and $\beta(s) = 1$ for all $s$ (terminates after one step).
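To make the tuple concrete, here is a minimal Python sketch of an option and of a primitive action as a degenerate option. The class and function names are illustrative, not from any reference implementation:

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Hashable

State = Hashable
Action = Hashable

@dataclass(frozen=True)
class Option:
    initiation_set: FrozenSet[State]       # I: states where the option may start
    policy: Callable[[State], Action]      # pi_o: maps state -> primitive action
    termination: Callable[[State], float]  # beta_o: probability of stopping in a state

def primitive_option(a: Action, states: FrozenSet[State]) -> Option:
    """A primitive action viewed as a degenerate option: startable wherever
    the action is legal, always selects `a`, and terminates after one step."""
    return Option(
        initiation_set=states,
        policy=lambda s: a,
        termination=lambda s: 1.0,  # beta(s) = 1 for all s
    )
```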
Semi-Markov Decision Process
A semi-MDP is an MDP where actions take variable amounts of time. When the agent selects option $o$ in state $s$, the option runs for $\tau$ steps (a random variable depending on $s$ and $o$), accumulates discounted reward $R(s, o) = \mathbb{E}\big[r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{\tau-1} r_{t+\tau}\big]$, and transitions to a new state $s'$. The SMDP transition model folds the discounting over the option's duration into the model:

$$P(s' \mid s, o) = \sum_{\tau=1}^{\infty} \gamma^{\tau} \Pr(s_{t+\tau} = s',\, \tau \mid s_t = s,\, o)$$
Planning over options reduces the original MDP to an SMDP over the option set.
Policy Over Options
A policy over options $\mu : \mathcal{S} \times \mathcal{O} \to [0, 1]$ selects which option to execute in each state where the previous option has terminated. The full hierarchical execution is: at state $s$, select option $o \sim \mu(\cdot \mid s)$, execute $\pi_o$ until $\beta_o$ triggers termination, then select a new option in the resulting state. A sketch of this call-and-return loop follows below.
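The following sketch reuses the Option class above and assumes a hypothetical `env.step(state, action) -> (next_state, reward)` interface; the returned duration $\tau$ and discounted in-option reward $G$ are exactly the SMDP quantities from the previous subsection:

```python
import random

def run_option(env, option, s, gamma):
    """Execute `option` from state s until its termination condition fires.
    Returns the SMDP quantities: discounted in-option reward G, duration
    tau, and the arrival state."""
    G, discount, tau = 0.0, 1.0, 0
    while True:
        a = option.policy(s)
        s, r = env.step(s, a)      # assumed interface: (state, action) -> (next_state, reward)
        G += discount * r
        discount *= gamma
        tau += 1
        if random.random() < option.termination(s):
            return G, tau, s

def run_episode(env, mu, options, s, gamma, horizon=1000):
    """Call-and-return execution: mu picks among the options whose
    initiation set contains the current state; the chosen option runs to
    termination; then control returns to mu."""
    ret, discount, t = 0.0, 1.0, 0
    while t < horizon:
        available = [o for o in options if s in o.initiation_set]
        o = mu(s, available)       # policy over options (assumed callable)
        G, tau, s = run_option(env, o, s, gamma)
        ret += discount * G        # gamma^t * (reward discounted within the option)
        discount *= gamma ** tau
        t += tau
    return ret
```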
Core Theory
Bellman Equation for Options
Statement
The value function over options satisfies the Bellman equation:

$$V^*(s) = \max_{o \in \mathcal{O}(s)} \Big[ R(s, o) + \sum_{s'} P(s' \mid s, o)\, V^*(s') \Big]$$

where $R(s, o)$ is the expected discounted reward accumulated while executing option $o$ from state $s$, $P(s' \mid s, o)$ is the discounted multi-step transition probability defined above, and the equivalent one-step (intra-option) form is:

$$Q(s, o) = \sum_a \pi_o(a \mid s) \Big[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, U(s', o) \Big], \qquad U(s', o) = (1 - \beta_o(s'))\, Q(s', o) + \beta_o(s') \max_{o'} Q(s', o')$$

The function $U$ captures the continuation value: with probability $1 - \beta_o(s')$ the option continues (value $Q(s', o)$), and with probability $\beta_o(s')$ it terminates and a new option is chosen (value $\max_{o'} Q(s', o')$).
Intuition
This is the standard Bellman equation, but actions are replaced by options that run for variable durations. The key difference is the $U$ function, which accounts for the option possibly continuing or terminating at the next state. When all options are primitive actions ($\beta_o(s) = 1$ everywhere), $U(s', o) = \max_{o'} Q(s', o')$ and this reduces to the standard Bellman equation.
Proof Sketch
Condition on the first step of the option. The option takes action $a \sim \pi_o(\cdot \mid s)$, transitions to $s'$, and then either terminates (probability $\beta_o(s')$) or continues (probability $1 - \beta_o(s')$). If it terminates, the agent selects a new option optimally, giving value $\max_{o'} Q(s', o')$. If it continues, the option keeps running from $s'$, giving value $Q(s', o)$. Unrolling this recursion and taking the max over available options gives the stated equation.
Why It Matters
This equation enables value iteration and policy iteration over options, reducing the planning problem from reasoning over sequences of primitive actions to reasoning over temporally extended options. If good options are available, the effective planning depth is reduced from the task horizon to the number of option invocations.
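Given a tabular estimate of the discounted SMDP model, the equation turns directly into value iteration with options as the action set. A minimal NumPy sketch, with illustrative array names, assuming $R[s, o]$ and the discounted model $P[s, o, s']$ have already been computed (e.g., from rollouts like those above):

```python
import numpy as np

def smdp_value_iteration(R, P, n_sweeps=1000, tol=1e-8):
    """Tabular value iteration over options.

    R[s, o]     -- expected discounted reward while executing o from s
    P[s, o, s'] -- discounted multi-step model E[gamma^tau * 1{S_tau = s'}],
                   so no extra gamma factor appears below.
    Options unavailable in s (s not in their initiation set) can be masked
    with R[s, o] = -np.inf, as long as each state has one available option.
    """
    V = np.zeros(R.shape[0])
    for _ in range(n_sweeps):
        Q = R + P @ V              # Q[s, o] = R(s, o) + sum_s' P(s'|s,o) V(s')
        V_new = Q.max(axis=1)      # greedy over available options
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q
```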
Failure Mode
The equation assumes the options are fixed. If the options are poorly designed (e.g., an option that wanders randomly for many steps), planning over them can be worse than planning over primitive actions. The equation also does not address how to discover good options. It only tells you how to plan with given options.
Option-Critic Architecture
The Option-Critic (Bacon, Harb, Precup, 2017) learns options end-to-end using policy gradient methods. The architecture has three learned components:
- Policy over options $\pi_\Omega(o \mid s)$: selects which option to initiate
- Intra-option policies $\pi_{o,\theta}(a \mid s)$, one per option $o$: select primitive actions
- Termination functions $\beta_{o,\vartheta}(s)$, one per option $o$: decide when to stop
All three are parameterized by neural networks and trained simultaneously. The key insight is that gradients for the termination function can be derived from the advantage of continuing the current option vs. switching:

$$\frac{\partial U(s', o)}{\partial \vartheta} \propto -\frac{\partial \beta_{o,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s', o), \qquad A_\Omega(s', o) = Q_\Omega(s', o) - V_\Omega(s')$$
If the current option has higher value than the average over options ($A_\Omega(s', o) > 0$), the gradient pushes the termination probability down (keep going). If it has lower value ($A_\Omega(s', o) < 0$), it pushes the termination probability up (stop and switch).
The degenerate option problem. Without regularization, Option-Critic tends to learn trivial solutions: one option that does everything, or options that terminate immediately (reducing to primitive actions). Adding a termination penalty (small cost for switching options) or a deliberation cost encourages temporally extended, distinct options.
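A sketch of the resulting termination loss with a deliberation-cost regularizer, in PyTorch. Network and argument names are assumptions rather than the Option-Critic paper's code; minimizing this loss reproduces the advantage-based gradient above:

```python
import torch

def termination_loss(beta_net, s_next, option_idx, q_omega, deliberation_cost=0.01):
    """Option-Critic termination update as a loss (hyperparameters illustrative).

    beta_net(s_next) is assumed to return termination probabilities of shape
    [batch, n_options]; q_omega holds the critic's Q_Omega(s', .) values.
    Minimizing beta * (A + c) pushes beta(s') down where the current option's
    advantage A is positive and up where it is negative; the deliberation
    cost c > 0 biases learned options toward running longer.
    """
    with torch.no_grad():
        # Greedy V(s') = max_o' Q(s', o'); the paper uses the value under the
        # (epsilon-greedy) policy over options, which is close in practice.
        v = q_omega.max(dim=1).values
        q_o = q_omega.gather(1, option_idx.unsqueeze(1)).squeeze(1)
        advantage = q_o - v + deliberation_cost
    beta = beta_net(s_next).gather(1, option_idx.unsqueeze(1)).squeeze(1)
    return (beta * advantage).mean()
```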
Related Hierarchical RL Frameworks
The options framework is one of several approaches to hierarchical RL. A few closely related formalisms deserve mention.
MAXQ value function decomposition (Dietterich, 2000). MAXQ decomposes the value function of a target MDP into an additive combination of value functions of smaller constituent subtasks. The task hierarchy is specified by the designer as a directed acyclic graph of subtasks, each with its own termination predicate and pseudo-reward. Unlike options, MAXQ commits to a recursive value decomposition, which enables state abstraction within subtasks but requires the hierarchy to be given rather than discovered.
Hierarchies of Abstract Machines (HAM) (Parr and Russell, 1998). HAM constrains the agent's policy through a hierarchy of nondeterministic finite-state machines. Each machine has choice states where the learner picks among allowed transitions, and call states that invoke submachines. The composition of a HAM with an MDP yields an induced SMDP whose optimal policy respects the partial program. This is the most explicit "policy as program" view of hierarchical RL.
Feudal Networks (FuN) (Vezhnevets et al., 2017). FuN is a deep RL architecture with a Manager and a Worker. The Manager operates at a lower temporal resolution and emits directional subgoals in a learned latent space. The Worker is trained via an intrinsic reward that rewards moving the state representation in the direction specified by the Manager. FuN uses a transition policy gradient that treats the Manager's outputs as goal directions rather than discrete options, and it does not require initiation sets or termination functions.
Diversity Is All You Need (DIAYN) (Eysenbach et al., 2018). DIAYN is an unsupervised skill discovery method. A latent skill variable $z$ is sampled at the start of each episode, and the agent is trained to maximize the mutual information $I(z; s)$ between $z$ and the visited states while maximizing policy entropy conditional on $z$. This produces a set of distinguishable skills without any extrinsic reward, which can then serve as options or low-level policies for a downstream task. DIAYN is a concrete instance of the broader "skill discovery" line that options theory mostly leaves open.
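As an illustration, DIAYN's intrinsic reward reduces to a few lines once a skill discriminator exists. A hedged PyTorch sketch, assuming a discriminator network that maps states to logits over skills:

```python
import math
import torch
import torch.nn.functional as F

def diayn_intrinsic_reward(discriminator, states, skills, n_skills):
    """DIAYN-style intrinsic reward r = log q(z | s) - log p(z), with a
    uniform skill prior p(z) = 1/n_skills. The discriminator (an assumed
    network, trained separately with cross-entropy on (state, skill) pairs)
    maps a batch of states to logits over the n_skills skill labels."""
    log_q = F.log_softmax(discriminator(states), dim=1)        # log q(z | s)
    log_q_z = log_q.gather(1, skills.unsqueeze(1)).squeeze(1)  # pick the sampled z
    return log_q_z + math.log(n_skills)                        # minus log p(z)
```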
Why Temporal Abstraction Matters
Reduced planning depth. A task that requires 1000 primitive steps might require only 10 option invocations. Planning 10 steps ahead is tractable; planning 1000 is not.
Transfer and compositionality. A "navigate to room" option learned in one task can be reused in another task that also requires navigation. Options provide a natural unit of transfer.
Exploration. Temporally extended exploration (committing to a direction for many steps) connects to the exploration vs. exploitation trade-off and can be more effective than random single-step exploration for tasks with sparse rewards and bottleneck states.
Common Confusions
Options are not the same as macro-actions
Macro-actions are fixed sequences of primitive actions. Options are more general: they have stochastic internal policies that can react to the current state, and probabilistic termination conditions that adapt to the situation. An option for "navigate to the door" will take different actions depending on obstacles, while a macro-action would execute the same fixed sequence regardless.
Semi-MDPs do not require options
Any situation where actions take variable time can be modeled as a semi-MDP. Options are one way to construct the temporally extended actions, but semi-MDPs also arise naturally in queuing systems, inventory management, and any setting where events occur at irregular intervals.
Summary
- An option is a temporally extended action: initiation set + internal policy + termination condition
- Options reduce the original MDP to a semi-MDP over the option set
- The Bellman equation for options generalizes standard value iteration to variable-duration actions
- Option-Critic learns options end-to-end via policy gradients on the intra-option policy and termination function
- Temporal abstraction reduces planning depth, enables transfer, and improves exploration
- Without regularization, learned options tend to degenerate to trivial solutions
Exercises
Problem
Consider an MDP with states $\{s_1, s_2, s_3\}$ and an option with initiation set $\mathcal{I} = \{s_1\}$, internal policy that deterministically goes $s_1 \to s_2 \to s_3$, and termination function $\beta(s_1) = 0$, $\beta(s_2) = 0$, $\beta(s_3) = 1$. If each transition gives reward 1 and the discount factor is $\gamma$, what is the expected discounted reward $R(s_1, o)$ of executing this option starting from state $s_1$?
Problem
In the Option-Critic framework, the termination gradient is proportional to $-\frac{\partial \beta_{o,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s', o)$. Explain why a pure advantage-based termination gradient, without any regularization, leads to degenerate options. Describe two regularization approaches and their tradeoffs.
References
Canonical:
- Sutton, Precup, Singh, Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, Artificial Intelligence (1999)
- Precup, Temporal Abstraction in Reinforcement Learning, PhD Thesis, University of Massachusetts (2000)
Current:
- Bacon, Harb, Precup, The Option-Critic Architecture, AAAI (2017)
- Riemer, Liu, Tesauro, Learning Abstract Options, NeurIPS (2018)
Related hierarchical RL frameworks:
- Dietterich, Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition, Journal of Artificial Intelligence Research, 13, 227-303 (2000)
- Parr and Russell, Reinforcement Learning with Hierarchies of Machines, Advances in Neural Information Processing Systems 10, NIPS (1998)
- Vezhnevets et al., FeUdal Networks for Hierarchical Reinforcement Learning, ICML (2017), arXiv:1703.01161
- Eysenbach, Gupta, Ibarz, Levine, Diversity Is All You Need: Learning Skills without a Reward Function, ICLR (2019), arXiv:1802.06070
Further directions
- Hierarchical Actor-Critic (HAC) details
- Universal Value Function Approximators (Schaul et al. 2015) and goal-conditioned RL connection
- Bottleneck states and subgoal discovery methods
- Language-conditioned options for LLM-based agents
- Visualization of hierarchical option execution
Next Topics
- Feudal networks and goal-conditioned hierarchical RL
- Skill discovery and unsupervised option learning