RL Theory
Options and Temporal Abstraction
The options framework for hierarchical RL: temporally extended actions with initiation sets, internal policies, and termination conditions. Semi-MDPs and learning options end-to-end.
Why This Matters
Primitive actions in most MDPs operate at a single time scale: one step, one action. Complex tasks naturally decompose into subtasks that span many steps: "navigate to the door", "pick up the object", "stir the pot". The options framework (Sutton, Precup, Singh, 1999) formalizes temporally extended actions. An option is a policy that runs for multiple steps until a termination condition is met. Reasoning over options rather than primitive actions reduces the effective planning horizon and enables transfer of subtask solutions across different tasks.
Mental Model
Think of options as subroutines. A primitive action is a single instruction. An option is a function call: it takes over control, executes a sequence of primitive actions according to its internal policy, and returns control when its termination condition triggers. A policy over options is like a high-level program that calls subroutines. The agent plans at two levels: which option to invoke, and within each option, which primitive actions to take.
Formal Setup
Option
An option consists of three components:
- Initiation set $\mathcal{I} \subseteq \mathcal{S}$: the set of states where the option can be started
- Internal policy $\pi_o : \mathcal{S} \times \mathcal{A} \to [0, 1]$: a policy over primitive actions that the option follows while active
- Termination function $\beta_o : \mathcal{S} \to [0, 1]$: $\beta_o(s)$ is the probability of terminating the option upon entering state $s$
A primitive action $a$ is a special case: $\mathcal{I} = \{s : a \in \mathcal{A}(s)\}$, the policy always selects $a$, and $\beta(s) = 1$ for all $s$ (terminates after one step).
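To make the tuple concrete, here is a minimal Python sketch of an option and of a primitive action as a degenerate option. The class and function names are illustrative, not from any reference implementation:

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Hashable

State = Hashable
Action = Hashable

@dataclass(frozen=True)
class Option:
    initiation_set: FrozenSet[State]       # I: states where the option may start
    policy: Callable[[State], Action]      # pi_o: maps state -> primitive action
    termination: Callable[[State], float]  # beta_o: probability of stopping in a state

def primitive_option(a: Action, states: FrozenSet[State]) -> Option:
    """A primitive action viewed as a degenerate option: startable wherever
    the action is legal, always selects `a`, and terminates after one step."""
    return Option(
        initiation_set=states,
        policy=lambda s: a,
        termination=lambda s: 1.0,  # beta(s) = 1 for all s
    )
```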
Semi-Markov Decision Process
A semi-MDP is an MDP where actions take variable amounts of time. When the agent selects option $o$ in state $s$, the option runs for $\tau$ steps (a random variable depending on $s$ and $o$), accumulates discounted reward $R(s, o) = \mathbb{E}\big[r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{\tau-1} r_{t+\tau}\big]$, and transitions to a new state $s'$. The SMDP transition model folds the discounting over the option's duration into the model:

$$P(s' \mid s, o) = \sum_{\tau=1}^{\infty} \gamma^{\tau} \Pr(s_{t+\tau} = s',\, \tau \mid s_t = s,\, o)$$
Planning over options reduces the original MDP to an SMDP over the option set.
Policy Over Options
A policy over options $\mu : \mathcal{S} \times \mathcal{O} \to [0, 1]$ selects which option to execute in each state where the previous option has terminated. The full hierarchical execution is: at state $s$, select option $o \sim \mu(\cdot \mid s)$, execute $\pi_o$ until $\beta_o$ triggers termination, then select a new option in the resulting state. A sketch of this call-and-return loop follows below.
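The following sketch reuses the Option class above and assumes a hypothetical `env.step(state, action) -> (next_state, reward)` interface; the returned duration $\tau$ and discounted in-option reward $G$ are exactly the SMDP quantities from the previous subsection:

```python
import random

def run_option(env, option, s, gamma):
    """Execute `option` from state s until its termination condition fires.
    Returns the SMDP quantities: discounted in-option reward G, duration
    tau, and the arrival state."""
    G, discount, tau = 0.0, 1.0, 0
    while True:
        a = option.policy(s)
        s, r = env.step(s, a)      # assumed interface: (state, action) -> (next_state, reward)
        G += discount * r
        discount *= gamma
        tau += 1
        if random.random() < option.termination(s):
            return G, tau, s

def run_episode(env, mu, options, s, gamma, horizon=1000):
    """Call-and-return execution: mu picks among the options whose
    initiation set contains the current state; the chosen option runs to
    termination; then control returns to mu."""
    ret, discount, t = 0.0, 1.0, 0
    while t < horizon:
        available = [o for o in options if s in o.initiation_set]
        o = mu(s, available)       # policy over options (assumed callable)
        G, tau, s = run_option(env, o, s, gamma)
        ret += discount * G        # gamma^t * (reward discounted within the option)
        discount *= gamma ** tau
        t += tau
    return ret
```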
Core Theory
Bellman Equation for Options
Statement
The value function over options satisfies the Bellman equation:

$$V^*(s) = \max_{o \in \mathcal{O}(s)} \Big[ R(s, o) + \sum_{s'} P(s' \mid s, o)\, V^*(s') \Big]$$

where $R(s, o)$ is the expected discounted reward accumulated while executing option $o$ from state $s$, $P(s' \mid s, o)$ is the discounted multi-step transition probability defined above, and the equivalent one-step (intra-option) form is:

$$Q(s, o) = \sum_a \pi_o(a \mid s) \Big[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, U(s', o) \Big], \qquad U(s', o) = (1 - \beta_o(s'))\, Q(s', o) + \beta_o(s') \max_{o'} Q(s', o')$$

The function $U$ captures the continuation value: with probability $1 - \beta_o(s')$ the option continues (value $Q(s', o)$), and with probability $\beta_o(s')$ it terminates and a new option is chosen (value $\max_{o'} Q(s', o')$).
Intuition
This is the standard Bellman equation, but actions are replaced by options that run for variable durations. The key difference is the $U$ function, which accounts for the option possibly continuing or terminating at the next state. When all options are primitive actions ($\beta_o(s) = 1$ everywhere), $U(s', o) = \max_{o'} Q(s', o')$ and this reduces to the standard Bellman equation.
Proof Sketch
Condition on the first step of the option. The option takes action $a \sim \pi_o(\cdot \mid s)$, transitions to $s'$, and then either terminates (probability $\beta_o(s')$) or continues (probability $1 - \beta_o(s')$). If it terminates, the agent selects a new option optimally, giving value $\max_{o'} Q(s', o')$. If it continues, the option keeps running from $s'$, giving value $Q(s', o)$. Unrolling this recursion and taking the max over available options gives the stated equation.
Why It Matters
This equation enables value iteration and policy iteration over options, reducing the planning problem from reasoning over sequences of primitive actions to reasoning over temporally extended options. If good options are available, the effective planning depth is reduced from the task horizon to the number of option invocations.
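Given a tabular estimate of the discounted SMDP model, the equation turns directly into value iteration with options as the action set. A minimal NumPy sketch, with illustrative array names, assuming $R[s, o]$ and the discounted model $P[s, o, s']$ have already been computed (e.g., from rollouts like those above):

```python
import numpy as np

def smdp_value_iteration(R, P, n_sweeps=1000, tol=1e-8):
    """Tabular value iteration over options.

    R[s, o]     -- expected discounted reward while executing o from s
    P[s, o, s'] -- discounted multi-step model E[gamma^tau * 1{S_tau = s'}],
                   so no extra gamma factor appears below.
    Options unavailable in s (s not in their initiation set) can be masked
    with R[s, o] = -np.inf, as long as each state has one available option.
    """
    V = np.zeros(R.shape[0])
    for _ in range(n_sweeps):
        Q = R + P @ V              # Q[s, o] = R(s, o) + sum_s' P(s'|s,o) V(s')
        V_new = Q.max(axis=1)      # greedy over available options
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q
```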
Failure Mode
The equation assumes the options are fixed. If the options are poorly designed (e.g., an option that wanders randomly for many steps), planning over them can be worse than planning over primitive actions. The equation also does not address how to discover good options. It only tells you how to plan with given options.
Option-Critic Architecture
The Option-Critic (Bacon, Harb, Precup, 2017) learns options end-to-end using policy gradient methods. The architecture has three learned components:
- Policy over options $\pi_\Omega(o \mid s)$: selects which option to initiate
- Intra-option policies $\pi_{o,\theta}(a \mid s)$, one per option $o$: select primitive actions
- Termination functions $\beta_{o,\vartheta}(s)$, one per option $o$: decide when to stop
All three are parameterized by neural networks and trained simultaneously. The key insight is that gradients for the termination function can be derived from the advantage of continuing the current option vs. switching:

$$\frac{\partial U(s', o)}{\partial \vartheta} \propto -\frac{\partial \beta_{o,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s', o), \qquad A_\Omega(s', o) = Q_\Omega(s', o) - V_\Omega(s')$$
If the current option has higher value than the average over options ($A_\Omega(s', o) > 0$), the gradient pushes the termination probability down (keep going). If it has lower value ($A_\Omega(s', o) < 0$), it pushes the termination probability up (stop and switch).
The degenerate option problem. Without regularization, Option-Critic tends to learn trivial solutions: one option that does everything, or options that terminate immediately (reducing to primitive actions). Adding a termination penalty (small cost for switching options) or a deliberation cost encourages temporally extended, distinct options.
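A sketch of the resulting termination loss with a deliberation-cost regularizer, in PyTorch. Network and argument names are assumptions rather than the Option-Critic paper's code; minimizing this loss reproduces the advantage-based gradient above:

```python
import torch

def termination_loss(beta_net, s_next, option_idx, q_omega, deliberation_cost=0.01):
    """Option-Critic termination update as a loss (hyperparameters illustrative).

    beta_net(s_next) is assumed to return termination probabilities of shape
    [batch, n_options]; q_omega holds the critic's Q_Omega(s', .) values.
    Minimizing beta * (A + c) pushes beta(s') down where the current option's
    advantage A is positive and up where it is negative; the deliberation
    cost c > 0 biases learned options toward running longer.
    """
    with torch.no_grad():
        # Greedy V(s') = max_o' Q(s', o'); the paper uses the value under the
        # (epsilon-greedy) policy over options, which is close in practice.
        v = q_omega.max(dim=1).values
        q_o = q_omega.gather(1, option_idx.unsqueeze(1)).squeeze(1)
        advantage = q_o - v + deliberation_cost
    beta = beta_net(s_next).gather(1, option_idx.unsqueeze(1)).squeeze(1)
    return (beta * advantage).mean()
```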
Related Hierarchical RL Frameworks
The options framework is one of several approaches to hierarchical RL. A few closely related formalisms deserve mention.
MAXQ value function decomposition (Dietterich, 2000). MAXQ decomposes the value function of a target MDP into an additive combination of value functions of smaller constituent subtasks. The task hierarchy is specified by the designer as a directed acyclic graph of subtasks, each with its own termination predicate and pseudo-reward. Unlike options, MAXQ commits to a recursive value decomposition, which enables state abstraction within subtasks but requires the hierarchy to be given rather than discovered.
Hierarchies of Abstract Machines (HAM) (Parr and Russell, 1998). HAM constrains the agent's policy through a hierarchy of nondeterministic finite-state machines. Each machine has choice states where the learner picks among allowed transitions, and call states that invoke submachines. The composition of a HAM with an MDP yields an induced SMDP whose optimal policy respects the partial program. This is the most explicit "policy as program" view of hierarchical RL.
Feudal Networks (FuN) (Vezhnevets et al., 2017). FuN is a deep RL architecture with a Manager and a Worker. The Manager operates at a lower temporal resolution and emits directional subgoals in a learned latent space. The Worker is trained via an intrinsic reward that rewards moving the state representation in the direction specified by the Manager. FuN uses a transition policy gradient that treats the Manager's outputs as goal directions rather than discrete options, and it does not require initiation sets or termination functions.
Diversity Is All You Need (DIAYN) (Eysenbach et al., 2018). DIAYN is an unsupervised skill discovery method. A latent skill variable $z$ is sampled at the start of each episode, and the agent is trained to maximize the mutual information $I(z; s)$ between $z$ and the visited states while maximizing policy entropy conditional on $z$. This produces a set of distinguishable skills without any extrinsic reward, which can then serve as options or low-level policies for a downstream task. DIAYN is a concrete instance of the broader "skill discovery" line that options theory mostly leaves open.
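As an illustration, DIAYN's intrinsic reward reduces to a few lines once a skill discriminator exists. A hedged PyTorch sketch, assuming a discriminator network that maps states to logits over skills:

```python
import math
import torch
import torch.nn.functional as F

def diayn_intrinsic_reward(discriminator, states, skills, n_skills):
    """DIAYN-style intrinsic reward r = log q(z | s) - log p(z), with a
    uniform skill prior p(z) = 1/n_skills. The discriminator (an assumed
    network, trained separately with cross-entropy on (state, skill) pairs)
    maps a batch of states to logits over the n_skills skill labels."""
    log_q = F.log_softmax(discriminator(states), dim=1)        # log q(z | s)
    log_q_z = log_q.gather(1, skills.unsqueeze(1)).squeeze(1)  # pick the sampled z
    return log_q_z + math.log(n_skills)                        # minus log p(z)
```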
Why Temporal Abstraction Matters
Reduced planning depth. A task that requires 1000 primitive steps might require only 10 option invocations. Planning 10 steps ahead is tractable; planning 1000 is not.
Transfer and compositionality. A "navigate to room" option learned in one task can be reused in another task that also requires navigation. Options provide a natural unit of transfer.
Exploration. Temporally extended exploration (committing to a direction for many steps) connects to the exploration vs. exploitation trade-off and can be more effective than random single-step exploration for tasks with sparse rewards and bottleneck states.
Common Confusions
Options are not the same as macro-actions
Macro-actions are fixed sequences of primitive actions. Options are more general: they have stochastic internal policies that can react to the current state, and probabilistic termination conditions that adapt to the situation. An option for "navigate to the door" will take different actions depending on obstacles, while a macro-action would execute the same fixed sequence regardless.
Semi-MDPs do not require options
Any situation where actions take variable time can be modeled as a semi-MDP. Options are one way to construct the temporally extended actions, but semi-MDPs also arise naturally in queuing systems, inventory management, and any setting where events occur at irregular intervals.
Summary
- An option is a temporally extended action: initiation set + internal policy + termination condition
- Options reduce the original MDP to a semi-MDP over the option set
- The Bellman equation for options generalizes standard value iteration to variable-duration actions
- Option-Critic learns options end-to-end via policy gradients on the intra-option policy and termination function
- Temporal abstraction reduces planning depth, enables transfer, and improves exploration
- Without regularization, learned options tend to degenerate to trivial solutions
Exercises
Problem
Consider an MDP with states $\{s_1, s_2, s_3\}$ and an option with initiation set $\mathcal{I} = \{s_1\}$, internal policy that deterministically goes $s_1 \to s_2 \to s_3$, and termination function $\beta(s_1) = 0$, $\beta(s_2) = 0$, $\beta(s_3) = 1$. If each transition gives reward 1 and the discount factor is $\gamma$, what is the expected discounted reward $R(s_1, o)$ of executing this option starting from state $s_1$?
Problem
In the Option-Critic framework, the termination gradient is proportional to $-\frac{\partial \beta_{o,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s', o)$. Explain why a pure advantage-based termination gradient, without any regularization, leads to degenerate options. Describe two regularization approaches and their tradeoffs.
References
Canonical:
- Sutton, Precup, Singh, Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, Artificial Intelligence (1999)
- Precup, Temporal Abstraction in Reinforcement Learning, PhD Thesis, University of Massachusetts (2000)
Current:
- Bacon, Harb, Precup, The Option-Critic Architecture, AAAI (2017)
- Riemer, Liu, Tesauro, Learning Abstract Options, NeurIPS (2018)
Related hierarchical RL frameworks:
- Dietterich, Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition, Journal of Artificial Intelligence Research, 13, 227-303 (2000)
- Parr and Russell, Reinforcement Learning with Hierarchies of Machines, Advances in Neural Information Processing Systems 10, NIPS (1998)
- Vezhnevets et al., FeUdal Networks for Hierarchical Reinforcement Learning, ICML (2017), arXiv:1703.01161
- Eysenbach, Gupta, Ibarz, Levine, Diversity Is All You Need: Learning Skills without a Reward Function, ICLR (2019), arXiv:1802.06070
Further directions
- Hierarchical Actor-Critic (HAC) details
- Universal Value Function Approximators (Schaul et al. 2015) and goal-conditioned RL connection
- Bottleneck states and subgoal discovery methods
- Language-conditioned options for LLM-based agents
- Visualization of hierarchical option execution
Next Topics
- Feudal networks and goal-conditioned hierarchical RL
- Skill discovery and unsupervised option learning