RL Theory
Actor-Critic Methods
The dominant paradigm for deep RL and LLM training: an actor (policy network) guided by a critic (value network), with advantage estimation, PPO clipping, and entropy regularization.
Why This Matters
REINFORCE has unbiased gradients but impractical variance. Q-learning has low variance but cannot handle continuous actions and suffers from the deadly triad. Actor-critic methods combine the best of both: a critic (value function) reduces variance, while an actor (policy network) handles continuous and high-dimensional action spaces.
PPO, the most widely used actor-critic algorithm, trains the RLHF stage of ChatGPT, Claude, and virtually every instruction-tuned language model. SAC is the workhorse for continuous robotic control. Understanding actor-critic is understanding how modern AI systems are trained.
Mental Model
The actor proposes actions. The critic evaluates them. The actor improves by increasing the probability of actions the critic says are better than average (positive advantage) and decreasing the probability of actions that are worse than average (negative advantage). The critic improves by getting better at predicting how much total reward a state will yield.
This is policy iteration made differentiable: the critic does approximate policy evaluation, the actor does approximate policy improvement.
Formal Setup and Notation
We work in the MDP framework with a parameterized policy $\pi_\theta$ (the actor) and a learned value function $V_\phi$ (the critic). The advantage function is $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$.
TD Error
The temporal difference (TD) error at time $t$ is:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

This is a one-step estimate of the advantage: $\mathbb{E}_\pi[\delta_t \mid s_t, a_t] = A^\pi(s_t, a_t)$ when $V = V^\pi$.
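The TD error is a one-liner in code. A minimal sketch, with illustrative reward and value numbers rather than data from any real environment:

```python
def td_error(r, v_s, v_next, gamma=0.99):
    """One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return r + gamma * v_next - v_s

# A positive delta means the transition went better than the critic predicted,
# so the action that produced it should be reinforced.
delta = td_error(r=1.0, v_s=2.0, v_next=1.5)
print(delta)  # approximately 0.485
```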
Generalized Advantage Estimation
Generalized Advantage Estimation (GAE) is an exponentially weighted sum of multi-step TD errors:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}$$

where $\lambda \in [0, 1]$ controls the bias-variance tradeoff:
- $\lambda = 0$: one-step TD advantage (low variance, high bias)
- $\lambda = 1$: Monte Carlo advantage (high variance, low bias)
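In practice GAE is computed with the standard backward recursion $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$. A minimal sketch with made-up rewards and values:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}.

    `values` has length len(rewards) + 1: the final entry bootstraps the tail.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.0]  # V(s_0)..V(s_3), illustrative numbers
print(gae(rewards, values))            # default lambda = 0.95
print(gae(rewards, values, lam=0.0))   # reduces to the raw one-step TD errors
print(gae(rewards, values, lam=1.0))   # reduces to Monte Carlo advantages
```

Setting `lam=0.0` recovers exactly the per-step $\delta_t$ values, and `lam=1.0` recovers the Monte Carlo return minus the baseline, matching the two limits above.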
Main Theorems
GAE Bias-Variance Tradeoff
Statement
Define the $k$-step advantage estimator:

$$\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l r_{t+l} + \gamma^k V(s_{t+k}) - V(s_t)$$

The GAE estimator is the $\lambda$-weighted average:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = (1 - \lambda) \sum_{k=1}^{\infty} \lambda^{k-1} \hat{A}_t^{(k)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}$$

When $V = V^\pi$ (perfect critic), all estimators are unbiased. When $V \neq V^\pi$, the bias of the $k$-step estimator is $\gamma^k \, \mathbb{E}\big[V(s_{t+k}) - V^\pi(s_{t+k})\big]$, which decays as $k$ grows. Longer rollouts reduce bias from an imperfect critic, but variance grows with $k$ because more stochastic rewards enter the estimate. The parameter $\lambda$ interpolates smoothly between these extremes.
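The equality of the two forms can be verified numerically on a finite rollout. On a length-$N$ rollout the truncated identity is $\sum_{l=0}^{N-1} (\gamma\lambda)^l \delta_l = (1-\lambda)\sum_{k=1}^{N-1} \lambda^{k-1}\hat{A}^{(k)} + \lambda^{N-1}\hat{A}^{(N)}$ (the final $k$-step term absorbs the tail weight). All numbers below are illustrative:

```python
gamma, lam = 0.99, 0.95
rewards = [1.0, -0.5, 2.0, 0.3]
values = [0.2, 0.7, -0.1, 0.4, 0.6]  # V(s_0)..V(s_4), illustrative
N = len(rewards)

# Form 1: discounted sum of TD errors
deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(N)]
gae_from_deltas = sum((gamma * lam) ** l * deltas[l] for l in range(N))

def k_step_adv(k):
    """A-hat^(k) from s_0: k discounted rewards plus a bootstrapped tail."""
    return sum(gamma**l * rewards[l] for l in range(k)) \
        + gamma**k * values[k] - values[0]

# Form 2: lambda-weighted average of k-step estimators (truncated at N)
gae_from_kstep = (1 - lam) * sum(lam ** (k - 1) * k_step_adv(k)
                                 for k in range(1, N)) \
    + lam ** (N - 1) * k_step_adv(N)

print(gae_from_deltas, gae_from_kstep)  # the two forms agree
```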
Intuition
A perfect critic gives zero-variance advantage estimates via one-step TD. An imperfect critic introduces bias, which you can reduce by relying more on actual observed rewards (larger $\lambda$). GAE provides a smooth dial between trusting the critic ($\lambda = 0$) and trusting the data ($\lambda = 1$). In practice, $\lambda = 0.95$ is a common default.
Why It Matters
Without GAE, you must choose between one-step TD (biased but stable) and Monte Carlo returns (unbiased but noisy). GAE gives a principled middle ground and is used in every modern policy gradient implementation, including PPO.
A2C and A3C
A2C (Advantage Actor-Critic) runs multiple parallel environments synchronously:
- Collect $n$-step rollouts from parallel environments
- Compute GAE advantages $\hat{A}_t$ for each rollout
- Update the actor: $\theta \leftarrow \theta + \alpha_\theta \, \hat{A}_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
- Update the critic: minimize $\big(\hat{R}_t - V_\phi(s_t)\big)^2$, where $\hat{R}_t = \hat{A}_t + V_\phi(s_t)$ is the return target
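A single-transition sketch of these two updates, for a tabular softmax actor over one state and a scalar critic. The advantage and TD target are assumed numbers, not computed from an environment:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

logits = np.zeros(3)   # actor parameters for one state, 3 actions
v = 0.0                # critic estimate V(s) for that state
lr_actor, lr_critic = 0.1, 0.1

a, adv, td_target = 1, 0.8, 1.2  # sampled action, advantage, return target

# Actor: ascend adv * grad log pi(a|s); for softmax logits the gradient
# of log pi(a) with respect to the logits is one_hot(a) - pi
pi = softmax(logits)
logits += lr_actor * adv * (np.eye(3)[a] - pi)

# Critic: one SGD step on (td_target - v)^2, i.e. move V toward the target
v += lr_critic * (td_target - v)

print(softmax(logits))  # probability of action 1 has increased
print(v)                # V has moved a tenth of the way toward 1.2
```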
A3C (Asynchronous Advantage Actor-Critic), introduced by Mnih et al. (2016, ICML, arXiv:1602.01783), uses asynchronous CPU workers that each maintain a local copy of the parameters and apply Hogwild-style updates to a shared parameter server. A3C was historically important as the first demonstration that on-policy actor-critic could match DQN on Atari without a replay buffer. It has been largely replaced by synchronous A2C and PPO, which are simpler to reason about and equally effective on GPU hardware. Clemente, Castejón, and Chandra (2017, arXiv:1708.05144) showed empirically that synchronous A2C matches or exceeds A3C when the same compute is used.
PPO: The Practical Standard
PPO Clipped Surrogate Objective
Statement
Define the probability ratio $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$. The PPO-Clip objective is:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \text{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big]$$

where $\epsilon$ (typically 0.1 to 0.2) limits how far the new policy can deviate from the old one in a single update.

Effect of the clipping:
- When $\hat{A}_t > 0$ (good action): the objective is capped at $(1+\epsilon)\,\hat{A}_t$, capping the benefit of increasing the action probability
- When $\hat{A}_t < 0$ (bad action): the objective is floored at $(1-\epsilon)\,\hat{A}_t$, capping the reward for decreasing the action probability
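The effect of the clip can be checked directly. A minimal per-sample sketch, using the common $\epsilon = 0.2$:

```python
def ppo_clip(ratio, adv, eps=0.2):
    """Per-sample PPO-Clip surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * adv, clipped_ratio * adv)

print(ppo_clip(1.5, 1.0))    # capped at (1+eps)*adv = 1.2, not 1.5
print(ppo_clip(0.5, -1.0))   # floored at (1-eps)*adv = -0.8, not -0.5
print(ppo_clip(0.5, 1.0))    # unclipped: gradient still flows to recover
print(ppo_clip(1.5, -1.0))   # unclipped: gradient pushes the ratio back down
```

The last two calls illustrate the asymmetry discussed later: the clip caps commitment, never recovery.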
Intuition
TRPO solves a constrained optimization problem (maximize surrogate subject to KL constraint), which requires second-order methods. PPO approximates this with a simple clipping trick: if the policy ratio moves too far from 1, the gradient is zeroed out, preventing destructively large updates. The result is a first-order algorithm that is nearly as stable as TRPO and much simpler to implement.
Why It Matters
PPO is the algorithm used in RLHF for training instruction-following language models. Its simplicity (no KL constraint optimization, no conjugate gradients) makes it the default choice whenever policy optimization is needed. The clipping mechanism is the key innovation. It provides a soft trust region without the computational overhead of TRPO.
The full PPO loss combines three terms:

$$L(\theta) = \mathbb{E}_t\Big[L^{\text{CLIP}}_t(\theta) - c_1\, L^{VF}_t(\theta) + c_2\, S[\pi_\theta](s_t)\Big]$$

where $L^{VF}_t = \big(V_\theta(s_t) - \hat{R}_t\big)^2$ is the value function loss and $S[\pi_\theta](s_t)$ is an entropy bonus encouraging exploration.
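Assembled per-sample, the three terms look like this. The coefficients follow the common defaults $c_1 = 0.5$, $c_2 = 0.01$, and all inputs are illustrative numbers:

```python
import math

def ppo_loss(ratio, adv, v_pred, v_target, probs, eps=0.2, c1=0.5, c2=0.01):
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    l_clip = min(ratio * adv, clipped_ratio * adv)   # surrogate (maximized)
    l_vf = (v_pred - v_target) ** 2                  # value loss (minimized)
    entropy = -sum(p * math.log(p) for p in probs)   # exploration bonus
    return -(l_clip - c1 * l_vf + c2 * entropy)      # negate: loss to minimize

loss = ppo_loss(ratio=1.1, adv=0.5, v_pred=0.9, v_target=1.2,
                probs=[0.25, 0.25, 0.25, 0.25])
print(loss)  # roughly -0.519: dominated here by the surrogate term
```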
SAC: Maximum Entropy RL
Soft Bellman Equation
Statement
In maximum entropy RL, the objective augments the standard return with an entropy term:

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\Big[r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big]$$

The corresponding soft Bellman equation is:

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\big[V(s_{t+1})\big], \qquad V(s_t) = \mathbb{E}_{a_t \sim \pi}\big[Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t)\big]$$

The optimal policy is the Boltzmann distribution: $\pi^*(a \mid s) \propto \exp\big(Q^*(s, a)/\alpha\big)$.
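The soft value and the Boltzmann policy are both small computations. A sketch with toy Q-values, showing how the temperature interpolates between uniform and greedy behavior:

```python
import numpy as np

def soft_value(q, alpha):
    """Soft state value: V(s) = alpha * log sum_a exp(Q(s,a)/alpha)."""
    return alpha * np.log(np.sum(np.exp(q / alpha)))

def boltzmann(q, alpha):
    """Max-entropy optimal policy: pi(a|s) proportional to exp(Q(s,a)/alpha)."""
    z = np.exp((q - q.max()) / alpha)   # subtract max for numerical stability
    return z / z.sum()

q = np.array([1.0, 2.0, 1.9])           # toy Q-values for one state
print(boltzmann(q, alpha=10.0))         # high temperature: near-uniform
print(boltzmann(q, alpha=0.01))         # low temperature: near-greedy on a=1
print(soft_value(q, alpha=0.01))        # approaches max Q = 2.0 as alpha -> 0
```

Note that the soft value is always at least $\max_a Q(s,a)$ and approaches it from above as $\alpha \to 0$.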
Intuition
SAC adds a reward for being uncertain. The agent is encouraged to find all good actions, not just the single best one. This leads to better exploration and more robust policies. The temperature $\alpha$ controls how much randomness the policy retains: high $\alpha$ means more exploration; as $\alpha \to 0$, the policy approaches the standard (deterministic) optimal policy.
Why It Matters
SAC achieves strong performance on continuous control benchmarks (robotic locomotion, manipulation). Its key advantages: (1) it is off-policy (uses a replay buffer, so it is sample-efficient), (2) the entropy bonus provides automatic exploration, and (3) it avoids the brittle hyperparameter tuning of epsilon-greedy or noise-based exploration.
SAC maintains three networks: a policy (actor), two Q-functions (critics, using the double-Q trick to prevent overestimation), and optionally learns the temperature $\alpha$ automatically by targeting a desired entropy level.
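A sketch of how those pieces combine in the critic's TD target. The Q-values, log-probability, and $\alpha$ below are assumed numbers, not from a trained agent:

```python
def sac_target(r, q1_next, q2_next, logp_next, alpha=0.2, gamma=0.99):
    """Clipped double-Q target: take the min of two critics to counteract
    overestimation, then add the entropy bonus -alpha * log pi."""
    v_next = min(q1_next, q2_next) - alpha * logp_next
    return r + gamma * v_next

# min(3.2, 2.9) = 2.9 is used; the entropy term adds alpha * 1.1 on top
y = sac_target(r=1.0, q1_next=3.2, q2_next=2.9, logp_next=-1.1)
print(y)
```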
Why Actor-Critic Dominates
| Setting | Algorithm | Reason |
|---|---|---|
| LLM training (RLHF) | PPO | Discrete tokens, on-policy stability |
| Robotic control | SAC | Continuous actions, sample efficiency |
| Game playing | A2C / PPO | Parallel environments, scalability |
| Fine-tuning with verifiers | GRPO / PPO | Verifier reward; GRPO swaps the learned critic for a group-relative baseline |
The actor-critic framework is flexible enough to accommodate all these settings because it separates the policy representation (actor) from the value estimation (critic). You can swap architectures, loss functions, and training procedures while keeping the same fundamental structure.
Key Practical Defaults
The following hyperparameter defaults appear in the original papers and remain the starting point for practitioner code. They are not universal, but they are the numbers most re-implementations begin from.
| Parameter | Default | Source |
|---|---|---|
| GAE $\lambda$ | 0.95 | Schulman et al. (2016) |
| Discount $\gamma$ | 0.99 | Schulman et al. (2016, 2017) |
| PPO clip $\epsilon$ | 0.2 | Schulman et al. (2017, Atari setup) |
| PPO epochs per rollout | 3 to 10 | Schulman et al. (2017) |
| Value coefficient $c_1$ | 0.5 to 1.0 | Schulman et al. (2017) |
| Entropy coefficient $c_2$ | 0.0 to 0.01 | Schulman et al. (2017) |
| Gradient clip norm | 0.5 | Andrychowicz et al. (2021) |
| SAC target entropy | $-\dim(\mathcal{A})$ | Haarnoja et al. (2018, v2) |
| SAC Polyak $\tau$ | 0.005 | Haarnoja et al. (2018) |
Andrychowicz et al. (2021, arXiv:2006.05990) tabulate sensitivity: advantage normalization is high-impact, the learning-rate schedule is medium-impact, and several hyperparameters often treated as load-bearing in practitioner folklore (orthogonal init scale, value-loss clip range) have minor effect once the others are tuned.
Common Confusions
The critic is not the reward model
In RLHF, the reward model scores outputs. The critic estimates the expected cumulative reward from a state under the current policy. The reward model is fixed after training; the critic is updated continuously during RL training. They serve different roles: the reward model defines the objective, the critic helps optimize it efficiently.
PPO is not the same as TRPO
TRPO solves a constrained optimization problem using conjugate gradients and line search. PPO uses a clipped surrogate objective with standard SGD. PPO does not enforce a hard KL constraint. The clipping is a soft approximation. In practice, PPO can violate the implicit trust region, which is both a feature (faster progress) and a risk (instability).
Actor-critic is not just REINFORCE with a baseline
REINFORCE with a baseline uses a learned value function to reduce variance but still uses Monte Carlo returns for the policy gradient. Actor-critic uses the critic for bootstrapping (TD estimates), which introduces bias but dramatically reduces variance. The distinction is whether the critic is used only as a baseline or also as a bootstrap target.
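The distinction is easy to see in code. With the same made-up trajectory and the same learned values, the two advantage estimates differ:

```python
gamma = 0.99
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.0]   # learned V(s_0)..V(s_3), illustrative

# REINFORCE with baseline: full Monte Carlo return; V is only subtracted,
# never used inside the target -> unbiased but high variance
g0 = sum(gamma**t * rewards[t] for t in range(3))
adv_baseline = g0 - values[0]

# Actor-critic: V also appears inside the target (bootstrapping)
# -> biased if V is wrong, but much lower variance
adv_bootstrap = rewards[0] + gamma * values[1] - values[0]

print(adv_baseline)   # depends on every reward in the episode
print(adv_bootstrap)  # depends only on r_0 and two value estimates
```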
PPO's empirical performance is partly from implementation tricks, not the clipping theorem
The PPO-Clip objective is the theoretical contribution. Practitioner PPO includes extras that the original paper states briefly or not at all: advantage normalization per minibatch (subtract mean, divide by std), value function clipping, reward scaling, orthogonal weight initialization, learning-rate annealing, and gradient clipping at 0.5. Engstrom et al. (2020, ICLR, arXiv:2005.12729), "Implementation Matters in Deep Policy Gradients," removed these code-level optimizations and found that most of PPO's reported improvement over TRPO disappeared. Andrychowicz et al. (2021, ICLR, arXiv:2006.05990), "What Matters for On-Policy Deep Actor-Critic Methods?", ran 250k+ training runs and found that advantage normalization and the specific value-loss clipping choice were among the most load-bearing hyperparameters. Takeaway: if you reimplement PPO from the clipping equation alone, you will not reproduce the benchmark numbers. The details are not cosmetic.
PPO clipping caps commitment, not recovery
The min-operator in PPO-Clip is a pessimistic bound. It zeros the gradient only when the policy has already moved in the correct direction for the current advantage. The four cases:
- $\hat{A}_t > 0$ and $r_t > 1 + \epsilon$: clipped, gradient zero. The good action is already more likely than under the old policy, so stop reinforcing it.
- $\hat{A}_t > 0$ and $r_t < 1 - \epsilon$: unclipped, gradient flows. The good action has been made less likely; recover it.
- $\hat{A}_t < 0$ and $r_t < 1 - \epsilon$: clipped, gradient zero. The bad action is already less likely; stop suppressing.
- $\hat{A}_t < 0$ and $r_t > 1 + \epsilon$: unclipped, gradient flows. The bad action has been made more likely; push it back down.

The clip caps rewards for actions you have already committed to, but never caps corrective gradients when the policy is drifting the wrong way. Schulman et al. (2017, arXiv:1707.06347, Section 3) state this as a "pessimistic bound on the unclipped objective." Readers who memorize "clip at $1 \pm \epsilon$" often collapse the four cases into one and lose the recovery property.
Summary
- Actor-critic = policy network (actor) + value network (critic)
- Advantage centers the policy gradient, reducing variance
- GAE: exponentially weighted multi-step TD errors; $\lambda$ controls bias-variance
- PPO: clipped surrogate objective, simple first-order trust region approximation
- SAC: maximum entropy RL, adds entropy bonus for exploration, off-policy
- Actor-critic is the dominant paradigm: PPO for LLMs, SAC for robotics
Exercises
Problem
Compute the GAE advantage $\hat{A}_0$ for a 3-step trajectory with rewards $r_0, r_1, r_2$ and value estimates $V(s_0), \ldots, V(s_3)$, expressing the answer in terms of $\gamma$ and $\lambda$.
Problem
In the PPO clipped objective with clip parameter $\epsilon$, if $r_t(\theta) > 1 + \epsilon$ and $\hat{A}_t > 0$, what is the clipped objective value? What gradient signal does the actor receive?
Problem
In the PPO clipped objective with clip parameter $\epsilon$, you observe $\hat{A}_t < 0$ and $r_t(\theta) > 1 + \epsilon$. Which term of the $\min$ is selected, and in which direction does the gradient push $\pi_\theta(a_t \mid s_t)$? Compare to the case $\hat{A}_t < 0$ and $r_t(\theta) < 1 - \epsilon$.
Problem
Why does SAC use two Q-networks and take the minimum of their predictions? Relate this to the overestimation problem in Q-learning.
References
Canonical:
- Konda & Tsitsiklis, "Actor-Critic Algorithms" (NeurIPS 2000). First rigorous convergence analysis of two-timescale actor-critic.
- Sutton & Barto, Reinforcement Learning: An Introduction, 2nd ed., Ch. 13 "Policy Gradient Methods" (2018). Canonical textbook treatment of REINFORCE, actor-critic, and the policy gradient theorem.
- Schulman, Moritz, Levine, Jordan, Abbeel, "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (ICLR 2016, arXiv:1506.02438). Introduces GAE.
- Mnih et al., "Asynchronous Methods for Deep Reinforcement Learning" (ICML 2016, arXiv:1602.01783). A3C and the synchronous A2C variant.
Current:
- Schulman, Wolski, Dhariwal, Radford, Klimov, "Proximal Policy Optimization Algorithms" (2017, arXiv:1707.06347). PPO.
- Haarnoja, Zhou, Abbeel, Levine, "Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL" (ICML 2018, arXiv:1801.01290); with v2 update "Soft Actor-Critic Algorithms and Applications" (2018, arXiv:1812.05905). SAC and automatic entropy tuning.
- Engstrom, Ilyas, Santurkar, Tsipras, Janoos, Rudolph, Madry, "Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO" (ICLR 2020, arXiv:2005.12729). Shows most of PPO's empirical advantage comes from code-level optimizations, not the clipping theorem.
- Andrychowicz, Raichuk, Stańczyk, Orsini, Girgin, Marinier, Hussenot, Geist, Pietquin, Michalski, Gelly, Bachem, "What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study" (ICLR 2021, arXiv:2006.05990). 250k+ training runs isolating each hyperparameter's effect.
- Ouyang et al., "Training Language Models to Follow Instructions with Human Feedback" (2022, arXiv:2203.02155). RLHF with PPO.
- Clemente, Castejón, Chandra, "Efficient Parallel Methods for Deep Reinforcement Learning" (2017, arXiv:1708.05144). Synchronous A2C matches asynchronous A3C.
Next Topics
The natural next steps from actor-critic methods:
- RLHF and alignment: applying PPO to language model training from human preferences
- DPO vs. GRPO vs. RL reasoning: modern alternatives and extensions to the PPO-based RLHF pipeline
Last reviewed: April 24, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Q-Learning (layer 2, tier 1)
- Policy Gradient Theorem (layer 3, tier 1)
- Temporal Difference Learning (layer 2, tier 2)
- Reward Systems and Reinforcement Learning Neuroscience (layer 4, tier 3)
Derived topics
- DDPG: Deep Deterministic Policy Gradient (layer 3, tier 2)
- Policy Optimization: PPO and TRPO (layer 3, tier 2)
- RLHF and Alignment (layer 4, tier 2)
- DPO vs GRPO vs RL for Reasoning (layer 5, tier 2)
- Deep RL for Control (layer 4, tier 3)