RL Theory
Agentic RL and Tool Use
LLMs as multi-step policies: observations, tool calls, environment feedback, sparse rewards, credit assignment, and why agent training differs from single-turn RLHF.
Why This Matters
Standard RLHF trains a model to produce a good response to a prompt. Tool-using agents create a harder problem: the model chooses actions over many steps, observes external feedback, and decides when the task is complete.
A chat model is close to a single-turn function: $y \sim \pi(y \mid x)$. A tool-using agent is better modeled as a multi-step policy: $a_t \sim \pi(a_t \mid o_1, a_1, \ldots, o_{t-1}, a_{t-1}, o_t)$, where actions can include code execution, web navigation, file edits, API calls, text responses, and termination.
The RL challenges are harder: horizons are longer, rewards are sparser, actions have real consequences (a wrong API call cannot be undone), and the state space includes the external world. Understanding the formal RL framework for agents explains why building reliable agents is much harder than building good chatbots.
Mental Model
Consider the difference between:
- Chat model: "Explain why this test failed." One response. Done.
- Agent: "Fix the failing test suite." Read the failure. Inspect files. Edit code. Run tests. Interpret the next failure. Repeat until the local evidence says the task is complete.
The agent must decide what to do next at every step, handle failures (a search returns no results, an API errors out), and manage a growing context of past actions and observations. This is a sequential decision problem: the setting RL was built to analyze.
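The observe-act-repeat loop described above can be sketched in a few lines of Python. Here `llm_choose_action` and `environment_step` are hypothetical stand-ins for the model and the environment, not real APIs:

```python
def run_agent(task, llm_choose_action, environment_step, max_steps=20):
    """Minimal agent loop: observe, act, repeat until the policy terminates."""
    history = [("task", task)]  # growing context of past actions and observations
    for _ in range(max_steps):
        action = llm_choose_action(history)      # policy: history -> next action
        if action["type"] == "terminate":        # the agent decides it is done
            return history, action.get("answer")
        observation = environment_step(action)   # tool call hits the environment
        history.append(("action", action))
        history.append(("observation", observation))
    return history, None  # step budget exhausted without termination
```

The `max_steps` budget matters: an agent that never decides to terminate must be cut off, and where that cutoff lands is itself a design decision.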
The Agent as an MDP
LLM Agent as a Markov Decision Process
Statement
An LLM agent can be formulated as a partially observable MDP (POMDP):
- State $s_t$: The full environment state (file system contents, web page state, conversation history, tool outputs). Typically not fully observable.
- Observation $o_t$: What the agent sees. The text representation of the current state (tool output, error message, retrieved content).
- Action $a_t$: The agent's next output. A tool call (code execution, web search, API request), a text response, or a decision to terminate.
- Transition $P(s_{t+1} \mid s_t, a_t)$: The environment dynamics (code execution results, web page responses). Stochastic and partially known.
- Reward $r_t$: Typically sparse. A final reward at task completion (did the agent solve the problem?) with zero intermediate reward.
The agent's policy is the LLM itself: given the history of observations, it generates the next action as a text string.
The objective is to maximize the expected cumulative reward:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$
where $T$ is the (variable) episode length and $\gamma \in [0, 1]$ is the discount factor.
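For a single sampled episode, the inner sum is just a discounted return. A minimal sketch:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over a variable-length episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Sparse-reward episode: zero reward everywhere, 1.0 at completion (step 9).
episode = [0.0] * 9 + [1.0]
```

With sparse rewards the entire return collapses to a single discounted terminal term, which is exactly what makes credit assignment hard later in this page.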
Intuition
The LLM is the policy component. It processes observations (text from the environment), chooses actions (which tool to call, what arguments to use), and updates its next decision based on results. The context window is the agent's working memory: it contains the visible history of actions and observations. When the context window fills up, the agent must compress or summarize, which introduces information loss.
Why It Matters
This formulation connects LLM agents to the vast RL theory literature. Concepts like exploration-exploitation tradeoff, credit assignment, temporal abstraction, and reward shaping all apply directly. The formulation also reveals why agents are hard: long horizons, sparse rewards, and partial observability are exactly the settings where RL struggles most.
Failure Mode
The POMDP formulation assumes the agent's policy is Markov given the full observation history. In practice, the LLM has a finite context window, so it cannot condition on arbitrarily long histories. When episodes exceed the context length, the agent loses access to early observations and actions. This is not just a technical limitation; it means agents cannot be truly Markov for long tasks, introducing a systematic source of error.
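A common mitigation keeps the task statement plus the most recent steps and drops the middle of the history; summarization-based approaches would replace the dropped span with a compressed summary instead. A minimal sketch of the truncation variant:

```python
def truncate_history(history, max_items, keep_head=1):
    """Keep the task/head and the most recent steps when history overflows."""
    if len(history) <= max_items:
        return history
    tail = max_items - keep_head - 1  # reserve one slot for the omission marker
    return history[:keep_head] + ["[earlier steps omitted]"] + history[-tail:]
```

Whatever falls into the omitted span is simply gone from the policy's conditioning set, which is the systematic error source described above.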
Tool Use as Actions
Tool-Augmented LLM
A tool-augmented LLM has access to a set of tools $\mathcal{T} = \{\tau_1, \ldots, \tau_K\}$, each with a typed interface (input schema, output schema). At each step, the agent either:
- Calls a tool: Generates a structured tool call $a_t = (\tau_i, \text{args})$, receives the tool output $o_{t+1} = \tau_i(\text{args})$
- Generates text: Produces a text response (reasoning, answer, etc.)
- Terminates: Signals task completion
The action space is $\mathcal{A} = \mathcal{A}_{\text{tool}} \cup \mathcal{A}_{\text{text}} \cup \{\texttt{terminate}\}$.
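One way to represent this three-branch action space in code; the tool names and fields here are illustrative, not a standard schema:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class ToolCall:
    tool: str   # which tool in the tool set
    args: dict  # arguments matching the tool's input schema

@dataclass
class TextResponse:
    text: str   # free-form reasoning or an answer

@dataclass
class Terminate:
    answer: str  # signal task completion with a final result

Action = Union[ToolCall, TextResponse, Terminate]

# Example actions, one from each branch of the action space:
a1 = ToolCall(tool="run_tests", args={"path": "parse-interval.test.ts"})
a2 = TextResponse(text="The parser lacks a mixed-unit branch.")
a3 = Terminate(answer="Fixed parseInterval.")
```

The union type makes the point concrete: every step, the policy picks not just an argument string but which branch of the action space to use at all.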
Common tool categories:
- Code execution: Python interpreter, shell commands
- Web search: Search engine queries, page retrieval
- API calls: Database queries, external service requests
- File operations: Read, write, edit files
- UI interaction: Browser navigation, click, type
The tool set defines the agent's action space. A model with code execution can test hypotheses that would otherwise remain verbal. A model with browser access can retrieve information outside its weights. A model with file editing can change an artifact and observe the consequences. The choice of tools is a product and safety decision, not just an API detail.
The ReAct Pattern
ReAct (Reasoning + Acting)
The ReAct pattern interleaves reasoning (chain-of-thought) with acting (tool calls) in an explicit loop:
Thought: The failing test names parseInterval, so I should inspect that code.
Action: read_file("src/time/parse-interval.ts")
Observation: The parser handles hours and minutes but not mixed units.
Thought: The failing case probably needs a mixed-unit branch plus a test.
Action: edit_file("src/time/parse-interval.ts", patch)
Observation: File updated.
Thought: Now I should run the narrow test before touching anything else.
Action: run_tests("parse-interval.test.ts")
Observation: The failing test now passes.
Thought: I can stop after checking the relevant suite.
Action: respond("Fixed parseInterval and verified the targeted tests.")
Each cycle consists of: Thought (reasoning about what to do), Action (tool call or response), Observation (result from the environment). The LLM generates the Thought and Action; the environment generates the Observation.
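A minimal ReAct driver can be written in a few lines. This sketch assumes a hypothetical `llm` function that emits Thought/Action text, a `tools` dict of callables, and a simple `respond(...)` terminal action; real implementations use structured tool-call formats rather than regex parsing:

```python
import re

def react_loop(llm, tools, task, max_cycles=10):
    """Run Thought -> Action -> Observation cycles until respond(...)."""
    transcript = f"Task: {task}\n"
    for _ in range(max_cycles):
        step = llm(transcript)  # model emits "Thought: ...\nAction: tool(arg)"
        transcript += step + "\n"
        m = re.search(r'Action:\s*(\w+)\((.*)\)', step)
        if m is None:
            continue  # no parseable action this cycle; ask the model again
        name, arg = m.group(1), m.group(2).strip('"')
        if name == "respond":  # terminal action: return the final answer
            return arg
        obs = tools[name](arg)  # environment produces the Observation
        transcript += f"Observation: {obs}\n"
    return None
```

Note that the LLM only ever sees the transcript string: the Thought/Action/Observation structure is entirely a convention layered on top of text generation.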
ReAct remains a common baseline pattern because it:
- Makes reasoning explicit and inspectable
- Allows the agent to plan before acting
- Provides a natural structure for multi-step problem solving
- Exposes the plan and action trace for debugging
The limitation: explicit reasoning traces consume context and can become unreliable evidence about the model's internal computation. For long tasks, the growing history of thoughts, actions, and observations must be compressed, summarized, or replaced by external memory.
Training Agentic Policies
Policy Gradient for Tool-Augmented Agents
Statement
For an agent executing a trajectory $\tau = (o_1, a_1, \ldots, o_T, a_T)$ with episode reward $R(\tau)$, the policy gradient is:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid o_{1:t}, a_{1:t-1}) \, R(\tau)\right]$$
Each action $a_t$ is a sequence of tokens $(w_{t,1}, \ldots, w_{t,K_t})$ (the tool call or text output), so:
$$\log \pi_\theta(a_t \mid o_{1:t}, a_{1:t-1}) = \sum_{k=1}^{K_t} \log \pi_\theta(w_{t,k} \mid o_{1:t}, a_{1:t-1}, w_{t,1:k-1})$$
The gradient reinforces entire action sequences (tool calls with arguments) that led to successful episodes.
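A toy REINFORCE gradient for a tabular softmax policy shows the structure of the estimator: the single episode reward multiplies the log-probability gradient of every action taken. This is a sketch over a made-up 3-action space, not an LLM policy:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_grad(logits, actions, episode_reward):
    """Gradient of sum_t log pi(a_t) * R w.r.t. shared logits."""
    grad = np.zeros_like(logits)
    probs = softmax(logits)
    for a in actions:
        onehot = np.zeros_like(logits)
        onehot[a] = 1.0
        # d/dlogits of log softmax(a) is (onehot - probs); scale by the
        # same episode-level reward for every step in the trajectory.
        grad += (onehot - probs) * episode_reward
    return grad

logits = np.zeros(3)  # uniform policy over 3 actions
g = reinforce_grad(logits, actions=[0, 2, 0], episode_reward=1.0)
```

Every action in the episode receives the same scalar weight $R(\tau)$, which is exactly the credit-assignment problem discussed next.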
Intuition
The policy gradient pushes the agent to repeat actions that led to high reward and avoid actions that led to low reward. But the reward comes only at the end of a long episode. Which of the 20 actions was responsible for success? This is the credit assignment problem: the fundamental difficulty of RL with sparse rewards over long horizons.
Why It Matters
This is the mathematical framework for training agents with RL. It shows why agentic RL is harder than chat RLHF: the sum over timesteps introduces high variance, the sparse reward provides weak signal per action, and the combinatorial action space (all possible tool calls with all possible arguments) is enormous.
Failure Mode
With sparse rewards and long horizons, the REINFORCE estimator has extremely high variance. A 20-step episode with binary reward gives each action a gradient proportional to the same episode-level reward, regardless of whether that specific action contributed to success. Variance reduction techniques (baselines, advantage estimation) help but do not fully solve the problem. This is why most agentic RL systems supplement sparse rewards with shaped intermediate rewards (e.g., partial credit for intermediate progress).
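A quick numeric illustration of why baselines help: for a zero-mean score term, the variance of the REINFORCE estimator scales with the second moment of the weight $(R - b)$, which the mean-reward baseline minimizes. This is a toy computation over simulated binary episode rewards, not an agent:

```python
import random

random.seed(0)
# Simulate binary episode rewards with ~30% success, as in sparse-reward training.
rewards = [1.0 if random.random() < 0.3 else 0.0 for _ in range(100_000)]
baseline = sum(rewards) / len(rewards)  # mean-reward baseline

def second_moment(xs):
    return sum(x * x for x in xs) / len(xs)

# E[R^2] vs E[(R - b)^2]: the baseline shrinks the magnitude of the
# gradient weights without changing the gradient's expectation.
without_baseline = second_moment(rewards)
with_baseline = second_moment([r - baseline for r in rewards])
```

For a success rate $p$, the second moment drops from $p$ to $p(1-p)$; at $p = 0.3$ that is roughly a 30% reduction, helpful but far from eliminating the variance of a 20-step sparse-reward episode.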
How Agentic RL Differs from Chat RLHF
| Property | Chat RLHF | Agentic RL |
|---|---|---|
| Horizon | 1 turn (single response) | 5-100+ turns |
| Reward | Dense (reward per response) | Sparse (reward at task completion) |
| Actions | Text generation | Tool calls + text |
| State | Fixed prompt | Evolving environment |
| Consequences | None (just text) | Real (code runs, files change) |
| Failure recovery | N/A | Must handle errors and retry |
| Credit assignment | Trivial (one action) | Hard (many actions) |
The key distinction: single-turn chat RLHF is closer to a contextual bandit, while agentic RL is a sequential decision problem with exploration, credit assignment, partial observability, and long horizons.
Multimodal Agents
Agent evaluation increasingly extends beyond text-only prompts to multimodal and interactive environments:
- UI agents: Navigate graphical interfaces by observing screenshots or DOM state and producing clicks, typing actions, and browser navigation
- Embodied agents: Interact with physical or simulated environments
- Multi-tool agents: Combine code execution, web browsing, file editing, and API calls in a single episode
Multimodal agents process visual observations (screenshots) alongside text, expanding the observation space and adding new action types (click at coordinates, scroll, type into a field).
"Agents" does not mean AGI. In this page, an LLM agent means a language model used as a policy in a multi-step decision loop with tools and feedback. That may be prompted, supervised, preference-trained, or RL-trained. It is not a claim about consciousness or human-like autonomy. Treating "agentic" as synonymous with "autonomous in the human sense" leads to confused safety analysis and inflated capability claims.
Training Infrastructure
Training agentic policies requires infrastructure beyond standard LLM training:
- Environments: Sandboxed execution environments for code, browsers, APIs. Each training episode requires spinning up and tearing down an environment instance.
- Trajectory collection: Episodes are collected by running the agent in the environment, which is much slower than sampling text (tool calls have latency, code execution takes time).
- Reward functions: Task-specific reward functions that check whether the agent completed the objective. Often hand-crafted per task category.
- Safety constraints: The agent must not perform irreversible harmful actions during training (delete important files, send unauthorized emails). Sandboxing is essential.
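A task-specific reward function for a coding task might run the test suite inside the sandbox and return a binary reward. This sketch assumes a hypothetical `run_in_sandbox` callable that executes a command in the sandboxed environment and returns its exit code:

```python
def coding_task_reward(run_in_sandbox, test_command="pytest -q"):
    """Binary sparse reward: 1.0 if the task's test suite passes, else 0.0."""
    exit_code = run_in_sandbox(test_command)  # sandboxed execution only
    return 1.0 if exit_code == 0 else 0.0
```

Even this trivial checker encodes design choices: which tests count, whether partial passes earn partial credit, and what happens if the sandbox itself crashes.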
April 2026 Review: What Is Established
The reliable center of the literature is not that "agents" are one settled architecture. It is that several older ideas now meet in the same system: interactive environments, tool calling, language-conditioned policies, trajectory evaluation, and delayed rewards.
The strongest evidence comes from task suites where success can be checked: web navigation environments, software-issue repair, coding tests, question answering with references, and constrained interactive benchmarks. Those benchmarks are useful because they force the model to act, observe, and revise. They are not universal measures of autonomy. Each benchmark bakes in a tool set, a time budget, an evaluator, and a definition of success.
The practical standard for this topic should be: report the environment, tool budget, scaffold, evaluator, and failure recovery policy. Without those details, an "agent" result is hard to interpret.
Common Confusions
Tool prompting is not the same as agentic RL
Prompting a model with tool descriptions and examples is not RL. It is in-context learning. The model uses its pretrained knowledge to guess how to use tools. Agentic RL actually updates the model's weights based on success or failure in the environment. The distinction matters because an impressive tool transcript is not evidence that the policy was trained from environment rollouts.
Function calling is not the same as agentic reasoning
Function calling (structured tool invocation) is a single action. Agentic reasoning is the ability to plan a sequence of actions, observe results, adapt the plan, handle failures, and decide when to stop. A model that can call functions is not necessarily an agent. It may just be a better-formatted chatbot. The "agentic" property is about multi-step sequential decision-making, not single-step tool invocation.
Longer context does not solve the horizon problem
A longer context window helps the agent remember more of its history, but it does not solve the RL challenges of credit assignment and exploration. Even with infinite context, the agent still needs to figure out which of its many actions was responsible for success (credit assignment) and decide whether to try new strategies versus exploit known ones (exploration). These are fundamental RL problems, not context length problems.
Summary
- LLM agents can be modeled as policies: observation in, action out, multi-step episodes
- Agent MDP: state = environment, action = tool call or text, reward = task completion
- Tool use defines the action space: code execution, web search, APIs, UI
- ReAct pattern: interleave reasoning (Thought) with acting (Action) and observing (Observation)
- Agentic RL is harder than chat RLHF: longer horizons, sparser rewards, real consequences
- Credit assignment over long episodes is the core difficulty
- Policy gradient for agents: REINFORCE over multi-step trajectories with high variance
- "Agent" means a model in an action-feedback loop, not AGI or consciousness
- Training requires sandboxed environments and task-specific reward functions
Exercises
Problem
An LLM agent solves a coding task in 10 steps: 8 actions are code edits and 2 are test executions. The final test passes (reward = 1). Under REINFORCE without a baseline, what gradient does each action receive? Why is this problematic?
Problem
Compare the effective action space of a chat model (single-turn RLHF) versus an agentic model with 5 tools, each taking a string argument of up to 100 tokens. Assuming a vocabulary of 50,000 tokens, estimate the action space sizes and explain the implications for exploration.
Problem
The credit assignment problem in agentic RL can be partially addressed by hindsight analysis: after a successful episode, identify which actions were critical by counterfactual reasoning. Formalize this: define a "criticality score" $c(a_t)$ for action $a_t$ in a successful trajectory $\tau$, and describe how you would estimate it using the model itself.
References
Pre-canonical:
- Sutton, "Temporal Credit Assignment in Reinforcement Learning", PhD thesis, UMass Amherst (1984). The credit-assignment problem that tool-using agents inherit across long multi-step trajectories.
- Harutyunyan et al., "Hindsight Credit Assignment" (NeurIPS 2019). Counterfactual credit reweighting relevant to sparse-reward tool-use trajectories.
Canonical:
- Ouyang et al., "Training Language Models to Follow Instructions with Human Feedback" (InstructGPT, 2022). The instruction-following foundation that many tool-using LLM systems build on.
- Nakano et al., "WebGPT: Browser-assisted question-answering with human feedback" (2021). Browser-assisted question answering with references, imitation learning, and human feedback.
- Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models" (ICLR 2023). The interleaved reasoning-acting prompting pattern.
- Schick et al., "Toolformer: Language Models Can Teach Themselves to Use Tools" (NeurIPS 2023). Self-supervised tool-call insertion.
Current:
- Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning" (NeurIPS 2023). Verbal self-critique and memory as a non-weight-update feedback mechanism.
- Wang et al., "Voyager: An Open-Ended Embodied Agent with Large Language Models" (2023). Skill-library growth in an embodied environment.
- Park et al., "Generative Agents: Interactive Simulacra of Human Behavior" (UIST 2023). Memory and reflection architectures for persistent social simulation.
- Zhou et al., "WebArena: A Realistic Web Environment for Building Autonomous Agents" (2024). Benchmark for realistic web navigation tasks.
- Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" (ICLR 2024). Real-world software-engineering agent benchmark.
- Liu et al., "AgentBench: Evaluating LLMs as Agents" (ICLR 2024). Cross-domain agent evaluation.
- Mialon et al., "GAIA: a benchmark for General AI Assistants" (ICLR 2024). Tool-use and reasoning benchmark for general assistants.
- DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025). RL-trained reasoning with rule-based rewards; relevant to agent training, but not itself a full tool-use benchmark.
Next Topics
The natural next steps from agentic RL:
- Post-training overview: how agent capabilities are built into the training pipeline
- Test-time compute and search: search strategies that agents use at inference time
Last reviewed: April 22, 2026