LLM Construction
Tool-Augmented Reasoning
LLMs that call external tools during reasoning: Toolformer for learning when to invoke APIs, ReAct for interleaving thought and action, function calling and MCP for structured invocation, and code-as-thought for replacing verbal arithmetic with executed programs.
Prerequisites
- Chain-of-Thought and Reasoning
- Transformer Architecture
- Agentic RL and Tool Use
- Prompt Engineering and In-Context Learning
Why This Matters
LLMs are unreliable at arithmetic, symbolic manipulation, and factual recall. A 70B parameter model asked to multiply two 5-digit numbers will frequently produce the wrong answer. But a model that calls a calculator gets it right every time. Tool-augmented reasoning is the practical strategy for converting unreliable verbal computation into reliable verified computation.
The core observation: LLMs are good at deciding what to compute but bad at doing the computation. Tools handle the computation. The model handles the planning.
Mental Model
Think of the LLM as a planner that decomposes a problem into steps, some of which require external execution. When the model encounters a subproblem it cannot solve reliably (arithmetic, database lookup, code execution), it emits a tool call, receives the result, and continues reasoning with that result as given. The model's job shifts from being an oracle to being a dispatcher.
Retrieval-augmented generation is a special case of tool-augmented reasoning where the tool is a dense or sparse retriever (Lewis et al. 2020). The same dispatch pattern applies: the model decides when external knowledge is needed, issues a query, and conditions on the returned passages.
Formal Setup
Let $p_\theta$ denote a language model and $\mathcal{T} = \{t_1, \ldots, t_K\}$ a set of available tools. Each tool $t_k$ takes a text input and returns a text output. A tool-augmented generation produces a sequence of interleaved reasoning tokens and tool calls:

$$y = (r_1, c_1, o_1, r_2, c_2, o_2, \ldots, r_n)$$

where $r_i$ are reasoning segments, $c_i$ are tool arguments, and $o_i = t_{k_i}(c_i)$ are tool outputs inserted back into the context.
Tool-Augmented Language Model
A tool-augmented LM is a pair $(p_\theta, \mathcal{T})$ where $p_\theta$ is a language model trained or prompted to emit structured tool calls, and $\mathcal{T}$ is a set of external functions the model can invoke during generation. The model generates tokens autoregressively but may pause generation, call a tool, receive output, and resume generation conditioned on the tool output.
Toolformer: Self-Supervised Tool Learning
Toolformer (Schick et al., 2023) trains a model to decide when and how to call tools, using self-supervision rather than human demonstrations.
The procedure:
- Start with a pretrained LM and a set of APIs (calculator, search, calendar, etc.)
- For each training example, sample candidate tool calls at each position
- Execute each candidate call and check if inserting the result reduces loss on subsequent tokens
- Keep the calls that reduce loss; discard the rest
- Fine-tune the model on the augmented dataset
The key insight: the model learns to call tools exactly when doing so improves its own next-token prediction. No human annotation of "when to use a tool" is needed.
Toolformer Filtering Criterion
A candidate tool call at position $i$ is kept if and only if:

$$L_i^{-} - L_i^{+} \geq \tau$$

where $L_i^{-}$ is the loss on tokens following position $i$ without the tool call, $L_i^{+}$ is the loss when the tool result is included, and $\tau > 0$ is a threshold. The call is useful when the loss without the tool exceeds the loss with the tool by at least $\tau$.
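The filtering criterion can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: `loss_after` stands in for a real per-token loss computation over the LM, and the `[Result: ...]` insertion format is illustrative.

```python
def keep_call(prefix, tool_result, continuation, loss_after, tau=1.0):
    """Toolformer-style filter: keep a candidate tool call iff inserting
    its result lowers the LM loss on the following tokens by at least tau.

    `loss_after(prefix, continuation)` is an assumed helper returning the
    model's negative log-likelihood of `continuation` given `prefix`.
    """
    loss_without = loss_after(prefix, continuation)
    loss_with = loss_after(prefix + f" [Result: {tool_result}] ", continuation)
    return loss_without - loss_with >= tau
```

Calls whose results genuinely help prediction (a calculator result right before the answer digits) pass the filter; calls at positions where the model was already confident do not.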
Gorilla: Learning to Invoke APIs
Gorilla (Patil et al., 2023) is an early and influential academic work on training LLMs to invoke APIs from natural-language descriptions. The authors fine-tune LLaMA on a corpus of API documentation paired with usage examples covering HuggingFace, TorchHub, and TensorFlow Hub. Gorilla outputs syntactically correct API calls for queries the model has never seen, including retrieval-aware variants that condition on the current documentation to handle API version drift. Gorilla established that invoking APIs can be framed as a supervised learning problem over (intent, documentation, call) triples rather than purely as in-context prompting.
Function Calling and MCP
Two industry protocols make tool invocation structured and portable.
Function calling (OpenAI, 2023) is an API convention where the developer declares tool signatures as JSON schemas and the model emits a structured call object with the selected tool name and argument values. The model no longer writes free-form tool syntax inside its reasoning; instead, the serving stack returns a typed call that the client executes. Anthropic, Google, and open-source runtimes (vLLM, llama.cpp) have since adopted compatible tool-calling schemas.
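A tool declaration in this convention looks roughly like the following. The schema shape (name, description, JSON-schema parameters) follows the public function-calling docs; the `get_exchange_rate` tool and its fields are illustrative assumptions.

```python
import json

# Developer-declared tool signature as a JSON schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_exchange_rate",
        "description": "Return the exchange rate between two currencies.",
        "parameters": {
            "type": "object",
            "properties": {
                "base": {"type": "string", "description": "ISO 4217 code, e.g. USD"},
                "quote": {"type": "string", "description": "ISO 4217 code, e.g. EUR"},
            },
            "required": ["base", "quote"],
        },
    },
}]

# The model emits a typed call object rather than free-form text;
# the client parses the arguments and executes the function itself.
model_call = {"name": "get_exchange_rate",
              "arguments": '{"base": "USD", "quote": "EUR"}'}
args = json.loads(model_call["arguments"])
```

The point of the convention is the division of labor: the model selects a tool and fills arguments, and the client validates and runs the call.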
Model Context Protocol (Anthropic, 2024; https://modelcontextprotocol.io/) is an open standard for connecting LLMs to external tools and data sources. MCP separates tool providers (servers exposing resources, prompts, and tools) from tool clients (LLM hosts like Claude Desktop or IDE plugins) over a JSON-RPC transport. Where function calling standardizes the call format inside a single API session, MCP standardizes how tools and context are discovered, authenticated, and shared across hosts.
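On the wire, an MCP tool invocation is a JSON-RPC 2.0 request. The `tools/call` method name comes from the MCP specification; the `read_file` tool and its argument here are illustrative assumptions, not a real server's interface.

```python
import json

# JSON-RPC 2.0 request a client (LLM host) sends to an MCP server
# to invoke one of the tools the server advertised via tools/list.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "read_file", "arguments": {"path": "notes.txt"}},
}
wire = json.dumps(request)  # sent over the stdio or HTTP transport
```

Because the framing is protocol-level rather than API-session-level, any MCP-speaking host can discover and call the same server's tools.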
ReAct: Interleaved Reasoning and Acting
ReAct (Yao et al., 2023) structures tool-augmented generation as an alternation of thought steps and action steps.
The format:
- Thought: the model reasons about what it knows and what it needs
- Action: the model calls a tool (search, lookup, calculate)
- Observation: the tool returns a result
- Repeat until the model emits a final answer
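The loop above can be sketched as follows. This is a minimal sketch: `model` and the `tools` mapping are illustrative stand-ins for a real LM call and real tool implementations, and the `Action: tool[input]` syntax is one common convention.

```python
def parse_action(step):
    """Extract tool name and argument from a trailing 'Action: tool[input]' line."""
    line = step.splitlines()[-1]
    name, _, rest = line.removeprefix("Action: ").partition("[")
    return name, rest.rstrip("]")

def react_loop(question, model, tools, max_steps=8):
    """Thought-action-observation loop: prompt, act, observe, repeat."""
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(context)  # emits "Thought: ...\nAction: tool[input]" or a final answer
        context += step + "\n"
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        name, arg = parse_action(step)
        observation = tools[name](arg)              # execute the chosen tool
        context += f"Observation: {observation}\n"  # feed the result back to the model
    return None  # no final answer within the step budget
```

The growing `context` string is the whole state: each observation becomes part of the prompt for the next thought.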
ReAct is a prompting strategy, not a training method. It works with any instruction-following model. The explicit thought steps serve two purposes: they give the model space to plan, and they make the reasoning chain interpretable to humans.
ReAct is an empirical observation, not a proven expressivity result. On interactive benchmarks where external state is queryable (HotpotQA, ALFWorld), interleaving reasoning with actions helps because the model can ground intermediate claims in retrieved facts rather than hallucinating them. It is not the case that ReAct is provably more capable than CoT in a formal sense; the gains depend on the benchmark admitting useful tool calls.
Yao et al. (2023) report that on HotpotQA, ReAct alone is roughly competitive with chain-of-thought, and the best exact-match scores come from combining the two (backing off between CoT self-consistency and ReAct); on FEVER fact verification, ReAct beats CoT outright. On ALFWorld, ReAct substantially outperforms the act-only baseline on decision-making tasks. Chain-of-thought can reason but hallucinates facts. Act-only retrieves facts but does not plan multi-step reasoning. ReAct combines both: think about what you need, retrieve it, then reason with the retrieved facts.
The thought-action-observation loop is the backbone of most modern agent frameworks (LangChain, AutoGPT, and successors). A common failure mode is that the model formulates a bad query, the retrieval tool returns irrelevant results, and the model incorporates those results into its reasoning and produces a confident but wrong answer. ReAct also increases token cost substantially because of the verbose thought-action format.
Reflexion and Self-Refine: Feedback-Driven Revision
Two follow-ups extend ReAct with structured self-correction.
Reflexion (Shinn et al., 2023, NeurIPS; arXiv:2303.11366) treats the environment's reward signal as verbal feedback. After a failed trajectory, the model writes a natural-language reflection on what went wrong and stores it in an episodic memory buffer. The next episode conditions on the reflection. This is reinforcement learning with verbal, not gradient, updates: no weights change, but policy behavior improves across episodes because the context grows richer.
Self-Refine (Madaan et al., 2023, NeurIPS; arXiv:2303.17651) applies the same feedback-then-revise loop at a single-step level. The model produces an answer, critiques its own answer, then rewrites. Unlike Reflexion, Self-Refine does not require an external reward signal; the critique comes from the same model. It works when the model can recognize errors it cannot avoid making on the first pass.
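The single-step loop can be sketched as follows, with `generate`, `critique`, and `revise` as illustrative stand-ins for three different prompts to the same model.

```python
def self_refine(task, generate, critique, revise, max_rounds=3):
    """Generate an answer, self-critique it, and revise until the critique passes."""
    answer = generate(task)
    for _ in range(max_rounds):
        feedback = critique(task, answer)
        if feedback == "OK":   # the model judges its own answer acceptable
            return answer
        answer = revise(task, answer, feedback)
    return answer  # best effort after max_rounds revisions
```

Reflexion has the same shape, except the feedback comes from an environment reward after a whole trajectory and is stored across episodes rather than within one answer.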
Computer Use: Tool-Augmented LLMs Controlling GUIs
The current deployment frontier is tool-augmented LLMs that control arbitrary graphical interfaces. Anthropic shipped a computer-use API (2024) that lets Claude take screenshots, move a cursor, click, and type. Google DeepMind's Project Mariner and OpenAI's Operator (2025) pursue the same pattern: a vision-language model receives pixels, emits keyboard and mouse actions, and iterates. This generalizes function calling from a typed schema of structured tools to the universal tool of a screen and input devices. The theoretical picture is the same as ReAct (thought, action, observation), but the action space is continuous pixel coordinates and the observation space is a rendered desktop.
Code-as-Thought
A specific and powerful form of tool augmentation: instead of reasoning verbally about a computation, generate code, execute it, and use the output.
For mathematical problems, code execution is far more reliable than verbal chain-of-thought. A model asked "What is the 50th Fibonacci number?" can either attempt mental arithmetic (error-prone) or write a short Python program (exact).
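For example, the program might be (iterative, so it stays exact with Python's arbitrary-precision integers):

```python
# Compute the 50th Fibonacci number exactly (with F(1) = F(2) = 1).
a, b = 0, 1
for _ in range(50):
    a, b = b, a + b
print(a)  # 12586269025
```

The model only has to specify the recurrence; the runtime does the fifty additions without error.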
Code Execution Eliminates Computation Errors on Verifiable Tasks
Statement
Let $p_{\text{verbal}}$ be the probability that a model produces the correct answer via verbal reasoning, and let $p_{\text{code}}$ be the probability that it generates correct code. On tasks where correctness is verifiable (arithmetic, symbolic manipulation, data processing):

$$p_{\text{success}} = p_{\text{code}} + (1 - p_{\text{code}})\, p_{\text{retry}} > p_{\text{verbal}}$$

where $p_{\text{retry}}$ accounts for the model retrying after execution errors. Empirically, for computation-heavy benchmarks, $p_{\text{code}} \gg p_{\text{verbal}}$. Tool-augmented systems achieve top scores on MATH when equipped with verified computation (Wolfram and Python execution, Lean theorem prover); exact figures vary by system and date and should be read from the current leaderboard at https://paperswithcode.com/sota/math-word-problem-solving-on-math.
Intuition
Verbal reasoning requires the model to be a calculator, a symbolic algebra system, and a programmer all at once, using only next-token prediction. Code execution offloads the computation to a correct-by-construction runtime. The model only needs to specify the computation, not perform it.
Why It Matters
Adding code execution is among the largest measured accuracy gains of any single intervention for LLM reasoning on computational benchmarks. For any task involving computation, code-as-thought should be the default strategy. Lean-verified pipelines (Kimina, DeepSeek-Prover variants) extend the idea from executing numerical code to executing proof scripts, turning mathematical claims into machine-checked theorems.
Failure Mode
Code execution fails when the model writes incorrect code that still runs without error (logical bugs, not syntax errors). It also fails on tasks that are not easily expressible as code: common sense reasoning, ethical judgments, creative writing. The model must also handle the case where code produces runtime errors, which requires the ability to debug and retry.
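A minimal sketch of the debug-and-retry loop: `generate_code` is an assumed model call, and in practice the `exec` would run in a sandboxed interpreter.

```python
import traceback

def run_with_retries(task, generate_code, max_attempts=3):
    """Generate code, execute it, and feed runtime errors back for a retry.

    Note: silent logic bugs are NOT caught here. Code that runs without
    error but computes the wrong thing passes straight through.
    """
    feedback = None
    for _ in range(max_attempts):
        code = generate_code(task, feedback)  # assumed LM call; sees the prior error
        env = {}
        try:
            exec(code, env)           # untrusted code: sandbox this in practice
            return env.get("answer")  # convention: code stores its result in `answer`
        except Exception:
            feedback = traceback.format_exc()  # error trace becomes retry feedback
    return None
```

The convention that the generated code leaves its result in a variable named `answer` is an assumption of this sketch; real code interpreters capture stdout or the last expression instead.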
Why Tool Use Improves Reliability on Verifiable Tasks
The key property is verifiability. For tasks where correctness can be checked (math: does the answer satisfy the equation? code: does it pass tests? search: does the source confirm the claim?), tool use converts an unreliable stochastic process (LLM generation) into a reliable deterministic one (tool execution).
This does not help for tasks without clear verification criteria. Generating a persuasive essay cannot be verified by a tool. Multiplying two matrices can.
Common Confusions
Tool augmentation is not fine-tuning
Toolformer and Gorilla require fine-tuning, but most deployed tool-augmented systems (ReAct, function calling, MCP, code interpreter) work via prompting or API design. The model does not need to be retrained to use tools; it needs to be instructed on the tool format and given examples.
More tools do not always help
Adding irrelevant tools increases the probability that the model calls the wrong one. Tool selection is itself a reasoning task that can fail. Systems with 3-5 well-chosen tools typically outperform systems with 50 poorly documented ones.
Tool use does not eliminate hallucination
A model can still hallucinate in the reasoning steps between tool calls. It can also misinterpret tool outputs. Tool use reduces hallucination on verifiable subproblems but does not address hallucination in planning, interpretation, or synthesis steps.
ReAct is not proved superior to CoT
ReAct empirically improves grounding on interactive benchmarks that admit useful tool calls (HotpotQA, FEVER, ALFWorld, WebShop); on some of these, ReAct alone does not beat chain-of-thought, and the best results combine the two. This is a measured regularity, not a formal expressivity theorem. On tasks that do not benefit from external state (closed-form math with no lookup), ReAct can be slower and no more accurate than CoT.
Summary
- LLMs are good at deciding what to compute, bad at doing the computation
- Toolformer: self-supervised learning of when to call tools, filtering by loss reduction
- Gorilla: fine-tuning to invoke APIs from natural-language intent
- Function calling and MCP: industry protocols for structured, portable tool invocation
- ReAct: thought-action-observation loop for interleaved reasoning and tool use (empirical, not proven)
- Reflexion and Self-Refine: feedback-then-revise loops that update context, not weights
- Computer Use: tool-augmented LLMs controlling GUIs via screenshot and action
- Code-as-thought: generate and execute code instead of verbal reasoning
- RAG is a special case: the retriever is the tool
- The improvement comes from verifiability: tools provide deterministic computation
- More tools is not always better; tool selection is itself an error-prone step
Exercises
Problem
A model achieves 60% accuracy on arithmetic word problems using verbal chain-of-thought. With code interpreter, it generates correct Python code 80% of the time. When the code has a bug, the model successfully debugs and retries with 50% probability. What is the overall accuracy with the code interpreter?
Problem
Design a Toolformer-style filtering experiment. You have a pretrained LM, a calculator API, and a dataset of math word problems. Describe the steps to determine which positions in the training text benefit from calculator calls. What loss function do you use for filtering, and what is the threshold $\tau$?
References
Canonical:
- Schick, Dwivedi-Yu, Dessi, Raileanu, Lomeli, Zettlemoyer, Cancedda, Scialom, "Toolformer: Language Models Can Teach Themselves to Use Tools", NeurIPS 2023, arXiv:2302.04761
- Yao, Zhao, Yu, Du, Shafran, Narasimhan, Cao, "ReAct: Synergizing Reasoning and Acting in Language Models", ICLR 2023, arXiv:2210.03629
- Lewis, Perez, Piktus, Petroni, Karpukhin, Goyal, Kuttler, Lewis, Yih, Rocktaschel, Riedel, Kiela, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", NeurIPS 2020, arXiv:2005.11401
API invocation and protocols:
- Patil, Zhang, Wang, Gonzalez, "Gorilla: Large Language Model Connected with Massive APIs", 2023, arXiv:2305.15334
- OpenAI, "Function calling and other API updates", 2023 (industry-standard structured tool-calling API), https://openai.com/index/function-calling-and-other-api-updates/
- Anthropic, "Model Context Protocol", 2024 (open standard for LLM-tool connectivity), https://modelcontextprotocol.io/
Feedback-driven revision:
- Shinn, Cassano, Gopinath, Narasimhan, Yao, "Reflexion: Language Agents with Verbal Reinforcement Learning", NeurIPS 2023, arXiv:2303.11366
- Madaan, Tandon, Gupta, Hallinan, Gao, Wiegreffe, Alon, Dziri, Prabhumoye, Yang, Gupta, Majumder, Hermann, Welleck, Yazdanbakhsh, Clark, "Self-Refine: Iterative Refinement with Self-Feedback", NeurIPS 2023, arXiv:2303.17651
Computer use / GUI control:
- Anthropic, "Introducing computer use", 2024, https://www.anthropic.com/news/3-5-models-and-computer-use
- Google DeepMind, "Project Mariner", 2024
- OpenAI, "Operator", 2025
Code-as-thought:
- Chen, Ma, Wang, Cohen, "Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks", TMLR 2023, arXiv:2211.12588
- Gao, Madaan, Zhou, Alon, Liu, Yang, Callan, Neubig, "PAL: Program-Aided Language Models", ICML 2023, arXiv:2211.10435
- Parisi, Zhao, Fiedel, "TALM: Tool Augmented Language Models", 2022, arXiv:2205.12255
Benchmarks:
- Yang, Qi, Zhang, Bengio, Cohen, Salakhutdinov, Manning, "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering", EMNLP 2018, arXiv:1809.09600
- Hendrycks, Burns, Kadavath, Arora, Basart, Tang, Song, Steinhardt, "Measuring Mathematical Problem Solving with the MATH Dataset", NeurIPS 2021, arXiv:2103.03874
Next Topics
The natural next steps from tool-augmented reasoning:
- Agent protocols (MCP, A2A): standardized interfaces for tool communication
- Structured output and constrained generation: ensuring tool calls are well-formed
- Multimodal RAG: retrieval as a tool over text, images, and tables
Last reviewed: April 18, 2026