LLM Construction
Tool-Augmented Reasoning
LLMs that call external tools during reasoning: Toolformer for learning when to invoke APIs, ReAct for interleaving thought and action, function calling and MCP for structured invocation, and code-as-thought for replacing verbal arithmetic with executed programs.
Prerequisites
- Chain-of-Thought and Reasoning
- Transformer Architecture
- Agentic RL and Tool Use
- Prompt Engineering and In-Context Learning
Why This Matters
LLMs are unreliable at arithmetic, symbolic manipulation, and factual recall. A 70B parameter model asked to multiply two 5-digit numbers will frequently produce the wrong answer. But a model that calls a calculator gets it right every time. Tool-augmented reasoning is the practical strategy for converting unreliable verbal computation into reliable verified computation.
The core observation: LLMs are good at deciding what to compute but bad at doing the computation. Tools handle the computation. The model handles the planning.
Mental Model
Think of the LLM as a planner that decomposes a problem into steps, some of which require external execution. When the model encounters a subproblem it cannot solve reliably (arithmetic, database lookup, code execution), it emits a tool call, receives the result, and continues reasoning with that result as given. The model's job shifts from being an oracle to being a dispatcher.
Retrieval-augmented generation is a special case of tool-augmented reasoning where the tool is a dense or sparse retriever (Lewis et al. 2020). The same dispatch pattern applies: the model decides when external knowledge is needed, issues a query, and conditions on the returned passages.
Formal Setup
Let $p_\theta$ denote a language model and $\mathcal{T} = \{t_1, \ldots, t_K\}$ a set of available tools. Each tool $t_k$ takes a text input and returns a text output. A tool-augmented generation produces a sequence of interleaved reasoning tokens and tool calls:

$$y = (r_1, c_1, o_1, r_2, c_2, o_2, \ldots, r_n)$$

where $r_i$ are reasoning segments, $c_i$ are tool arguments, and $o_i = t_{k_i}(c_i)$ are tool outputs inserted back into the context.
Tool-Augmented Language Model
A tool-augmented LM is a pair $(p_\theta, \mathcal{T})$ where $p_\theta$ is a language model trained or prompted to emit structured tool calls, and $\mathcal{T}$ is a set of external functions the model can invoke during generation. The model generates tokens autoregressively but may pause generation, call a tool, receive output, and resume generation conditioned on the tool output.
Toolformer: Self-Supervised Tool Learning
Toolformer (Schick et al., 2023) trains a model to decide when and how to call tools, using self-supervision rather than human demonstrations.
The procedure:
- Start with a pretrained LM and a set of APIs (calculator, search, calendar, etc.)
- For each training example, sample candidate tool calls at each position
- Execute each candidate call and check if inserting the result reduces loss on subsequent tokens
- Keep the calls that reduce loss; discard the rest
- Fine-tune the model on the augmented dataset
The key insight: the model learns to call tools exactly when doing so improves its own next-token prediction. No human annotation of "when to use a tool" is needed.
Toolformer Filtering Criterion
A candidate tool call at position $i$ is kept if and only if:

$$L_i^{-} - L_i^{+} \geq \tau$$

where $L_i^{-}$ is the loss on tokens following position $i$ without the tool call, $L_i^{+}$ is the loss when the tool result is included, and $\tau > 0$ is a threshold. The call is useful when the loss without the tool exceeds the loss with the tool by at least $\tau$.
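The filtering criterion can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: `loss_after` stands in for a real per-token loss computation over the LM, and the `[Result: ...]` insertion format is illustrative.

```python
def keep_call(prefix, tool_result, continuation, loss_after, tau=1.0):
    """Toolformer-style filter: keep a candidate tool call iff inserting
    its result lowers the LM loss on the following tokens by at least tau.

    `loss_after(prefix, continuation)` is an assumed helper returning the
    model's negative log-likelihood of `continuation` given `prefix`.
    """
    loss_without = loss_after(prefix, continuation)
    loss_with = loss_after(prefix + f" [Result: {tool_result}] ", continuation)
    return loss_without - loss_with >= tau
```

Calls whose results genuinely help prediction (a calculator result right before the answer digits) pass the filter; calls at positions where the model was already confident do not.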
Gorilla: Learning to Invoke APIs
Gorilla (Patil et al., 2023) is an early and influential academic work on training LLMs to invoke APIs from natural-language descriptions. The authors fine-tune LLaMA on a corpus of API documentation paired with usage examples covering HuggingFace, TorchHub, and TensorFlow Hub. Gorilla outputs syntactically correct API calls for queries the model has never seen, including retrieval-aware variants that condition on the current documentation to handle API version drift. Gorilla established that invoking APIs can be framed as a supervised learning problem over (intent, documentation, call) triples rather than purely as in-context prompting.
Function Calling and MCP
Two industry protocols make tool invocation structured and portable.
Function calling (OpenAI, 2023) is an API convention where the developer declares tool signatures as JSON schemas and the model emits a structured call object with the selected tool name and argument values. The model no longer writes free-form tool syntax inside its reasoning; instead, the serving stack returns a typed call that the client executes. Anthropic, Google, and open-source runtimes (vLLM, llama.cpp) have since adopted compatible tool-calling schemas.
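A tool declaration in this convention looks roughly like the following. The schema shape (name, description, JSON-schema parameters) follows the public function-calling docs; the `get_exchange_rate` tool and its fields are illustrative assumptions.

```python
import json

# Developer-declared tool signature as a JSON schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_exchange_rate",
        "description": "Return the exchange rate between two currencies.",
        "parameters": {
            "type": "object",
            "properties": {
                "base": {"type": "string", "description": "ISO 4217 code, e.g. USD"},
                "quote": {"type": "string", "description": "ISO 4217 code, e.g. EUR"},
            },
            "required": ["base", "quote"],
        },
    },
}]

# The model emits a typed call object rather than free-form text;
# the client parses the arguments and executes the function itself.
model_call = {"name": "get_exchange_rate",
              "arguments": '{"base": "USD", "quote": "EUR"}'}
args = json.loads(model_call["arguments"])
```

The point of the convention is the division of labor: the model selects a tool and fills arguments, and the client validates and runs the call.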
Model Context Protocol (Anthropic, 2024; https://modelcontextprotocol.io/) is an open standard for connecting LLMs to external tools and data sources. MCP separates tool providers (servers exposing resources, prompts, and tools) from tool clients (LLM hosts like Claude Desktop or IDE plugins) over a JSON-RPC transport. Where function calling standardizes the call format inside a single API session, MCP standardizes how tools and context are discovered, authenticated, and shared across hosts.
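On the wire, an MCP tool invocation is a JSON-RPC 2.0 request. The `tools/call` method name comes from the MCP specification; the `read_file` tool and its argument here are illustrative assumptions, not a real server's interface.

```python
import json

# JSON-RPC 2.0 request a client (LLM host) sends to an MCP server
# to invoke one of the tools the server advertised via tools/list.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "read_file", "arguments": {"path": "notes.txt"}},
}
wire = json.dumps(request)  # sent over the stdio or HTTP transport
```

Because the framing is protocol-level rather than API-session-level, any MCP-speaking host can discover and call the same server's tools.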
ReAct: Interleaved Reasoning and Acting
ReAct (Yao et al., 2023) structures tool-augmented generation as an alternation of thought steps and action steps.
The format:
- Thought: the model reasons about what it knows and what it needs
- Action: the model calls a tool (search, lookup, calculate)
- Observation: the tool returns a result
- Repeat until the model emits a final answer
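The loop above can be sketched as follows. This is a minimal sketch: `model` and the `tools` mapping are illustrative stand-ins for a real LM call and real tool implementations, and the `Action: tool[input]` syntax is one common convention.

```python
def parse_action(step):
    """Extract tool name and argument from a trailing 'Action: tool[input]' line."""
    line = step.splitlines()[-1]
    name, _, rest = line.removeprefix("Action: ").partition("[")
    return name, rest.rstrip("]")

def react_loop(question, model, tools, max_steps=8):
    """Thought-action-observation loop: prompt, act, observe, repeat."""
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(context)  # emits "Thought: ...\nAction: tool[input]" or a final answer
        context += step + "\n"
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        name, arg = parse_action(step)
        observation = tools[name](arg)              # execute the chosen tool
        context += f"Observation: {observation}\n"  # feed the result back to the model
    return None  # no final answer within the step budget
```

The growing `context` string is the whole state: each observation becomes part of the prompt for the next thought.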
ReAct is a prompting strategy, not a training method. It works with any instruction-following model. The explicit thought steps serve two purposes: they give the model space to plan, and they make the reasoning chain interpretable to humans.
ReAct is an empirical observation, not a proven expressivity result. On interactive benchmarks where external state is queryable (HotpotQA, ALFWorld), interleaving reasoning with actions helps because the model can ground intermediate claims in retrieved facts rather than hallucinating them. It is not the case that ReAct is provably more capable than CoT in a formal sense; the gains depend on the benchmark admitting useful tool calls.
Yao et al. (2023) report that on HotpotQA, ReAct alone is roughly competitive with chain-of-thought, and the best exact-match scores come from combining the two (backing off between CoT self-consistency and ReAct); on FEVER fact verification, ReAct beats CoT outright. On ALFWorld, ReAct substantially outperforms the act-only baseline on decision-making tasks. Chain-of-thought can reason but hallucinates facts. Act-only retrieves facts but does not plan multi-step reasoning. ReAct combines both: think about what you need, retrieve it, then reason with the retrieved facts.
The thought-action-observation loop is the backbone of most modern agent frameworks (LangChain, AutoGPT, and successors). A common failure mode is that the model formulates a bad query, the retrieval tool returns irrelevant results, and the model incorporates those results into its reasoning and produces a confident but wrong answer. ReAct also increases token cost substantially because of the verbose thought-action format.
Reflexion and Self-Refine: Feedback-Driven Revision
Two follow-ups extend ReAct with structured self-correction.
Reflexion (Shinn et al., 2023, NeurIPS; arXiv:2303.11366) treats the environment's reward signal as verbal feedback. After a failed trajectory, the model writes a natural-language reflection on what went wrong and stores it in an episodic memory buffer. The next episode conditions on the reflection. This is reinforcement learning with verbal, not gradient, updates: no weights change, but policy behavior improves across episodes because the context grows richer.
Self-Refine (Madaan et al., 2023, NeurIPS; arXiv:2303.17651) applies the same feedback-then-revise loop at a single-step level. The model produces an answer, critiques its own answer, then rewrites. Unlike Reflexion, Self-Refine does not require an external reward signal; the critique comes from the same model. It works when the model can recognize errors it cannot avoid making on the first pass.
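The single-step loop can be sketched as follows, with `generate`, `critique`, and `revise` as illustrative stand-ins for three different prompts to the same model.

```python
def self_refine(task, generate, critique, revise, max_rounds=3):
    """Generate an answer, self-critique it, and revise until the critique passes."""
    answer = generate(task)
    for _ in range(max_rounds):
        feedback = critique(task, answer)
        if feedback == "OK":   # the model judges its own answer acceptable
            return answer
        answer = revise(task, answer, feedback)
    return answer  # best effort after max_rounds revisions
```

Reflexion has the same shape, except the feedback comes from an environment reward after a whole trajectory and is stored across episodes rather than within one answer.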
Computer Use: Tool-Augmented LLMs Controlling GUIs
The current deployment frontier is tool-augmented LLMs that control arbitrary graphical interfaces. Anthropic shipped a computer-use API (2024) that lets Claude take screenshots, move a cursor, click, and type. Google DeepMind's Project Mariner and OpenAI's Operator (2025) pursue the same pattern: a vision-language model receives pixels, emits keyboard and mouse actions, and iterates. This generalizes function calling from a typed schema of structured tools to the universal tool of a screen and input devices. The theoretical picture is the same as ReAct (thought, action, observation), but the action space is continuous pixel coordinates and the observation space is a rendered desktop.
Code-as-Thought
A specific and powerful form of tool augmentation: instead of reasoning verbally about a computation, generate code, execute it, and use the output.
For mathematical problems, code execution is far more reliable than verbal chain-of-thought. A model asked "What is the 50th Fibonacci number?" can either attempt mental arithmetic (error-prone) or write a short Python program (exact).
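For example, the program might be (iterative, so it stays exact with Python's arbitrary-precision integers):

```python
# Compute the 50th Fibonacci number exactly (with F(1) = F(2) = 1).
a, b = 0, 1
for _ in range(50):
    a, b = b, a + b
print(a)  # 12586269025
```

The model only has to specify the recurrence; the runtime does the fifty additions without error.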
Code Execution Eliminates Computation Errors on Verifiable Tasks
Statement
Let $p_{\text{verbal}}$ be the probability that a model produces the correct answer via verbal reasoning, and let $p_{\text{code}}$ be the probability that it generates correct code. On tasks where correctness is verifiable (arithmetic, symbolic manipulation, data processing):

$$p_{\text{success}} = p_{\text{code}} + (1 - p_{\text{code}})\, p_{\text{retry}} > p_{\text{verbal}}$$

where $p_{\text{retry}}$ accounts for the model retrying after execution errors. Empirically, for computation-heavy benchmarks, $p_{\text{code}} \gg p_{\text{verbal}}$. Tool-augmented systems achieve top scores on MATH when equipped with verified computation (Wolfram and Python execution, Lean theorem prover); exact figures vary by system and date and should be read from the current leaderboard at https://paperswithcode.com/sota/math-word-problem-solving-on-math.
Intuition
Verbal reasoning requires the model to be a calculator, a symbolic algebra system, and a programmer all at once, using only next-token prediction. Code execution offloads the computation to a correct-by-construction runtime. The model only needs to specify the computation, not perform it.
Why It Matters
Adding code execution is among the largest measured accuracy gains of any single intervention for LLM reasoning on computational benchmarks. For any task involving computation, code-as-thought should be the default strategy. Lean-verified pipelines (Kimina, DeepSeek-Prover variants) extend the idea from executing numerical code to executing proof scripts, turning mathematical claims into machine-checked theorems.
Failure Mode
Code execution fails when the model writes incorrect code that still runs without error (logical bugs, not syntax errors). It also fails on tasks that are not easily expressible as code: common sense reasoning, ethical judgments, creative writing. The model must also handle the case where code produces runtime errors, which requires the ability to debug and retry.
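A minimal sketch of the debug-and-retry loop: `generate_code` is an assumed model call, and in practice the `exec` would run in a sandboxed interpreter.

```python
import traceback

def run_with_retries(task, generate_code, max_attempts=3):
    """Generate code, execute it, and feed runtime errors back for a retry.

    Note: silent logic bugs are NOT caught here. Code that runs without
    error but computes the wrong thing passes straight through.
    """
    feedback = None
    for _ in range(max_attempts):
        code = generate_code(task, feedback)  # assumed LM call; sees the prior error
        env = {}
        try:
            exec(code, env)           # untrusted code: sandbox this in practice
            return env.get("answer")  # convention: code stores its result in `answer`
        except Exception:
            feedback = traceback.format_exc()  # error trace becomes retry feedback
    return None
```

The convention that the generated code leaves its result in a variable named `answer` is an assumption of this sketch; real code interpreters capture stdout or the last expression instead.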
Why Tool Use Improves Reliability on Verifiable Tasks
The key property is verifiability. For tasks where correctness can be checked (math: does the answer satisfy the equation? code: does it pass tests? search: does the source confirm the claim?), tool use converts an unreliable stochastic process (LLM generation) into a reliable deterministic one (tool execution).
This does not help for tasks without clear verification criteria. Generating a persuasive essay cannot be verified by a tool. Multiplying two matrices can.
Common Confusions
Tool augmentation is not fine-tuning
Toolformer and Gorilla require fine-tuning, but most deployed tool-augmented systems (ReAct, function calling, MCP, code interpreter) work via prompting or API design. The model does not need to be retrained to use tools; it needs to be instructed on the tool format and given examples.
More tools do not always help
Adding irrelevant tools increases the probability that the model calls the wrong one. Tool selection is itself a reasoning task that can fail. Systems with 3-5 well-chosen tools typically outperform systems with 50 poorly documented ones.
Tool use does not eliminate hallucination
A model can still hallucinate in the reasoning steps between tool calls. It can also misinterpret tool outputs. Tool use reduces hallucination on verifiable subproblems but does not address hallucination in planning, interpretation, or synthesis steps.
ReAct is not proved superior to CoT
ReAct empirically improves grounding on interactive benchmarks that admit useful tool calls (HotpotQA, FEVER, ALFWorld, WebShop); on some of these, ReAct alone does not beat chain-of-thought, and the best results combine the two. This is a measured regularity, not a formal expressivity theorem. On tasks that do not benefit from external state (closed-form math with no lookup), ReAct can be slower and no more accurate than CoT.
Summary
- LLMs are good at deciding what to compute, bad at doing the computation
- Toolformer: self-supervised learning of when to call tools, filtering by loss reduction
- Gorilla: fine-tuning to invoke APIs from natural-language intent
- Function calling and MCP: industry protocols for structured, portable tool invocation
- ReAct: thought-action-observation loop for interleaved reasoning and tool use (empirical, not proven)
- Reflexion and Self-Refine: feedback-then-revise loops that update context, not weights
- Computer Use: tool-augmented LLMs controlling GUIs via screenshot and action
- Code-as-thought: generate and execute code instead of verbal reasoning
- RAG is a special case: the retriever is the tool
- The improvement comes from verifiability: tools provide deterministic computation
- More tools is not always better; tool selection is itself an error-prone step
Exercises
Problem
A model achieves 60% accuracy on arithmetic word problems using verbal chain-of-thought. With code interpreter, it generates correct Python code 80% of the time. When the code has a bug, the model successfully debugs and retries with 50% probability. What is the overall accuracy with the code interpreter?
Problem
Design a Toolformer-style filtering experiment. You have a pretrained LM, a calculator API, and a dataset of math word problems. Describe the steps to determine which positions in the training text benefit from calculator calls. What loss function do you use for filtering, and what is the threshold $\tau$?
References
Canonical:
- Schick, Dwivedi-Yu, Dessi, Raileanu, Lomeli, Zettlemoyer, Cancedda, Scialom, "Toolformer: Language Models Can Teach Themselves to Use Tools", NeurIPS 2023, arXiv:2302.04761
- Yao, Zhao, Yu, Du, Shafran, Narasimhan, Cao, "ReAct: Synergizing Reasoning and Acting in Language Models", ICLR 2023, arXiv:2210.03629
- Lewis, Perez, Piktus, Petroni, Karpukhin, Goyal, Kuttler, Lewis, Yih, Rocktaschel, Riedel, Kiela, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", NeurIPS 2020, arXiv:2005.11401
API invocation and protocols:
- Patil, Zhang, Wang, Gonzalez, "Gorilla: Large Language Model Connected with Massive APIs", 2023, arXiv:2305.15334
- OpenAI, "Function calling and other API updates", 2023 (industry-standard structured tool-calling API), https://openai.com/index/function-calling-and-other-api-updates/
- Anthropic, "Model Context Protocol", 2024 (open standard for LLM-tool connectivity), https://modelcontextprotocol.io/
Feedback-driven revision:
- Shinn, Cassano, Gopinath, Narasimhan, Yao, "Reflexion: Language Agents with Verbal Reinforcement Learning", NeurIPS 2023, arXiv:2303.11366
- Madaan, Tandon, Gupta, Hallinan, Gao, Wiegreffe, Alon, Dziri, Prabhumoye, Yang, Gupta, Majumder, Hermann, Welleck, Yazdanbakhsh, Clark, "Self-Refine: Iterative Refinement with Self-Feedback", NeurIPS 2023, arXiv:2303.17651
Computer use / GUI control:
- Anthropic, "Introducing computer use", 2024, https://www.anthropic.com/news/3-5-models-and-computer-use
- Google DeepMind, "Project Mariner", 2024
- OpenAI, "Operator", 2025
Code-as-thought:
- Chen, Ma, Wang, Cohen, "Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks", TMLR 2023, arXiv:2211.12588
- Gao, Madaan, Zhou, Alon, Liu, Yang, Callan, Neubig, "PAL: Program-Aided Language Models", ICML 2023, arXiv:2211.10435
- Parisi, Zhao, Fiedel, "TALM: Tool Augmented Language Models", 2022, arXiv:2205.12255
Benchmarks:
- Yang, Qi, Zhang, Bengio, Cohen, Salakhutdinov, Manning, "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering", EMNLP 2018, arXiv:1809.09600
- Hendrycks, Burns, Kadavath, Arora, Basart, Tang, Song, Steinhardt, "Measuring Mathematical Problem Solving with the MATH Dataset", NeurIPS 2021, arXiv:2103.03874
Next Topics
The natural next steps from tool-augmented reasoning:
- Agent protocols (MCP, A2A): standardized interfaces for tool communication
- Structured output and constrained generation: ensuring tool calls are well-formed
- Multimodal RAG: retrieval as a tool over text, images, and tables
Last reviewed: April 18, 2026