Daily Paper

Toolformer: Language Models Can Teach Themselves to Use Tools

Demonstrates that language models can learn to autonomously decide when and how to call external tools (calculators, search engines, APIs) by self-generating tool-use training data, establishing a paradigm for agentic AI with tool access.

arXiv:2302.04761 · Empirical Study

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu et al.

tool-use · agentic-ai · self-supervised-learning · api-interaction · autonomous-systems · capability-expansion


Focus: Schick et al. showed that a language model could learn to autonomously call external APIs — including search engines, calculators, translators, and calendars — by generating its own training data for when tool calls are beneficial. This self-bootstrapped tool use represented a significant step toward agentic AI systems with real-world interaction capabilities.


Key Insights

  • Self-supervised tool-use acquisition. Toolformer generated its own tool-use annotations by sampling API calls at various positions in text, then filtering to retain only those calls that improved the model’s predictions. The model learned not just how to use tools but when tool use was appropriate.

  • Tool access expands the capability boundary. With access to a calculator, Toolformer solved arithmetic problems that the base model consistently failed at. With search access, it answered factual questions with higher accuracy. Tool access is a capability multiplier — which means restricting tool access is critical for safety.

  • Autonomous tool selection introduces new failure modes. The model independently decided when to invoke tools and how to interpret their outputs. Incorrect tool invocations created failure modes absent from pure language generation.
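The sampling step in the first insight — proposing API calls at positions where the model itself thinks a call is likely — can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: the function name and threshold values are invented for the example, and the real method uses the model's probability of emitting the special call-start token at each position.

```python
def sample_call_positions(call_probs, tau_s=0.05, k=5):
    """Simplified sketch of Toolformer's candidate-position sampling:
    given the model's probability of starting an API call at each token
    position, keep the (up to) k highest-probability positions whose
    probability exceeds the threshold tau_s."""
    candidates = [(p, i) for i, p in enumerate(call_probs) if p > tau_s]
    candidates.sort(reverse=True)          # highest probability first
    return sorted(i for _, i in candidates[:k])
```

Each surviving position then gets several sampled API calls, which are executed and filtered as described in the methodology below.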

Executive Summary

Toolformer started with a pre-trained language model (GPT-J 6.7B) and taught it to use five external tools:

  1. Question-answering system — for factual queries
  2. Calculator — for arithmetic operations
  3. Wikipedia search — for knowledge retrieval
  4. Machine translation — for cross-lingual tasks
  5. Calendar — for temporal reasoning
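In the paper, tool use is represented as inline text: the model emits a call such as `[Calculator(4*7)]`, the call is executed, and the text is rewritten as `[Calculator(4*7) -> 28]` before generation continues. A minimal sketch of that execute-and-insert step is below; the `TOOLS` registry is a toy stand-in (the paper's tools were backed by real models and services, e.g. a retrieval-based QA system and a translation model), and the restricted `eval` is for illustration only.

```python
import re

# Toy tool registry — a stand-in for the paper's real backends.
TOOLS = {
    "Calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # arithmetic only
    "Calendar": lambda _: "Today is Friday, February 10, 2023.",
}

# Matches inline calls of the form [ToolName(argument)].
CALL_RE = re.compile(r"\[(\w+)\(([^)]*)\)\]")

def execute_calls(text: str) -> str:
    """Replace each bare API call like [Calculator(4*7)] with the
    call plus its result, [Calculator(4*7) -> 28], mimicking
    Toolformer's inline call-and-response format."""
    def run(match):
        tool, arg = match.group(1), match.group(2)
        result = TOOLS[tool](arg)
        return f"[{tool}({arg}) -> {result}]"
    return CALL_RE.sub(run, text)
```

Because the call syntax is plain text, the same format works both for annotating the training corpus and for executing calls at inference time.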

Training Methodology

The training process involved:

  • Annotation. A large text corpus was annotated with potential API calls at positions where a tool might be helpful.

  • Execution. The candidate API calls were actually executed and their results inserted back into the text.

  • Filtering. Only annotations where inserting the call and its result reduced the model’s loss on subsequent tokens by at least a threshold were retained for training.
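The filtering step above can be stated compactly. This is a simplified rendering of the paper's rule, which compares a (weighted) cross-entropy loss over the tokens following the call under three prefixes: call with result, call without result, and no call at all; the function name and example loss values here are illustrative.

```python
def keep_api_call(loss_with_result: float,
                  loss_with_call_only: float,
                  loss_without_call: float,
                  tau: float = 1.0) -> bool:
    """Simplified Toolformer filtering rule: keep an annotated API call
    only if prefixing the call *and its result* lowers the model's loss
    on subsequent tokens by at least tau, relative to the better of
    (a) no call at all and (b) the call without its result."""
    baseline = min(loss_with_call_only, loss_without_call)
    return baseline - loss_with_result >= tau

# A call whose result clearly helps prediction is kept:
keep_api_call(loss_with_result=2.1, loss_with_call_only=4.0,
              loss_without_call=3.8)
```

This criterion is what teaches the model *when* tool use pays off: calls that do not measurably help prediction are simply discarded from the training data.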

Results

The resulting model significantly outperformed the base GPT-J and even GPT-3 (175B) on tasks aligned with available tools. On mathematical reasoning benchmarks, calculator access let Toolformer substantially outperform far larger pure language models.

The Safety Transition Point

The paper demonstrated a key transition in language model capabilities: from passive text generators to active agents that interact with external systems. This shift had profound implications for safety:

  • A Toolformer-like system with access to email, code execution, or file systems could take real-world actions.

  • Safety containment becomes qualitatively more difficult when the model’s actions extend beyond text production.

  • The model’s autonomous decision about when to use tools means developers cannot fully predict or control the model’s interactions with external systems.

Relevance to Failure-First

Toolformer represents the inflection point where language model safety shifts from content generation concerns to action safety concerns:

  • Content-to-action transition. This is the same transition that defines embodied AI. When models can take actions in the world, failures become consequential rather than merely offensive.

  • Capability overhang. The model may access tools in ways developers did not anticipate, creating emergent risk from the combination of tool access and model reasoning.

  • Evaluation complexity. Evaluating the safety of a tool-using agent requires testing not just what it says but what it does — a much larger and more consequential space.


Read the full paper on arXiv · PDF