How LLMs Use Tools


I have always thought of Large Language Models (LLMs) as sophisticated text processors, which is technically true. You give them text, they give you text back. This simple mental model works well until you encounter terms like “tool calling” or the “Model Context Protocol (MCP).” Suddenly, these models aren’t just talking; they’re doing things like calling APIs, running code, and accessing live information.

This led me to a question: if all LLMs do is process and generate text, how can some of them use external tools while others can’t? What separates a model that can check the weather from one that can only write a poem about it?

It’s Not Just a Prompting Trick

My initial hypothesis was that “tool calling” was just a clever form of prompt engineering. I imagined a massive system prompt that painstakingly described a function in natural language, hoping the LLM would generate text that looked like a function call, and if it didn’t, we would just roll the dice again. We developers then parse the output to extract the function name and arguments. That can work to some extent, but it’s brittle and unreliable. We are used to deterministic code execution; LLMs are probabilistic. They generate text, and that text can be anything.

To make function calling atomic, meaning that it either happens or it doesn’t, we use a dedicated, structured part of the API. Modern providers such as Anthropic (Claude) and OpenAI (GPT-4) support a tools parameter in their API calls. This isn’t just a convenience offered by their SDKs; it’s a first-class feature of the model’s interface.

When you make an API call, you pass a list of available tools, each defined with a clear structure:

  • name: A unique identifier for the tool.
  • description: A natural language explanation of what the tool does.
  • input_schema: A machine-readable schema (like JSON Schema) that defines the parameters the tool accepts.

Here is a simplified example of what this looks like in a request to an LLM:

{
  "model": "claude-3-5-sonnet-20240620",
  "messages": [
    {
      "role": "user",
      "content": "What's the weather like in London?"
    }
  ],
  "tools": [
    {
      "name": "get_weather",
      "description": "Get the current weather in a given location.",
      "input_schema": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "The city and state, e.g. San Francisco, CA"
          }
        },
        "required": ["location"]
      }
    }
  ]
}

By providing this structured data outside of the main prompt, we give the model explicit knowledge of its available capabilities. This structured format is crucial for safety and reliability, as it ensures the model knows the exact signature of the tools it can call, dramatically reducing ambiguity.
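To make this concrete, here is a minimal sketch of the same request using the Anthropic Python SDK. It assumes the ANTHROPIC_API_KEY environment variable is set, and the response fields reflect the Messages API as I understand it; exact details may differ between SDK versions.

# Sketch: send the weather question along with the get_weather tool definition.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    tools=[
        {
            "name": "get_weather",
            "description": "Get the current weather in a given location.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    }
                },
                "required": ["location"],
            },
        }
    ],
    messages=[{"role": "user", "content": "What's the weather like in London?"}],
)

# If the model decided to use a tool, the response contains a structured
# tool_use block with the tool's name and arguments instead of plain prose.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)  # e.g. get_weather {'location': 'London, UK'}

Note that the tool definition lives in its own tools parameter, not buried inside the prompt text.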

Not All LLMs Are Created Equal

This discovery led to my next realization: not all models can handle a tools parameter. Tool calling is a specialized feature, not a universal LLM capability. My investigation revealed that many models, especially older ones or smaller open-source models, simply don’t support it. If you try to pass a tools argument to an unsupported model, you’ll get an error.

This is the core of the answer to my original question. The ability to use tools is a feature built into the model and its inference engine.

Model Family                               Tool Calling Support
GPT-4 / GPT-4o                             Yes
Claude 3 / 3.5                             Yes
Google Gemini                              Yes
Llama 3.1                                  Yes
Many older or smaller open-source models   No

This isn’t just a commercial feature, though. The open-source community has rapidly adopted it. Inference engines like Ollama now support tool calling with compatible models like Llama 3.1 and DeepSeek, often using an API format compatible with OpenAI’s, which makes it easier for developers to build applications that work with different models.
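As a sketch of what that compatibility buys you, here is the same weather tool pointed at a local Ollama server through its OpenAI-compatible endpoint, using the OpenAI Python client. The base URL, the llama3.1 model tag, and the assumption that Ollama is running locally are mine; adjust them for your setup.

# Sketch: call a locally served open-source model with the same tool definition.
from openai import OpenAI

# Ollama exposes an OpenAI-compatible API; the api_key value is ignored locally.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "What's the weather like in London?"}],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather in a given location.",
                "parameters": {
                    "type": "object",
                    "properties": {"location": {"type": "string"}},
                    "required": ["location"],
                },
            },
        }
    ],
)

# tool_calls is only populated when the model chose to call a tool.
print(response.choices[0].message.tool_calls)

The shape of the tool definition differs slightly (OpenAI nests it under a function key and calls the schema parameters), but the idea is identical.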

More Than Just Training

So, what enables these models to use tools? My initial thought was simple: “It’s all in the training.” While that’s a huge part of the story, it’s not the whole picture. The magic happens at the intersection of specialized training and a purpose-built inference engine.

1. The Training: Learning the Language of Tools

For a developer, it’s helpful to think of tool-calling not as an abstract capability, but as the model learning a new, highly structured “language.” This is typically achieved through a process called Supervised Fine-Tuning (SFT).

During this phase, the model is trained on a massive, curated dataset where each example consists of:

  • A user’s prompt (e.g., “What’s the weather in London?”).
  • A list of available tool definitions (the same tools schema we saw earlier).
  • The desired output: a perfectly formatted, syntactically correct tool call (e.g., {"name": "get_weather", "arguments": {"location": "London, UK"}}).

By processing millions of these examples, the model doesn’t just learn to talk about calling a function; it learns the precise syntax required to generate the call. It learns to map fuzzy, natural language intent to the rigid, structured format defined by the input_schema. This is what bridges the gap between understanding a request and producing a machine-executable action.
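To picture what one of these training examples might look like, here is an illustrative record expressed as a Python dictionary. This is my own sketch of the structure described above, not any provider’s actual fine-tuning format.

# Hypothetical shape of a single SFT example for tool calling.
sft_example = {
    "prompt": "What's the weather in London?",
    "tools": [
        {
            "name": "get_weather",
            "description": "Get the current weather in a given location.",
            "input_schema": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        }
    ],
    # The target output the model is trained to reproduce, token by token.
    "target": {"name": "get_weather", "arguments": {"location": "London, UK"}},
}

Millions of pairs like this teach the model to emit the target structure whenever the prompt and the available tools line up.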

2. The Inference Engine: The Runtime Orchestrator

This is the part I initially missed and is crucial for developers to understand. A trained model is just a set of weights; the inference engine is the software that runs it and brings it to life. For tool-calling, the inference engine is more than a simple text generator. It’s an active participant.

Here’s how it works at runtime:

  1. When you send a request with a tools parameter, the inference engine primes the model with this information.
  2. The model begins generating its response as usual. However, the inference engine is actively monitoring the output for a special signal—often a specific token or a structured format—that indicates the model wants to call a tool.
  3. When this signal is detected, the engine halts text generation.
  4. It then parses the generated text to extract the structured tool call (get_weather(...)).
  5. Finally, it packages this structured data and returns it to your application, pausing the conversation until it receives the tool’s result.

This is why not all models support tool use, even if they are powerful. The capability requires both a model trained to produce tool-calling syntax and an inference engine designed to recognize that syntax, halt generation, and hand control back to the developer’s code. It’s a symbiotic relationship between the model’s learned knowledge and the runtime environment’s logic.
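From the application’s point of view, that hand-off becomes a simple loop: send the request, check whether the model stopped to call a tool, run the tool, and feed the result back. Here is a sketch of that loop with the Anthropic Python SDK; the get_weather function is a stand-in, and the response fields may vary between SDK versions.

# Sketch: the orchestration loop the developer's application runs.
import anthropic

client = anthropic.Anthropic()

def get_weather(location: str) -> str:
    # Placeholder implementation; a real app would call a weather service here.
    return f"15°C and cloudy in {location}"

tools = [{
    "name": "get_weather",
    "description": "Get the current weather in a given location.",
    "input_schema": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}]

messages = [{"role": "user", "content": "What's the weather like in London?"}]
response = client.messages.create(
    model="claude-3-5-sonnet-20240620", max_tokens=1024,
    tools=tools, messages=messages,
)

# While the model keeps pausing to call tools, execute them and return results.
while response.stop_reason == "tool_use":
    tool_use = next(b for b in response.content if b.type == "tool_use")
    result = get_weather(**tool_use.input)

    messages.append({"role": "assistant", "content": response.content})
    messages.append({
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use.id,
            "content": result,
        }],
    })
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620", max_tokens=1024,
        tools=tools, messages=messages,
    )

print(response.content[0].text)  # the model's final natural-language answer

The model never executes anything itself; it only emits the structured call, and our loop decides what actually runs.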

The Practical Realities for a Developer

Understanding the “how” was one thing, but I also needed to know what this meant for building real applications. My research into the practical side of tool calling revealed a few key points.

First, even models trained for tool use aren’t perfect. They are probabilistic, meaning there’s always a small chance they will generate a malformed tool call or fail to call a tool when they should. Because of this, the responsibility for robust error handling and retries falls on the developer’s application, often called an “orchestrator.” Your code needs to validate the model’s output and decide what to do if it’s not what you expected.
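One lightweight way to handle this is to validate the model’s arguments against the tool’s own input_schema before executing anything, and to re-prompt on failure. The sketch below assumes the third-party jsonschema package and hypothetical registry and schemas dictionaries mapping tool names to callables and their schemas.

# Sketch: guard rails the orchestrator can put around a model-generated call.
from jsonschema import ValidationError, validate

def run_tool_call(name: str, arguments: dict, registry: dict, schemas: dict):
    if name not in registry:
        # The model hallucinated a tool that was never offered.
        raise ValueError(f"Model requested unknown tool: {name}")
    try:
        validate(instance=arguments, schema=schemas[name])
    except ValidationError as err:
        # Report back so the orchestrator can retry or re-prompt the model.
        raise ValueError(f"Malformed arguments for {name}: {err.message}") from err
    return registry[name](**arguments)

If validation fails, the orchestrator can send the error message back to the model and ask it to try again, rather than crashing or silently executing bad input.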

Second, the process is designed to be “atomic” from the developer’s perspective. When you make an API call, one of two things will happen: the model will respond with plain text, or it will respond with a structured request to call a tool. You never see the model’s “internal struggle” or a half-formed tool call. If the model decides not to use a tool, it’s as if the tool never existed for that turn. This design simplifies the developer’s job significantly.

Tying It All Together with MCP

Finally, my journey led me to the Model Context Protocol (MCP). I learned that MCP is a standardized protocol designed to make the connection between LLMs and tools more robust and interoperable. Think of it as a universal adapter for AI tools.

An MCP server can host a collection of tools, and any MCP-compatible client or LLM can discover and use them. In this ecosystem:

  • Tool Calling is the model’s ability to invoke a function.
  • MCP is the standardized framework that exposes those functions for the model to call.

In practice, your application can point to an MCP server, and the LLM will automatically get the list of available tools, complete with their schemas and descriptions. This makes managing a large set of tools much cleaner and allows different models to interact with the same toolset without custom integrations.
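To get a feel for the server side, here is a minimal sketch of exposing the same weather tool over MCP using the FastMCP helper from the official Python SDK (the mcp package). Treat the exact import path and decorator as assumptions; they may differ between SDK versions.

# Sketch: a tiny MCP server that advertises one tool to any compatible client.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather-tools")

@mcp.tool()
def get_weather(location: str) -> str:
    """Get the current weather in a given location."""
    # Placeholder implementation; a real server would query a weather service.
    return f"15°C and cloudy in {location}"

if __name__ == "__main__":
    mcp.run()  # serve the tool; clients discover its name, description, and schema

The function signature and docstring become the tool’s schema and description, which is exactly the metadata the model needs to decide when to call it.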

From Text Processor to Action-Taker

My initial view of LLMs as simple text-in, text-out machines was fundamentally incomplete. I now understand that the ability to use tools is a deliberate, trained capability that transforms these models from passive information processors into active agents that can interact with the digital world.

The key takeaways from my deep dive are clear:

  1. Tool calling is a native feature of specific models, enabled by a structured tools parameter in API calls.
  2. This ability comes from specialized training and fine-tuning, not just clever prompting.
  3. As developers, we are responsible for orchestrating the execution of tools and handling potential errors.
  4. Standards like MCP are creating a more interoperable ecosystem for connecting LLMs to external capabilities.

This journey has demystified the “magic” behind tool-calling LLMs. They still operate on text, but their training allows them to understand and generate a special kind of text—structured API calls—that empowers them to take meaningful action.