🤖 AI Explained

Structured Output and Tool Use

Getting reliable, machine-readable output from an LLM requires more than asking nicely. Structured output and tool use turn a text generator into a component your application can depend on.

Layer 1: Surface

LLMs return free-form text. Your application usually needs structured data.

The naive approach is to parse the text yourself; it breaks as soon as the model formats its output slightly differently. The robust approach is to constrain what the model is allowed to return, either by asking for a specific schema or by using tool use (also called function calling), where the model fills in typed function arguments rather than writing free text.

These are two points on the same spectrum:

| Pattern | What it does | Best for |
| --- | --- | --- |
| Prompted JSON | System prompt instructs the model to return JSON; you parse and validate | Simple extraction, low stakes |
| Tool use (also called function calling) | Model fills typed function arguments; add strict: true for API-level schema guarantees | Extraction, triggering actions, agents |

The key shift: instead of hoping the model formats its response correctly, you define the shape of valid output. With strict: true on a tool definition, the API enforces that shape: guaranteed types, required fields present, no unexpected properties.

Production Gotcha

Tool definitions count against your context window. An application with 30 registered tools spends tokens on those definitions on every single request, even when most tools are irrelevant. Define only the tools needed for the current task, or select a relevant subset dynamically based on user intent before sending the request.
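One way to do that subset selection is a cheap router in front of the model call. A minimal sketch, assuming keyword matching is good enough for your domain (the tool names, groups, and keyword lists here are all illustrative):

```python
# Illustrative sketch: route each request to a small tool subset
# before calling the model. Tool names and keywords are assumptions.
TOOL_GROUPS = {
    "weather": [{"name": "get_weather"}, {"name": "get_forecast"}],
    "billing": [{"name": "get_invoice"}, {"name": "issue_refund"}],
}

KEYWORDS = {
    "weather": ("weather", "forecast", "rain", "umbrella"),
    "billing": ("invoice", "refund", "charge", "billing"),
}

def select_tools(user_message: str) -> list[dict]:
    """Return only the tool definitions relevant to this request."""
    text = user_message.lower()
    selected = []
    for group, words in KEYWORDS.items():
        if any(w in text for w in words):
            selected.extend(TOOL_GROUPS[group])
    return selected  # possibly empty: then send no tools at all

# Pass tools=select_tools(msg) instead of every registered tool:
print([t["name"] for t in select_tools("Will it rain in Tokyo?")])
```

A classifier call or embedding similarity can replace the keyword match when intent is harder to detect; the point is that the model only ever sees the definitions it might actually need.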


Layer 2: Guided

Prompted JSON (simplest, least reliable)

Works for low-stakes extraction when you control the prompt carefully:

# --- pseudocode ---
import json

def extract_contact(text: str) -> dict:
    response = llm.chat(
        model="balanced",
        system=(
            "Extract contact information and return valid JSON only. "
            "No markdown, no explanation — raw JSON.\n\n"
            'Schema: {"name": string, "email": string or null, "phone": string or null}'
        ),
        messages=[{"role": "user", "content": text}],
        max_tokens=256,
    )
    return json.loads(response.text)

# In practice — Anthropic SDK
import anthropic
import json

client = anthropic.Anthropic()

def extract_contact(text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        system=(
            "Extract contact information and return valid JSON only. "
            "No markdown, no explanation — raw JSON.\n\n"
            'Schema: {"name": string, "email": string or null, "phone": string or null}'
        ),
        messages=[{"role": "user", "content": text}]
    )
    return json.loads(response.content[0].text)
    # OpenAI: response.choices[0].message.content | Gemini: response.text

Fragile: the model may wrap the JSON in a markdown code block, add a comment, or omit a field. Use this only when you have a reliable fallback for parse failures.

Tool use / function calling (structured output)

Tool use asks the model to invoke a typed function instead of writing text. The model returns structured arguments: no markdown stripping, no json.loads. Add strict: true to the tool definition to get API-level guarantees: correct types, all required fields present, no unexpected properties.

The examples below use Anthropic’s SDK. The concept is identical across providers, but the response shape differs: see the provider comparison table at the end of this layer.

import anthropic

client = anthropic.Anthropic()

# strict: True — the API guarantees types and required fields
tools = [
    {
        "name": "extract_contact",
        "description": "Extract contact information from text.",
        "strict": True,  # API-level schema enforcement
        "input_schema": {
            "type": "object",
            "properties": {
                "name":  {"type": "string", "description": "Full name"},
                "email": {"type": "string", "description": "Email address"},
                "phone": {"type": "string", "description": "Phone number including country code"},
            },
            "required": ["name"],
            "additionalProperties": False,  # required by strict mode
        },
    }
]

def extract_contact(text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        tools=tools,
        tool_choice={"type": "tool", "name": "extract_contact"},  # force this tool
        messages=[{"role": "user", "content": text}]
    )

    # The model's response will be a tool_use block, not text
    tool_use = next(b for b in response.content if b.type == "tool_use")
    return tool_use.input  # already a dict — no json.loads needed

Without strict: True, the model generally conforms to the schema but the API does not guarantee it: you may receive wrong types or missing required fields, and should validate before use.

tool_choice: {"type": "tool", "name": "extract_contact"} forces the model to call that specific tool. Without it, the model may choose to respond in text.

Executing real actions with tool use

Tool use becomes powerful when the tools actually do things. The model decides when and how to call a tool; your code executes it and returns the result:

import anthropic
import json

client = anthropic.Anthropic()

# Tool definitions
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city":    {"type": "string"},
                "country": {"type": "string", "description": "ISO 3166-1 alpha-2 country code"},
            },
            "required": ["city"],
        },
    },
    {
        "name": "get_forecast",
        "description": "Get a 5-day weather forecast for a city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "days": {"type": "integer", "minimum": 1, "maximum": 5},
            },
            "required": ["city"],
        },
    },
]

def run_tool(name: str, inputs: dict) -> str:
    """Execute the tool and return a result string."""
    if name == "get_weather":
        # Replace with a real weather API call
        return json.dumps({"city": inputs["city"], "temp_c": 18, "condition": "partly cloudy"})
    if name == "get_forecast":
        return json.dumps({"city": inputs["city"], "forecast": ["sunny", "cloudy", "rain", "sunny", "sunny"]})
    return json.dumps({"error": f"unknown tool: {name}"})

def chat_with_tools(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )

        if response.stop_reason == "end_turn":
            # Model responded in text — we're done
            return next(b.text for b in response.content if b.type == "text")

        if response.stop_reason == "tool_use":
            # Collect all tool calls in this turn
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = run_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })

            # Add the assistant's tool-call turn and the results to history
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
            # Loop — let the model continue with the tool results

chat_with_tools("What's the weather in Tokyo, and should I pack an umbrella for the next 5 days?")

This is the core loop behind most AI agents: model decides what to call → your code executes it → results go back to the model → repeat until end_turn.

Before vs After

String parsing is brittle:

# BAD: Fragile, breaks on any formatting variation
raw = llm_response.text
# What if the model says "The price is $42.50" vs "Price: 42.50" vs "42.5 USD"?
price = float(raw.split("$")[1].split()[0])  # crashes constantly in production

Tool use is robust:

# GOOD: Model fills a typed field; your code gets a float, always
tool_use = next(b for b in response.content if b.type == "tool_use")
price = tool_use.input["price"]  # typed, validated, no parsing

Common mistakes

  1. Not specifying tool_choice: When you need a specific tool called, set tool_choice explicitly. Without it, the model may answer in plain text.
  2. Over-broad tool descriptions: Vague descriptions like “do anything with files” confuse the model. Write descriptions as if documenting a public API: what it does, what it doesn’t do, and when to use it.
  3. Not handling end_turn in tool loops: If you build an agentic loop, always check stop_reason. A missing end_turn check creates an infinite loop.
  4. Ignoring required vs optional fields: Fields not in required may be omitted. Code that accesses them without a default will crash.
  5. Registering all tools for every request: See the production gotcha.

Provider comparison: function calling API shapes

The tool / function calling concept is the same across providers, but the request and response fields differ:

| | Anthropic | OpenAI |
| --- | --- | --- |
| Request field | tools | tools |
| Force a specific tool | tool_choice: {"type": "tool", "name": "..."} | tool_choice: {"type": "function", "function": {"name": "..."}} |
| Model chose a tool | stop_reason == "tool_use" | finish_reason == "tool_calls" |
| Tool call in response | block.type == "tool_use" (in response.content) | choice.message.tool_calls[n] |
| Tool call ID | block.id | tool_call.id |
| Tool arguments | block.input (dict) | tool_call.function.arguments (JSON string: parse it) |
| Result message role | "user" | "tool" |
| Result format | {"type": "tool_result", "tool_use_id": id, "content": "..."} | {"role": "tool", "tool_call_id": id, "content": "..."} |

The tool definition schema itself (name, description, parameters / input_schema using JSON Schema) is nearly identical: both providers follow the same JSON Schema vocabulary. The main structural difference is parameters (OpenAI) vs input_schema (Anthropic) as the key name.
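For comparison, here is the extract_contact tool expressed in OpenAI's request shape. This is a sketch of the definition and argument parsing only, not a full request; one OpenAI-specific detail worth noting is that its strict mode requires every property to be listed in required, with nullable types for the genuinely optional ones:

```python
import json

# The same extract_contact tool in OpenAI's tool-definition shape.
openai_tools = [{
    "type": "function",
    "function": {
        "name": "extract_contact",
        "description": "Extract contact information from text.",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "name":  {"type": "string"},
                "email": {"type": ["string", "null"]},
                "phone": {"type": ["string", "null"]},
            },
            # strict mode: every property must appear here;
            # optionality is expressed via the "null" type instead
            "required": ["name", "email", "phone"],
            "additionalProperties": False,
        },
    },
}]

def parse_tool_arguments(arguments: str) -> dict:
    """OpenAI returns function arguments as a JSON string;
    parse before use. (Anthropic's block.input is already a dict.)"""
    return json.loads(arguments)
```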


Layer 3: Deep Dive

Schema design

Tool input schemas follow JSON Schema. A few design principles that matter in practice:

Use enums to constrain free-text fields:

{
  "status": {
    "type": "string",
    "enum": ["open", "in_progress", "resolved", "closed"],
    "description": "Current ticket status"
  }
}

Without an enum, the model may invent valid-sounding but unexpected values ("pending", "active"). Enums collapse the output space to exactly the values your downstream code handles.

Be specific in descriptions:

{
  "date": {
    "type": "string",
    "description": "Date in ISO 8601 format (YYYY-MM-DD). Use today's date if the user says 'today'."
  }
}

The model reads descriptions at inference time. Precise descriptions reduce ambiguity and reduce the need to re-prompt when the model guesses wrong.

Mark genuinely optional fields as not required: If a field is truly optional, omit it from required. The model will include it when the information is present and omit it when it isn't, which is more reliable than having it guess a null or empty value.
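On the consuming side, default the optional fields explicitly instead of indexing into them. A minimal sketch against the extract_contact schema from earlier:

```python
def normalise_contact(inputs: dict) -> dict:
    """Read tool inputs whose optional fields may be absent."""
    return {
        "name":  inputs["name"],       # in `required`: always present
        "email": inputs.get("email"),  # optional: default to None
        "phone": inputs.get("phone"),
    }
```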

Multi-tool calls in a single turn

The model can call multiple tools in a single response when the tasks are independent. This is more efficient than a serial loop:

# The model may return multiple tool_use blocks in one response
# when it determines the calls are parallelisable
for block in response.content:
    if block.type == "tool_use":
        # Execute concurrently if your implementation supports it
        result = run_tool(block.name, block.input)
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": result,
        })

Treat each tool_use block independently; collect all results before returning them to the model as a single user turn.
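If your tools are I/O-bound (HTTP calls, database queries) and thread-safe, the independent calls in one turn can genuinely run concurrently. A sketch using a thread pool, with tool calls modeled as (id, name, input) tuples and a placeholder run_tool standing in for the dispatcher from earlier:

```python
import json
from concurrent.futures import ThreadPoolExecutor

def run_tool(name: str, inputs: dict) -> str:
    """Placeholder dispatcher: echo the call as JSON."""
    return json.dumps({"tool": name, **inputs})

def execute_parallel(tool_calls: list[tuple[str, str, dict]]) -> list[dict]:
    """Run all tool calls from one model turn concurrently,
    preserving order so results line up with their IDs."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(run_tool, name, inp) for _, name, inp in tool_calls]
        return [
            {"type": "tool_result", "tool_use_id": call_id, "content": f.result()}
            for (call_id, _, _), f in zip(tool_calls, futures)
        ]
```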

Error handling in tool loops

Tool execution can fail. Return errors to the model in the result: don’t throw exceptions that break the loop. The model can often recover:

def run_tool_safe(name: str, inputs: dict) -> str:
    try:
        return run_tool(name, inputs)
    except Exception as e:
        # Return the error as a tool result — the model can decide what to do
        return json.dumps({"error": str(e)})

Also implement a maximum turn limit. A misbehaving tool or ambiguous task can produce a loop where the model keeps calling tools without reaching end_turn. A limit of 10–20 turns covers the vast majority of legitimate workflows.
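The turn limit can be factored out of the loop itself. A sketch, with one model-call-plus-tool-execution round abstracted as a step function that returns the final answer, or None to continue (the names here are illustrative):

```python
def run_bounded(step, max_turns: int = 15) -> str:
    """Drive an agent loop with a hard cap on turns.
    `step` performs one model call + tool execution round and
    returns the final text answer, or None to keep going."""
    for _ in range(max_turns):
        answer = step()
        if answer is not None:
            return answer
    raise RuntimeError(f"agent exceeded {max_turns} turns without end_turn")
```

The same guard works inline: replace `while True:` in chat_with_tools with `for _ in range(MAX_TURNS):` and raise after the loop.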

Tool use vs fine-tuning for structured output

For applications that need consistent structured output, tool use with a schema is almost always preferable to fine-tuning:

| | Tool use | Fine-tuning |
| --- | --- | --- |
| Schema changes | Update the tool definition | Re-train and re-deploy |
| New fields | Add to schema | Collect examples, re-train |
| Validation | API-enforced | Must validate yourself |
| Cost | Standard inference | Training + higher per-token cost |

Fine-tuning for structured output makes sense only when the output has highly domain-specific patterns (e.g. a proprietary format) that are hard to express in a JSON schema or prompt.

The tool use ↔ agent relationship

Everything in the agents track builds on this loop:

user message
    ↓
model decides: respond or call tool?
    ↓ (tool)
your code executes the tool
    ↓
result returned to model
    ↓
model decides: respond or call another tool?
    ↓ (respond)
final answer to user

The sophistication of an agent is largely a function of the tools it has access to and how well those tools are defined. A well-designed tool interface is more valuable than a more capable model given a poorly defined one.


Structured Output and Tool Use: Check your understanding

Q1

What is the main advantage of using tool use over prompted JSON for structured extraction?

Q2

You want the model to always call a specific tool called 'classify_ticket'. What should you set in the API request?

Q3

Your tool-use loop runs but never terminates. The model keeps calling tools without producing a final answer. What is the most likely cause and the correct fix?

Q4

You have a tool with a 'priority' field that accepts values low, medium, or high. What schema feature best prevents the model from inventing unexpected values like 'urgent' or 'critical'?

Q5

Your application has 40 registered tools but any given user request only needs 2–3 of them. What is the production concern?