🤖 AI Explained

Testing and Reliability

Tool-using systems are hard to test because the interesting behavior emerges from the interaction between the model and the tools, not from either alone. This module covers the testing strategy that catches real failures: schema drift, unexpected model behavior, and integration regressions.

Layer 1: Surface

Testing a tool-using system has three distinct layers:

| Layer | What it tests | Model involved? | Speed |
|---|---|---|---|
| Unit | Tool implementation and schema validation | No | Fast (ms) |
| Contract | Schema agreement between mock and real API | No (uses real API samples) | Medium |
| Integration | End-to-end: model + tools together | Yes | Slow + costs tokens |

Most teams write only unit and integration tests, skipping contract tests. This creates a gap: unit tests pass (the mock is clean), integration tests pass (the model is well-behaved), but in production the real API returns something the mock never did.

All three layers are necessary. Unit tests give you a fast feedback loop on tool logic. Contract tests keep your mocks honest. Integration tests confirm the model uses your tools the way you expect.
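One lightweight way to keep the layers separable in CI is pytest markers. A sketch, assuming a `conftest.py` at the repo root (marker names are illustrative):

```python
# conftest.py: register one marker per test layer so each layer can run
# on its own cadence (marker names here are illustrative).
def pytest_configure(config):
    for marker in ("unit", "contract", "integration"):
        config.addinivalue_line("markers", f"{marker}: {marker}-layer test")

# Typical invocations:
#   pytest -m unit          # every commit: fast, no model, no network
#   pytest -m contract      # scheduled, against staging APIs
#   pytest -m integration   # pre-release: real model, costs tokens
```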


Layer 2: Guided

Unit tests for tool implementations

import re
import pytest
from unittest.mock import MagicMock

# The tool implementation: validates format before calling the API.
# (api_client is assumed to be a module-level HTTP client.)
def get_customer_order(order_id: str) -> dict:
    if not re.match(r"^ORD-\d{8}$", order_id):
        raise ValueError(f"Invalid order_id: {order_id!r}. Expected format: ORD-XXXXXXXX")
    response = api_client.get(f"/orders/{order_id}")
    if response.status_code == 404:
        return {"error": "Order not found", "order_id": order_id}
    return response.json()

# Unit tests: mock the API, test the tool's logic
class TestGetCustomerOrder:
    def test_returns_order_data(self, mock_api):
        mock_api.get.return_value = MagicMock(
            status_code=200,
            json=lambda: {"id": "ORD-00000001", "status": "shipped", "total": 49.99}
        )
        result = get_customer_order("ORD-00000001")
        assert result["status"] == "shipped"
        assert result["total"] == 49.99

    def test_handles_not_found(self, mock_api):
        mock_api.get.return_value = MagicMock(status_code=404, json=lambda: {})
        result = get_customer_order("ORD-00000001")  # valid format, 404 response
        assert "error" in result
        assert result["order_id"] == "ORD-00000001"

    def test_rejects_invalid_order_id_format(self):
        with pytest.raises(ValueError, match="Invalid order_id"):
            get_customer_order("not-an-order-id")
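The tests above rely on a `mock_api` fixture. A minimal sketch of that fixture, assuming the tool module is importable as `tools` and exposes the module-level `api_client` (both names are assumptions; adjust to your layout):

```python
# conftest.py: patch the API client where the tool code looks it up.
import pytest
from unittest.mock import MagicMock

@pytest.fixture
def mock_api(monkeypatch):
    mock = MagicMock()
    # "tools.api_client" is an assumed import path, not defined in this module.
    monkeypatch.setattr("tools.api_client", mock, raising=False)
    return mock
```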

Mock tool setup for model tests

To test model behavior without incurring real API costs or tool side effects:

from typing import Callable

class MockTool:
    """A configurable mock that records calls and returns preset responses."""

    def __init__(self, name: str, schema: dict, responses: list | Callable):
        self.name = name
        self.schema = schema
        self._responses = responses if callable(responses) else iter(responses)
        self.calls: list[dict] = []

    def __call__(self, **kwargs) -> str:
        self.calls.append(kwargs)
        if callable(self._responses):
            return self._responses(**kwargs)
        return next(self._responses)

    def as_tool_schema(self) -> dict:
        return self.schema

# Build a mock tool set for a test
def make_test_tools():
    search_mock = MockTool(
        name="search_knowledge_base",
        schema={
            "name": "search_knowledge_base",
            "description": "Search the knowledge base for relevant documents.",
            "input_schema": {
                "type": "object",
                "properties": {"query": {"type": "string", "description": "Search query"}},
                "required": ["query"],
            },
        },
        responses=[
            '{"results": [{"id": "doc-1", "text": "Refunds take 5-7 business days."}]}',
            '{"results": []}',
        ],
    )
    return {search_mock.name: search_mock}, [search_mock.as_tool_schema()]

# Test that the model selects the right tool for a refund query.
# This test verifies the model's tool selection decision, not tool execution.
# The mock records calls only when executed via the full agentic loop (integration tests).
def test_refund_query_selects_search_tool():
    tools_dict, tool_schemas = make_test_tools()

    response = llm.chat(
        model="balanced",
        messages=[{"role": "user", "content": "How long do refunds take?"}],
        tools=tool_schemas,
    )

    assert response.stop_reason == "tool_use"
    tool_call = next(tc for tc in response.tool_calls if tc.name == "search_knowledge_base")
    assert "refund" in tool_call.arguments.get("query", "").lower()

Contract tests

Contract tests verify that your mock matches what the real API actually returns:

import pytest
import httpx

# The contract: what fields and types does the real API return?
ORDER_CONTRACT = {
    "id": str,
    "status": str,
    "total": (int, float),
    "created_at": str,
    "line_items": list,
}

@pytest.mark.contract  # Mark to skip in unit test runs; run in staging CI
def test_order_api_contract():
    """Verify real API response matches our mock's structure."""
    response = httpx.get(
        "https://api.example.com/v1/orders/ORD-00000001",
        headers={"Authorization": f"Bearer {TEST_API_KEY}"},
    )
    assert response.status_code == 200
    data = response.json()

    for field, expected_type in ORDER_CONTRACT.items():
        assert field in data, f"Missing field: {field}"
        assert isinstance(data[field], expected_type), (
            f"Field {field}: expected {expected_type}, got {type(data[field])}"
        )

Run contract tests in a staging environment on a schedule (daily or weekly), not just on deployment. APIs change without notice.
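The same contract check can also run against the mock's canned response, so mock and contract cannot drift apart unnoticed. A sketch, where `MOCK_ORDER_RESPONSE` is an assumed shared fixture (the contract is repeated so the example stands alone):

```python
ORDER_CONTRACT = {  # same contract as defined above
    "id": str, "status": str, "total": (int, float),
    "created_at": str, "line_items": list,
}

# The canned response your unit-test mocks return (assumed shared fixture).
MOCK_ORDER_RESPONSE = {
    "id": "ORD-00000001", "status": "shipped", "total": 49.99,
    "created_at": "2024-01-01T00:00:00Z", "line_items": [],
}

def assert_matches_contract(data: dict, contract: dict) -> None:
    for field, expected_type in contract.items():
        assert field in data, f"Missing field: {field}"
        assert isinstance(data[field], expected_type), f"Bad type for field: {field}"

def test_mock_matches_contract():
    # Runs in the fast unit suite; fails as soon as mock and contract diverge.
    assert_matches_contract(MOCK_ORDER_RESPONSE, ORDER_CONTRACT)
```

This catches the classic silent drift, such as a price field changing from number to string, at unit-test speed.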

Replay traces

Record real tool interactions and replay them as test fixtures:

import json
from pathlib import Path

class ToolCallRecorder:
    """Wraps tool execution and records call/response pairs."""

    def __init__(self, output_dir: str = "test_fixtures"):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)
        self._trace: list[dict] = []

    def record(self, name: str, arguments: dict, result: str):
        self._trace.append({
            "tool": name,
            "arguments": arguments,
            "result": result,
        })

    def save(self, scenario_name: str):
        path = self.output_dir / f"{scenario_name}.json"
        path.write_text(json.dumps(self._trace, indent=2))

class TraceReplayer:
    """Replays a recorded trace without calling real tools."""

    def __init__(self, fixture_path: str):
        self._calls = json.loads(Path(fixture_path).read_text())
        self._index = 0

    def execute(self, name: str, arguments: dict) -> str:
        entry = self._calls[self._index]
        assert entry["tool"] == name, f"Expected {entry['tool']}, got {name}"
        self._index += 1
        return entry["result"]

Replay tests catch regressions when you change model, prompt, or tool schema: if the recorded trace no longer matches what the model requests, the test fails.

Failure injection

Test how your system behaves when tools fail:

import json
import random
from typing import Callable

class FaultInjectingTool:
    """Wraps a tool and injects configurable faults."""

    def __init__(self, real_tool: Callable, error_rate: float = 0.3):
        self.real_tool = real_tool
        self.error_rate = error_rate

    def __call__(self, **kwargs) -> str:
        if random.random() < self.error_rate:
            fault = random.choice([
                TimeoutError("Tool timed out"),
                "Error: service temporarily unavailable",
                "Error: rate limit exceeded; try again in 30s",
                json.dumps({"error": "internal_error", "code": 500}),
            ])
            if isinstance(fault, Exception):
                raise fault
            return fault
        return self.real_tool(**kwargs)

# Use in tests to verify graceful degradation
def test_session_handles_tool_timeout():
    faulty_search = FaultInjectingTool(real_search_tool, error_rate=1.0)
    result = run_tool_loop(
        "What is our refund policy?",
        tools=[make_schema(faulty_search)],
        tool_executor=faulty_search,
    )
    # System should still return something useful, not crash
    assert result is not None
    assert "unavailable" in result.lower() or "unable" in result.lower()

Go-live checklist

# Pre-launch verification script
import sys

checks = []

def check(name: str, condition: bool, severity: str = "MUST"):
    checks.append({"name": name, "passed": condition, "severity": severity})
    status = "✓" if condition else "✗"
    print(f"  [{status}] {name}")

print("=== Tool Integration Go-Live Checklist ===\n")

print("Schema & Validation")
check("All tools have non-empty descriptions", all_tools_have_descriptions())
check("All required parameters are documented", all_required_params_documented())
check("Schema validation runs before tool execution", schema_validation_enabled())

print("\nSecurity")
check("External content is delimited in all tool results", external_content_delimited())
check("Tool permission checks are in place", permission_checks_enabled())
check("No secrets in tool schemas or descriptions", no_secrets_in_schemas())

print("\nResilience")
check("Per-tool timeouts configured", timeouts_configured())
check("Circuit breakers enabled for all external dependencies", circuit_breakers_enabled())
check("Graceful degradation tested for all optional tools", degradation_tested())

print("\nObservability")
check("Tool call logging enabled", tool_logging_enabled())
check("Per-session cost tracking enabled", cost_tracking_enabled())
check("Error rate alerts configured", alerts_configured())

print("\nTests")
check("Unit tests passing", unit_tests_pass(), severity="MUST")
check("Contract tests passing against staging APIs", contract_tests_pass(), severity="MUST")
check("Integration tests run against staging with real model", integration_tests_pass(), severity="MUST")

failures = [c for c in checks if not c["passed"] and c["severity"] == "MUST"]
if failures:
    print(f"\n✗ {len(failures)} MUST checks failed; not ready for launch")
    sys.exit(1)
else:
    print("\n✓ All MUST checks passed")

Layer 3: Deep Dive

Property-based testing for schemas

Property-based tests generate hundreds of random inputs and verify invariants hold:

import jsonschema
import pytest
from hypothesis import given, strategies as st

@given(
    query=st.text(min_size=1, max_size=500),
    limit=st.integers(min_value=1, max_value=50),
)
def test_search_never_crashes_on_valid_inputs(query, limit):
    """For any valid input, search should return a result or a structured error โ€” never raise."""
    result = search_knowledge_base(query=query, limit=limit)
    assert isinstance(result, (str, dict))
    if isinstance(result, dict):
        assert "results" in result or "error" in result

@given(limit=st.integers().filter(lambda x: x < 1 or x > 50))
def test_search_rejects_out_of_range_limit(limit):
    """Limits outside 1-50 should be caught by schema validation."""
    with pytest.raises((ValueError, jsonschema.ValidationError)):
        search_knowledge_base(query="test", limit=limit)

Property tests surface edge cases that example-based tests miss: empty strings, unicode, very long inputs, boundary values.

Staging strategy

Development → Staging → Canary → Production

| Stage | What runs | Real model? | Real tools? |
|---|---|---|---|
| Development | Unit + contract tests | Mock | Mock |
| Staging | Full integration suite | Real (small model) | Real (test accounts) |
| Canary | 5% of production traffic | Real | Real |
| Production | 100% of traffic | Real | Real |

In staging, use real tool APIs but against test/sandbox accounts. Never point staging at production data stores: tool call errors in staging can corrupt production data.
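One way to enforce that rule in code: resolve tool endpoints from the deployment stage and fail fast if a non-production stage would hit a production host. A sketch with illustrative hostnames:

```python
TOOL_ENDPOINTS = {
    "development": "http://localhost:8080",
    "staging": "https://sandbox.api.example.com",  # test accounts only
    "production": "https://api.example.com",
}

def endpoint_for(stage: str) -> str:
    url = TOOL_ENDPOINTS[stage]
    # Guard: non-production stages must never point at a production host.
    if stage != "production" and "api.example.com" in url and "sandbox" not in url:
        raise RuntimeError(f"Stage {stage!r} must not target a production host")
    return url
```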

Eval sets for tool-using systems

Beyond standard integration tests, maintain a set of scenarios that verify end-to-end model + tool behavior:

TOOL_EVAL_SET = [
    {
        "id": "order_lookup",
        "input": "What's the status of order ORD-00000001?",
        "expected_tools": ["get_customer_order"],
        "expected_args_contain": {"order_id": "ORD-00000001"},
        "answer_contains": ["shipped", "delivered", "processing"],
    },
    {
        "id": "unknown_order",
        "input": "Check order ORD-INVALID",
        "expected_tools": ["get_customer_order"],
        "answer_contains": ["not found", "unable to locate", "doesn't exist"],
    },
    {
        "id": "no_tool_needed",
        "input": "What are your business hours?",
        "expected_tools": [],  # Model should answer from context, not call a tool
        "answer_contains": ["hours", "open"],
    },
]

Run the eval set on every model upgrade, prompt change, or tool schema change. A drop in tool selection accuracy or answer quality is a signal to investigate before rolling out.
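A sketch of a runner over `TOOL_EVAL_SET`; `run_scenario` stands in for your agentic loop and is assumed to return the list of tool names called plus the final answer text:

```python
def score_eval_set(eval_set: list[dict], run_scenario) -> list[tuple[str, str]]:
    """Return (case id, reason) for each failing case; empty list means all passed."""
    failures = []
    for case in eval_set:
        called_tools, answer = run_scenario(case["input"])
        if sorted(called_tools) != sorted(case["expected_tools"]):
            failures.append((case["id"], f"called {called_tools}"))
        elif not any(p in answer.lower() for p in case["answer_contains"]):
            failures.append((case["id"], "answer missing expected phrase"))
    return failures
```

Tracking the failure count per release gives you the trend line: a sudden jump after a model or prompt change is the investigation signal described above.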


Testing and Reliability: Check your understanding

Q1

Your unit tests use a mock that returns {"price": 9.99} (number). Six months later, the real API silently changes to return {"price": "9.99"} (string). Your unit tests still pass. What type of test would have caught this?

Q2

You want to verify that when the search_knowledge_base tool fails, the model still returns a useful response rather than crashing. What test approach is most direct?

Q3

You record a production tool trace (sequence of tool calls and responses for a real user session) and replay it in tests. A new prompt change causes the model to call tools in a different order. What does this tell you?

Q4

Your tool integration go-live checklist includes 'integration tests passing against staging with real model.' A team member suggests skipping this to save token costs. What is the risk?

Q5

A property-based test generates random string inputs for a tool's query parameter and finds that the string 'SELECT * FROM users' causes the tool to return an internal database error message. What does this indicate?