Runtime Architecture
An AI agent is three layers at runtime. Map them to your infra and everything else follows.
┌─────────────────────────────────────────┐
│ LLM (API call)                          │
│ Stateless. No session affinity.         │
│ Input: tokens → Output: tokens/tool_use │
└──────────────┬──────────────────────────┘
               │ HTTPS (API-hosted) or
               │ local inference (self-hosted)
     ┌─────────▼──────────┐
     │  Host Application  │ ← Your process
     │ (Claude Code, etc) │
     └─────┬─────────┬────┘
           │         │
    Native tools   MCP Servers
   (subprocesses)  (stdio: child process)
                   (HTTP+SSE: remote service)
Key property: The LLM is stateless. Every API call is a fresh forward pass. No session state, no sticky routing, no connection pooling. This is operationally excellent — horizontal scaling is trivial.
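A minimal sketch of what statelessness means for the host: the full message history goes out on every call, so any replica can serve any request. `call_llm` and the payload shape are illustrative stand-ins, not a specific vendor's API.

```python
# The LLM holds no session state, so the host owns the conversation:
# every request carries the full message history. `call_llm` is a
# placeholder for whatever API client you use; only the payload shape
# (and the fact that it is complete) matters here.
def call_llm(payload: dict) -> dict:
    # A real client would POST `payload` to the API; this stub just
    # acknowledges how many messages arrived.
    return {"role": "assistant",
            "content": f"(reply to {len(payload['messages'])} messages)"}

def build_payload(history: list[dict]) -> dict:
    # The entire history goes out on the wire each turn -- no sticky
    # routing, no server-side session to look up.
    return {"model": "example-model", "messages": list(history)}

history = [{"role": "user", "content": "List pods in production"}]
history.append(call_llm(build_payload(history)))
history.append({"role": "user", "content": "Now just the failing ones"})

# Turn 2 resends everything: 3 messages, not 1.
print(len(build_payload(history)["messages"]))  # → 3
```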
Layer 1: Tools — The Action Interface
Tools are structured function calls: the model emits a JSON request, the host executes it and returns the result.
What you care about:
- Process model: Native tools (bash, file operations) run as subprocesses of the host. They inherit the host’s user permissions, environment, and filesystem access.
- Trust boundary: The model proposes, the host disposes. Every tool call passes through a permission layer before execution. This is your primary security control.
- Blast radius: More tools = more capability = wider attack surface. Tool scoping is access control.
- Observability: Every tool call is a structured event (name, args, result, duration). Trivial to log, alert on, and audit.
tool_use event:
  name: "bash"
  args: { "command": "kubectl get pods -n production" }
  duration_ms: 340
  result_size_bytes: 2048
  is_error: false
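Since every call passes through the host, emitting the event above as a JSON log line takes a few lines of code. A sketch (field names follow the event above; the function itself is hypothetical, not any real host's API):

```python
import json
import time

def log_tool_use(name: str, args: dict, result: bytes,
                 is_error: bool, started: float) -> str:
    """Serialize one tool call as a structured JSON log line."""
    event = {
        "event": "tool_use",
        "name": name,
        "args": args,
        "duration_ms": int((time.monotonic() - started) * 1000),
        "result_size_bytes": len(result),
        "is_error": is_error,
    }
    return json.dumps(event)

# One log line per tool call -- trivially greppable, alertable, auditable.
started = time.monotonic()
line = log_tool_use("bash",
                    {"command": "kubectl get pods -n production"},
                    b"x" * 2048, False, started)
print(line)
```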
Layer 2: Skills — Configuration as Context
Skills are markdown files loaded into the LLM’s context before specific tasks. From an ops perspective, they’re configuration artifacts.
What you care about:
- File I/O at request time: Each skill load is a filesystem read. Negligible latency (~1-5ms) but it happens per-request.
- Token budget: A 2,000-word skill costs ~2,500 tokens of the context window. This is a resource that competes with conversation history and tool results.
- Version control: Skill files go in the repo. Changes should be code-reviewed — a bad skill degrades output quality team-wide.
- No runtime dependencies: No database, no embedding service, no retrieval pipeline. Just files on disk. Zero infrastructure.
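The token-budget arithmetic above can become a pre-flight check before loading skills. A sketch using the ~1.25 tokens-per-word ratio implied by the text (both the ratio and the 8,000-token default budget are rough assumptions):

```python
import tempfile
from pathlib import Path

# Rule of thumb from the text: a 2,000-word skill costs ~2,500 tokens,
# i.e. roughly 1.25 tokens per word. This is an approximation, not a
# real tokenizer.
TOKENS_PER_WORD = 1.25

def skill_token_cost(path: Path) -> int:
    """Estimate how many context-window tokens a skill file consumes."""
    words = len(path.read_text().split())
    return int(words * TOKENS_PER_WORD)

def fits_budget(paths: list[Path], budget: int = 8_000) -> bool:
    # Skills compete with conversation history and tool results for the
    # context window; reject a load-out that eats too much of it.
    return sum(skill_token_cost(p) for p in paths) <= budget

# Demo: a 2,000-word skill file lands right on the ~2,500-token estimate.
demo = Path(tempfile.mkdtemp()) / "deploy-checklist.md"
demo.write_text("word " * 2000)
print(skill_token_cost(demo))  # → 2500
```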
Layer 3: MCP — The Integration Protocol
MCP (Model Context Protocol) standardizes how the host connects to external systems. Two transport modes, each with different infra implications.
stdio (local)
Host process
└── spawns MCP server as child process
    └── communicates via stdin/stdout pipes
- Lifecycle: server lives and dies with the host
- Security: runs as the host’s user — inherits all permissions
- Use for: local filesystem, CLI tools, development
- Monitoring: process-level (check if child is alive, stderr for logs)
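The transport mechanics are easy to demonstrate: spawn a child, speak newline-delimited JSON over its pipes. Real MCP servers speak JSON-RPC 2.0; the toy child below only echoes the method name, but the plumbing and the lifecycle property are the same:

```python
import json
import subprocess
import sys

# Toy stand-in for an MCP server: reads one JSON line from stdin,
# answers on stdout. Real MCP is JSON-RPC 2.0 with a defined method
# set; only the transport is being illustrated here.
CHILD = r"""
import json, sys
for line in sys.stdin:
    req = json.loads(line)
    resp = {"id": req["id"], "result": {"echo": req["method"]}}
    print(json.dumps(resp), flush=True)
"""

# Host spawns the server as a child process, wired to its pipes.
proc = subprocess.Popen(
    [sys.executable, "-c", CHILD],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
proc.stdin.write(json.dumps({"id": 1, "method": "tools/list"}) + "\n")
proc.stdin.flush()
resp = json.loads(proc.stdout.readline())
print(resp["result"]["echo"])  # → tools/list

proc.terminate()  # the child lives and dies with the host
```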
HTTP + SSE (remote)
Host process
├── HTTP POST to /message (requests)
└── SSE stream from /sse (responses)
    └── MCP server (remote, multi-client)
- Lifecycle: independent — runs as a service (systemd, container, pod)
- Security: needs auth (OAuth 2.0 recommended), TLS, network policy
- Use for: shared services, remote infra, multi-tenant
- Monitoring: standard HTTP observability (latency, error rates, connection health)
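Because a remote MCP server is just an HTTP service, you monitor it like one. A sketch of a health probe, the same check you would wire into a K8s liveness/readiness probe (the `/healthz` endpoint and the throwaway local server are illustrative):

```python
import http.server
import json
import threading
import urllib.request

# Throwaway local server standing in for a remote MCP service.
class Health(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

srv = http.server.HTTPServer(("127.0.0.1", 0), Health)
threading.Thread(target=srv.serve_forever, daemon=True).start()

def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True iff the service answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

ok = probe(f"http://127.0.0.1:{srv.server_port}/healthz")
print(ok)  # → True
srv.shutdown()
```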
What MCP servers expose
| Primitive | Controlled by | Ops implication |
|---|---|---|
| Tools | Model (autonomous) | Actions the AI takes — audit these |
| Resources | Host app | Data the AI reads — scope carefully |
| Prompts | User | Templates — low risk, user-initiated |
Trust Boundaries
┌─ Trust boundary 1: What the model can REQUEST ───┐
│ Defined by: tool definitions exposed to the LLM  │
└─────────────────┬────────────────────────────────┘
                  │
┌─ Trust boundary 2: What the host EXECUTES ───────┐
│ Defined by: permission layer in the host app     │
│ (Claude Code prompts the user for risky actions) │
└─────────────────┬────────────────────────────────┘
                  │
┌─ Trust boundary 3: What the service ALLOWS ──────┐
│ Defined by: MCP server scoping, OAuth scopes,    │
│ database user permissions, K8s RBAC, etc.        │
└──────────────────────────────────────────────────┘
Defense in depth. The model can only request what tools are defined. The host can reject requests. The underlying service has its own auth. Three layers of control.
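The three boundaries compose into a single gate in the host. A sketch with illustrative policies (the tool set, denylist, and scope names are made up for the example, not any real host's configuration):

```python
# Boundary 1: what exists -- the model cannot request an undefined tool.
DEFINED_TOOLS = {"bash", "read_file"}
# Boundary 2: host policy -- the permission layer rejects risky actions.
HOST_DENYLIST = {"rm -rf", "kubectl delete"}
# Boundary 3: service-side auth -- e.g. OAuth scopes, DB grants, RBAC.
SERVICE_SCOPES = {"bash": {"read"}}

def gate(tool: str, command: str, needed_scope: str) -> bool:
    """Pass a proposed tool call through all three trust boundaries."""
    if tool not in DEFINED_TOOLS:
        return False  # boundary 1: not even requestable
    if any(bad in command for bad in HOST_DENYLIST):
        return False  # boundary 2: host refuses to execute
    if needed_scope not in SERVICE_SCOPES.get(tool, set()):
        return False  # boundary 3: service would deny it anyway
    return True

print(gate("bash", "kubectl get pods", "read"))      # → True
print(gate("bash", "kubectl delete pod x", "read"))  # → False
```

Any one layer failing closed is enough to stop the action, which is the point of defense in depth.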
Deployment Topology
Development / single-user:
Laptop
├── Claude Code (host)
│   ├── native tools (bash, file I/O)
│   └── MCP servers (stdio, child processes)
└── LLM API calls → Anthropic cloud
Production / shared:
K8s cluster
├── AI agent pod (host)
│   ├── native tools (sandboxed)
│   └── MCP clients → MCP server pods (HTTP+SSE)
│       ├── github-mcp (Deployment, 2 replicas)
│       ├── postgres-mcp (Deployment, 1 replica)
│       └── internal-api-mcp (Deployment, 3 replicas)
└── LLM API calls → Anthropic cloud (or local Ollama)
MCP servers in production are just services. Deploy them like you deploy anything else — containers, health checks, resource limits, network policies.
Key Takeaways
- The LLM is stateless — no session affinity, trivial to scale
- Tools are structured events — easy to log, audit, and gate
- Skills are config files — version control them like .eslintrc
- MCP servers are services — deploy them like microservices
- Three trust boundaries give you defense in depth