🤖 AI Explained

Human-in-the-Loop

Human oversight is not a bolt-on safety feature: it is an architectural primitive that determines what an agent is permitted to do autonomously and what requires a human decision. This module covers the design of approval gates, interrupt points, confidence escalation, and audit trails that make human oversight practical at scale.

Layer 1: Surface

An agent without human oversight is only as safe as its worst tool call. Human-in-the-loop (HITL) is the architecture that defines where humans remain in the decision chain.

Three mechanisms, each serving a different purpose:

| Mechanism | What it does | When to use it |
|---|---|---|
| Approval gate | Agent pauses and waits for explicit human approval before proceeding | Before irreversible or high-cost actions |
| Confidence escalation | Agent proceeds autonomously when confident; routes to human when uncertain | When task quality degrades below a threshold |
| Interrupt point | Human can pause or cancel an in-flight agent session | Long-running tasks where requirements may change |

None of these are binary: they exist on a spectrum, and the right combination depends on the risk profile of the task and the cost of human review time.


Layer 2: Guided

Approval gates

An approval gate pauses the agent and sends a structured request for human review:

import asyncio
from dataclasses import dataclass
from enum import Enum

class ApprovalDecision(Enum):
    APPROVED = "approved"
    REJECTED = "rejected"
    MODIFIED = "modified"

@dataclass
class ApprovalRequest:
    action_id: str
    agent_id: str
    action_type: str
    description: str         # human-readable summary of what will happen
    payload: dict            # the actual action parameters
    reversibility: str       # "reversible", "partially_reversible", "irreversible"
    estimated_impact: str    # "low", "medium", "high"

async def request_approval(request: ApprovalRequest) -> tuple[ApprovalDecision, dict]:
    """Submit an approval request and wait for a human decision."""
    approval_id = await approval_queue.submit(request)

    # Wait up to 5 minutes for approval
    try:
        decision = await asyncio.wait_for(
            approval_queue.wait_for_decision(approval_id),
            timeout=300.0,
        )
        return decision.status, decision.modifications or {}
    except asyncio.TimeoutError:
        await approval_queue.expire(approval_id)
        raise ApprovalTimeoutError(f"Approval request {approval_id} timed out")

# Integrate into the tool execution path
APPROVAL_REQUIRED = {
    "delete_record":   "irreversible",
    "send_email":      "partially_reversible",
    "publish_content": "partially_reversible",
    "modify_config":   "reversible",
}

async def execute_with_gate(tool_name: str, arguments: dict, agent_id: str) -> str:
    reversibility = APPROVAL_REQUIRED.get(tool_name)
    if reversibility:
        request = ApprovalRequest(
            action_id=generate_id(),
            agent_id=agent_id,
            action_type=tool_name,
            description=build_human_description(tool_name, arguments),
            payload=arguments,
            reversibility=reversibility,
            estimated_impact=assess_impact(tool_name, arguments),
        )
        decision, modifications = await request_approval(request)

        if decision == ApprovalDecision.REJECTED:
            return f"Action '{tool_name}' was rejected by the reviewer: {modifications.get('reason', 'no reason given')}"
        if decision == ApprovalDecision.MODIFIED:
            arguments = {**arguments, **modifications}

    return TOOL_REGISTRY[tool_name](**arguments)
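
execute_with_gate lets ApprovalTimeoutError propagate, and what the caller does with it is a policy decision in its own right. One fail-closed sketch, using stand-in names (ApprovalTimeoutError here is a local stand-in for the exception raised above):

```python
class ApprovalTimeoutError(Exception):
    """Raised when no reviewer decision arrives before the deadline."""

# Hypothetical timeout policy: proceed only when the action can be
# fully undone; treat a silent reviewer as a "no" for everything else.
TIMEOUT_POLICY = {
    "reversible": "auto_proceed",
    "partially_reversible": "reject",
    "irreversible": "reject",
}

def decide_on_timeout(reversibility: str) -> str:
    # Unknown actions are treated as irreversible: fail closed.
    return TIMEOUT_POLICY.get(reversibility, "reject")
```

The asymmetry is deliberate: a missed auto-proceed on a reversible action costs a retry, while a missed rejection on an irreversible one cannot be taken back.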

Confidence-based escalation

Rather than requiring approval for specific action types, escalate when the agent’s confidence is low:

def estimate_confidence(response_text: str, task_context: str) -> float:
    """Ask a fast model to assess the agent's confidence in its output."""
    check = llm.chat(
        model="fast",
        messages=[{
            "role": "user",
            "content": f"""Task context: {task_context}

Agent response: {response_text[:500]}

Rate the agent's confidence in this response on a scale from 0.0 to 1.0.
Consider: does it express uncertainty? Are claims specific and verifiable?
Is the reasoning complete?

Output only a number between 0.0 and 1.0."""
        }]
    )
    try:
        # Clamp so stray outputs like "1.2" can't escape the [0, 1] range
        return max(0.0, min(1.0, float(check.text.strip())))
    except ValueError:
        return 0.5  # default to middle if parsing fails

ESCALATION_THRESHOLDS = {
    "low_stakes":    0.4,   # Only escalate if very uncertain
    "medium_stakes": 0.65,
    "high_stakes":   0.85,  # Escalate unless highly confident
}

def run_with_escalation(task: str, stake_level: str = "medium_stakes") -> str:
    result = run_agent(task)
    confidence = estimate_confidence(result, task)
    threshold = ESCALATION_THRESHOLDS[stake_level]

    if confidence < threshold:
        return escalate_to_human(
            task=task,
            agent_result=result,
            confidence=confidence,
            reason=f"Confidence {confidence:.2f} below threshold {threshold}",
        )
    return result

Async interrupt points

For long-running agent sessions, expose control points that allow humans to pause, inspect, or cancel:

class InterruptibleAgent:
    def __init__(self, session_id: str):
        self.session_id = session_id
        self._pause_event = asyncio.Event()
        self._cancel_event = asyncio.Event()
        self._pause_event.set()  # Start unpaused

    async def check_interrupt(self):
        """Call this at the start of each iteration."""
        if self._cancel_event.is_set():
            raise AgentCancelled(f"Session {self.session_id} was cancelled by operator")
        if not self._pause_event.is_set():
            await status_store.set(self.session_id, "paused")
            await self._pause_event.wait()  # Block until resumed
            await status_store.set(self.session_id, "running")

    def pause(self):
        self._pause_event.clear()

    def resume(self):
        self._pause_event.set()

    def cancel(self):
        self._cancel_event.set()
        self._pause_event.set()  # Unblock if paused so cancellation is detected

    async def run(self, goal: str, tools: list[dict]) -> str:
        messages = [{"role": "user", "content": goal}]
        for step in range(10):
            await self.check_interrupt()
            response = llm.chat(model="balanced", messages=messages, tools=tools)
            if response.stop_reason == "end_turn":
                return response.text
            # ... tool execution ...
        return "Task incomplete."

Expose pause(), resume(), and cancel() through an admin API or UI so operators can intervene in real time.
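
As a minimal sketch of that control surface, assuming an in-memory registry of live sessions (SessionRegistry and all names here are hypothetical; a real admin API would add authentication and persistence):

```python
class SessionRegistry:
    """Maps session IDs to live agent objects so operators can reach them."""

    CONTROLS = {"pause", "resume", "cancel"}

    def __init__(self):
        self._sessions: dict[str, object] = {}

    def register(self, session_id: str, agent) -> None:
        self._sessions[session_id] = agent

    def command(self, session_id: str, action: str) -> str:
        if action not in self.CONTROLS:
            return f"unsupported action: {action}"
        agent = self._sessions.get(session_id)
        if agent is None:
            return f"unknown session: {session_id}"
        getattr(agent, action)()  # calls agent.pause() / .resume() / .cancel()
        return f"{action} sent to {session_id}"
```

An admin endpoint then reduces to `registry.command(session_id, action)`, with the whitelist preventing the operator channel from invoking anything beyond the three control methods.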

Audit trail construction

Every human touchpoint must be recorded:

@dataclass
class AuditEvent:
    event_type: str          # "gate_presented", "approved", "rejected", "escalated", "auto_proceeded"
    session_id: str
    agent_id: str
    action_type: str
    action_payload: dict
    reviewer_id: str | None
    decision: str | None
    timestamp: float
    rationale: str | None

def build_audit_trail(session_id: str) -> list[AuditEvent]:
    """Reconstruct the full decision log for a session."""
    return audit_log.query(session_id=session_id, order_by="timestamp")

def export_audit_report(session_id: str) -> str:
    events = build_audit_trail(session_id)
    lines = [f"Audit trail for session {session_id}"]
    for e in events:
        lines.append(
            f"[{format_ts(e.timestamp)}] {e.event_type}: {e.action_type} "
            f"→ {e.decision or 'auto'}"
            + (f" (reviewer: {e.reviewer_id})" if e.reviewer_id else "")
        )
    return "\n".join(lines)

Layer 3: Deep Dive

HITL as architectural primitive

Human oversight is not an afterthought added to handle edge cases: it is a design constraint that shapes the entire agent architecture. Before building, answer:

  1. What actions require human approval, always? (irreversible writes, high-cost operations, external communications)
  2. What triggers escalation? (confidence thresholds, specific entity types, value thresholds)
  3. How fast can humans respond? (determines whether synchronous or async approval is feasible)
  4. What does “no response” mean? (timeout → auto-reject, auto-approve, or hold indefinitely?)

These decisions belong in the design phase, not in the incident review.
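
One way to keep those answers explicit is to encode them as a single policy object that reviewers can read and version. A sketch with illustrative field names and defaults:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OversightPolicy:
    """Design-phase answers to the four questions, as one reviewable object."""
    always_approve: frozenset = frozenset({"delete_record", "send_email"})  # Q1
    escalation_threshold: float = 0.65                                      # Q2
    approval_timeout_s: float = 300.0                                       # Q3
    on_timeout: str = "reject"                                              # Q4: fail closed

    def needs_gate(self, action: str, confidence: float) -> bool:
        # An action is gated if it is always-approve OR the agent is uncertain.
        return action in self.always_approve or confidence < self.escalation_threshold
```

Freezing the dataclass makes the policy immutable at runtime; changing oversight rules then requires a deploy, which is exactly the review step you want.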

Reversibility assessment

Before acting, classify reversibility programmatically:

REVERSIBILITY_MAP = {
    # Fully reversible — can undo completely
    "create_draft": "reversible",
    "add_tag": "reversible",
    "modify_config": "reversible",

    # Partially reversible — some consequences can't be undone
    "send_notification": "partially_reversible",   # notification received, can't unsend
    "charge_card": "partially_reversible",          # charge can be refunded but not un-made

    # Irreversible — cannot be undone
    "delete_record": "irreversible",
    "send_email": "irreversible",
    "publish_publicly": "irreversible",
    "provision_infrastructure": "irreversible",
}

def gate_level_for(action: str, value: float | None = None) -> str:
    """Return the review level required for this action."""
    reversibility = REVERSIBILITY_MAP.get(action, "irreversible")
    if reversibility == "irreversible":
        return "mandatory_review"
    if reversibility == "partially_reversible":
        return "review_if_high_value" if (value or 0) > 100 else "auto_proceed"
    return "auto_proceed"

Escalation tiers

A flat “human approves everything” model doesn’t scale. Structure escalation in tiers:

| Tier | Route to | Response time target | Trigger |
|---|---|---|---|
| Automated | No human; agent proceeds | Immediate | High confidence, reversible action |
| Async review | On-call queue; reviewed within 1h | 1 hour | Medium confidence or partially reversible |
| Synchronous gate | Real-time reviewer in UI | 5 minutes | Low confidence or irreversible action |
| Escalation | Manager or domain expert | 30 minutes | High-value, ambiguous, or novel situation |
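
The tiers can be sketched as a routing function; the confidence thresholds and the value cutoff below are illustrative, not prescriptive:

```python
def route_tier(confidence: float, reversibility: str,
               value: float = 0.0, novel: bool = False) -> str:
    """Map a pending action onto one of the four escalation tiers."""
    if novel or value > 10_000:        # hypothetical high-value cutoff
        return "escalation"            # manager or domain expert
    if confidence < 0.5 or reversibility == "irreversible":
        return "synchronous_gate"      # real-time reviewer in UI
    if confidence < 0.8 or reversibility == "partially_reversible":
        return "async_review"          # on-call queue, ~1h
    return "automated"                 # high confidence, reversible
```

Note the ordering: the most expensive tier is checked first, so an action that matches several triggers always lands in the stricter one.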


Human-in-the-Loop: Check your understanding

Q1

An agent is configured to require human approval for every action, including read-only searches. Users report the agent is 'unusable.' What design principle does this violate?

Q2

An agent submits an approval request for an irreversible action and waits. The human reviewer never responds. After 5 minutes, the approval times out. What is the correct default behaviour on timeout?

Q3

A confidence-based escalation system routes agent outputs with confidence below 0.65 to human review. An agent produces an output with estimated confidence 0.82 but containing a factual error. What does this reveal about confidence-based escalation?

Q4

An audit trail records every human touchpoint in an agent session. Six months after a disputed action, a user claims they never approved the deletion of their records. What fields in the audit trail are essential to resolve this dispute?

Q5

A long-running agent task is in progress when a business requirement changes: the task should stop before completing a now-unwanted step. The agent has no interrupt mechanism. What is the only remaining option?