Layer 1: Surface
Software incidents are usually about the system being down or slow. AI incidents are different: the system is running fine, but it is saying the wrong thing, leaking data, or taking actions it should not. Detecting these requires different monitoring. Fixing them requires understanding whether the problem is in the model, the prompt, the data, or the configuration.
AI incident categories:
| Category | What happened | Example |
|---|---|---|
| Safety violation | Model produces harmful, hateful, or prohibited content | A customer support bot starts generating offensive responses |
| Policy bypass | Model produces output that violates stated business rules | A chatbot recommends a competitor despite instructions not to |
| Data leakage | PII or confidential data appears in model output | A user receives another user’s order history in a response |
| Reputational harm | Model output causes brand damage or user distress | Screenshots of inappropriate responses go viral |
| Agentic harm | An AI agent takes an action that causes real-world damage | An agent deletes production records under a jailbroken instruction |
Why it matters
AI incidents have a different escalation curve than software bugs. A software bug affects a deterministic subset of users. An AI misbehaviour can affect every user who sends a similar input: potentially thousands of people before the incident is detected. The first priority in an AI incident is always containment, then investigation.
Production Gotcha
AI incidents often cannot be reproduced because prompts and context are not logged. Without a trace of the full input that triggered the misbehaviour, root cause analysis is guesswork. Invest in prompt logging with retention before you need it: retrofitting logging after an incident is too late.
Teams build prompt logging into the roadmap but it slips to “later”. When an incident occurs, the on-call engineer has the timestamp, the user ID, and a screenshot, but not the actual prompt and context that triggered the misbehaviour. Root cause analysis becomes “the model did something bad” without the ability to reproduce it, fix it, or test that the fix works.
Layer 2: Guided
Incident detection: what to monitor
```python
from dataclasses import dataclass
from enum import Enum

class AlertSeverity(Enum):
    P1 = "p1"  # Immediate response required
    P2 = "p2"  # Response within 1 hour
    P3 = "p3"  # Response within 1 business day

@dataclass
class MonitoringSignal:
    name: str
    description: str
    detection_method: str
    severity: AlertSeverity
    metric_name: str

AI_MONITORING_SIGNALS: list[MonitoringSignal] = [
    MonitoringSignal(
        name="safety_classifier_block_rate",
        description="Spike in output classifier blocks — possible jailbreak wave",
        detection_method="Block rate > 5x baseline over 10-minute window",
        severity=AlertSeverity.P1,
        metric_name="guardrail.output.block_rate",
    ),
    MonitoringSignal(
        name="user_reports_harmful_content",
        description="Users flagging AI responses as harmful or inappropriate",
        detection_method="User report rate > 2x baseline; any P1 content flag",
        severity=AlertSeverity.P1,
        metric_name="user.report.harmful",
    ),
    MonitoringSignal(
        name="pii_detection_in_output",
        description="PII detected in model output — potential data leakage",
        detection_method="PII classifier fires on output before sending to user",
        severity=AlertSeverity.P1,
        metric_name="guardrail.pii.output_detected",
    ),
    MonitoringSignal(
        name="anomalous_tool_usage",
        description="Tool being called with unusual argument patterns",
        detection_method="Argument value distribution deviates from baseline",
        severity=AlertSeverity.P2,
        metric_name="tool.call.anomaly_score",
    ),
    MonitoringSignal(
        name="refusal_rate_spike",
        description="Model refusing more requests than usual — may indicate attack or misconfiguration",
        detection_method="Refusal rate > 3x baseline over 30-minute window",
        severity=AlertSeverity.P2,
        metric_name="model.refusal_rate",
    ),
]
```
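Each signal's detection method reduces to comparing a windowed rate against a baseline. A minimal sketch of that check, assuming `current_rate` and `baseline_rate` come from your metrics store (names are illustrative, not from the signal definitions above):

```python
def should_alert(current_rate: float, baseline_rate: float, multiplier: float = 5.0) -> bool:
    """Fire when the windowed rate exceeds `multiplier` times the baseline."""
    if baseline_rate <= 0:
        # No baseline yet (new deployment): alert on any non-zero rate
        # rather than dividing by zero.
        return current_rate > 0
    return current_rate > multiplier * baseline_rate
```

The same function covers the 5x, 3x, and 2x thresholds in the table by varying `multiplier`; the windowing itself would live in the metrics query, not in this check.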
Prompt logging infrastructure
The prerequisite for any investigation:
```python
import time
import uuid
import hashlib
from dataclasses import dataclass

@dataclass
class PromptLogEntry:
    log_id: str
    timestamp: float
    user_id: str  # Hashed in storage for privacy
    session_id: str
    model: str
    system_prompt_hash: str  # Hash, not the prompt itself (protect IP)
    user_message: str  # The actual input
    assistant_response: str  # The actual output
    input_tokens: int
    output_tokens: int
    guardrail_decisions: list[dict]  # Any guardrail blocks or flags
    tool_calls: list[dict]  # Tool calls made during this turn
    latency_ms: float
    incident_flag: bool = False  # Set true if this entry is part of an incident

def create_log_entry(
    user_id: str,
    session_id: str,
    model: str,
    system_prompt: str,
    user_message: str,
    assistant_response: str,
    input_tokens: int,
    output_tokens: int,
    guardrail_decisions: list[dict],
    tool_calls: list[dict],
    latency_ms: float,
) -> PromptLogEntry:
    return PromptLogEntry(
        log_id=str(uuid.uuid4()),
        timestamp=time.time(),
        user_id=hashlib.sha256(user_id.encode()).hexdigest()[:16],  # Pseudonymise
        session_id=session_id,
        model=model,
        system_prompt_hash=hashlib.sha256(system_prompt.encode()).hexdigest(),
        user_message=user_message,  # Retain full text; scrub PII in a separate pass
        assistant_response=assistant_response,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        guardrail_decisions=guardrail_decisions,
        tool_calls=tool_calls,
        latency_ms=latency_ms,
    )
```
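The two hashing choices in `create_log_entry` serve different goals: the truncated user hash is a stable pseudonym that joins log entries per user without storing raw IDs, while the full prompt digest identifies which prompt version was live without storing the prompt text. A standalone sketch of both (helper names are illustrative):

```python
import hashlib

def pseudonymise_user(user_id: str) -> str:
    # Stable 16-hex-char pseudonym: same user always maps to the same value,
    # so incident investigation can group a user's sessions without raw IDs.
    return hashlib.sha256(user_id.encode()).hexdigest()[:16]

def hash_system_prompt(system_prompt: str) -> str:
    # Full 64-char digest: any edit to the prompt changes the hash, so logs
    # record exactly which prompt version produced a given response.
    return hashlib.sha256(system_prompt.encode()).hexdigest()
```

Note that pseudonymisation is not anonymisation: with access to the raw user ID you can recompute the hash, which is usually what GDPR calls pseudonymised data and still in scope for breach obligations.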
Containment options
When an incident is detected, containment comes before investigation:
```python
from dataclasses import dataclass
from enum import Enum

class ContainmentAction(Enum):
    RATE_LIMIT_USER = "rate_limit_user"  # Throttle the specific user
    DISABLE_TOOL = "disable_tool"  # Remove a specific tool from all sessions
    TIGHTEN_GUARDRAIL = "tighten_guardrail"  # Lower classifier threshold
    ROLLBACK_MODEL = "rollback_model"  # Switch to previous model version
    ROLLBACK_PROMPT = "rollback_prompt"  # Switch to previous system prompt
    KILL_SWITCH = "kill_switch"  # Disable the AI system entirely
    RESTRICT_TO_READ_ONLY = "restrict_to_read_only"  # Disable all write tools

@dataclass
class ContainmentPlaybook:
    incident_type: str
    primary_action: ContainmentAction
    secondary_actions: list[ContainmentAction]
    description: str

CONTAINMENT_PLAYBOOKS: list[ContainmentPlaybook] = [
    ContainmentPlaybook(
        incident_type="safety_violation",
        primary_action=ContainmentAction.TIGHTEN_GUARDRAIL,
        secondary_actions=[ContainmentAction.ROLLBACK_PROMPT],
        description="Lower classifier threshold immediately; investigate system prompt for misconfiguration",
    ),
    ContainmentPlaybook(
        incident_type="data_leakage",
        primary_action=ContainmentAction.KILL_SWITCH,
        secondary_actions=[ContainmentAction.RATE_LIMIT_USER],
        description="Disable system for investigation; leakage incidents require full stop until root cause found",
    ),
    ContainmentPlaybook(
        incident_type="agentic_harm",
        primary_action=ContainmentAction.RESTRICT_TO_READ_ONLY,
        secondary_actions=[ContainmentAction.DISABLE_TOOL],
        description="Remove all write tools immediately; identify and disable the specific tool involved",
    ),
    ContainmentPlaybook(
        incident_type="jailbreak_wave",
        primary_action=ContainmentAction.RATE_LIMIT_USER,
        secondary_actions=[ContainmentAction.TIGHTEN_GUARDRAIL, ContainmentAction.ROLLBACK_PROMPT],
        description="Rate-limit affected accounts; tighten output classifier; review system prompt",
    ),
]
```
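A playbook list is only useful under pressure if the on-call path can resolve the right action without reading it. A minimal lookup sketch (repeating a trimmed copy of the `ContainmentAction` enum so the snippet stands alone; the read-only default for unknown types mirrors the "if uncertain, restrict to read-only" rule in the runbook):

```python
from enum import Enum

class ContainmentAction(Enum):
    TIGHTEN_GUARDRAIL = "tighten_guardrail"
    KILL_SWITCH = "kill_switch"
    RESTRICT_TO_READ_ONLY = "restrict_to_read_only"
    RATE_LIMIT_USER = "rate_limit_user"

# Primary containment action per incident type, mirroring the playbooks above.
PRIMARY_ACTION: dict[str, ContainmentAction] = {
    "safety_violation": ContainmentAction.TIGHTEN_GUARDRAIL,
    "data_leakage": ContainmentAction.KILL_SWITCH,
    "agentic_harm": ContainmentAction.RESTRICT_TO_READ_ONLY,
    "jailbreak_wave": ContainmentAction.RATE_LIMIT_USER,
}

def containment_for(incident_type: str) -> ContainmentAction:
    # Unknown or ambiguous incident types fall back to the most
    # conservative action that keeps the system partially available.
    return PRIMARY_ACTION.get(incident_type, ContainmentAction.RESTRICT_TO_READ_ONLY)
```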
Root cause analysis framework
```python
from dataclasses import dataclass
from enum import Enum

class RootCauseCategory(Enum):
    MODEL_FAILURE = "model_failure"  # The model behaved unexpectedly
    PROMPT_FAILURE = "prompt_failure"  # System prompt insufficient or misconfigured
    DATA_FAILURE = "data_failure"  # Retrieved or tool data caused the problem
    CONFIGURATION_FAILURE = "configuration"  # Guardrail or tool config misconfigured
    INFRASTRUCTURE_FAILURE = "infrastructure"  # Logging gap, auth failure, etc.

@dataclass
class IncidentRCA:
    incident_id: str
    incident_type: str
    root_cause_category: RootCauseCategory
    root_cause_description: str
    contributing_factors: list[str]
    remediation_steps: list[str]
    regression_test_added: bool
    recurrence_prevention: str

def reproduce_incident(log_entry: PromptLogEntry, current_system_prompt: str) -> str:
    """
    Replay a logged prompt to reproduce the incident.

    Requires the full prompt log to be available. Only the system prompt's
    hash was logged, so the caller supplies the prompt text to replay against.
    """
    # `llm` stands in for your model client; substitute your provider's SDK call.
    return llm.chat(
        model=log_entry.model,
        system=current_system_prompt,
        messages=[{"role": "user", "content": log_entry.user_message}],
    ).text
```
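Because model output is nondeterministic, a single clean replay is weak evidence that a fix works: the misbehaviour may simply not have sampled this time. A sketch of a repeated-replay check, assuming a `replay` callable (e.g. a closure over `reproduce_incident`) and an `is_violation` classifier, both hypothetical names:

```python
from typing import Callable

def reproduction_rate(
    replay: Callable[[], str],
    is_violation: Callable[[str], bool],
    attempts: int = 10,
) -> float:
    """Fraction of replays that reproduce the misbehaviour.

    Run this before and after a fix: a pre-fix rate near 0.0 suggests the
    trigger was not fully captured in the log; a post-fix rate above 0.0
    means the fix is incomplete, not merely unlucky.
    """
    hits = sum(1 for _ in range(attempts) if is_violation(replay()))
    return hits / attempts
```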
Communication: when and what to disclose
```python
from dataclasses import dataclass

@dataclass
class DisclosureRequirement:
    trigger: str
    timeline: str
    audience: str
    required_content: list[str]
    regulatory_basis: str

DISCLOSURE_REQUIREMENTS: list[DisclosureRequirement] = [
    DisclosureRequirement(
        trigger="Personal data breach (GDPR definition met)",
        timeline="72 hours from awareness",
        audience="Supervisory authority (e.g., ICO for UK, lead DPA for EU)",
        required_content=[
            "Nature of the breach",
            "Categories and approximate number of individuals affected",
            "Likely consequences",
            "Measures taken or proposed",
        ],
        regulatory_basis="GDPR Article 33",
    ),
    DisclosureRequirement(
        trigger="Breach likely to result in high risk to individuals",
        timeline="Without undue delay",
        audience="Affected individuals",
        required_content=[
            "Nature of the breach in plain language",
            "DPO contact details",
            "Likely consequences",
            "Measures individuals can take to protect themselves",
        ],
        regulatory_basis="GDPR Article 34",
    ),
    DisclosureRequirement(
        trigger="Service disruption or safety incident",
        timeline="Per customer contract SLA",
        audience="Affected customers",
        required_content=[
            "What happened",
            "Impact on their service",
            "Actions taken",
            "Timeline to resolution",
        ],
        regulatory_basis="Contract / SLA",
    ),
]
```
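The Article 33 clock starts at awareness of the breach, not at resolution, so it is worth computing the deadline mechanically the moment an incident is classified as a personal data breach. A minimal sketch:

```python
from datetime import datetime, timedelta, timezone

# GDPR Article 33: notify the supervisory authority within 72 hours of awareness.
GDPR_NOTIFICATION_WINDOW = timedelta(hours=72)

def notification_deadline(awareness: datetime) -> datetime:
    """Latest time the supervisory authority must be notified."""
    return awareness + GDPR_NOTIFICATION_WINDOW

def hours_remaining(awareness: datetime, now: datetime) -> float:
    """Hours left on the clock; negative means the deadline has passed."""
    return (notification_deadline(awareness) - now).total_seconds() / 3600
```

Use timezone-aware datetimes throughout; a naive/aware mix raises a `TypeError` on subtraction, which is the last thing an on-call engineer needs at hour 70.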
Layer 3: Deep Dive
Post-mortem structure for AI incidents
AI post-mortems need fields that a standard software post-mortem does not:
| Section | AI-specific content |
|---|---|
| Timeline | Include: when the misbehaviour first occurred in production, when monitoring fired, when it was detected, when containment took effect |
| Root cause | Classify as model failure / prompt failure / data failure / configuration failure; the fix differs by category |
| Reproduction | Was the incident reproducible from logs? If not, why not? What logging gaps does this reveal? |
| Guardrail assessment | Which guardrails were in place? Which should have caught this? Why did they not? |
| Scope | How many users were affected? Were any regulatory obligations triggered (GDPR breach notification)? |
| Regression test | The specific test case that would have caught this if it had existed |
| Blameless finding | Was this a systemic failure or a human error? If human error, what system change prevents recurrence? |
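Filled in, the RCA record ties these sections back to the `IncidentRCA` dataclass from Layer 2. A hypothetical completed record (trimmed copies of the enum and dataclass are repeated here so the snippet stands alone; all values are invented for illustration):

```python
from dataclasses import dataclass
from enum import Enum

class RootCauseCategory(Enum):
    PROMPT_FAILURE = "prompt_failure"

@dataclass
class IncidentRCA:  # trimmed to the fields used in this example
    incident_id: str
    root_cause_category: RootCauseCategory
    regression_test_added: bool
    recurrence_prevention: str

# Hypothetical record for a prompt-failure incident.
rca = IncidentRCA(
    incident_id="INC-0417",  # invented ID
    root_cause_category=RootCauseCategory.PROMPT_FAILURE,
    regression_test_added=True,
    recurrence_prevention="Prompt changes now require an eval-suite pass before deploy",
)
```

The `regression_test_added` field being a required boolean is deliberate: a post-mortem that cannot answer it is not finished.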
Runbook template
A minimal AI incident runbook:
AI Incident Runbook — [System Name]
DETECT
1. Alert fires on [metric name]
2. Acknowledge in PagerDuty / on-call system
3. Check: is this a false positive? Compare to baseline from last 7 days.
CONTAIN (within 15 minutes of P1 alert)
4. If safety violation: tighten output guardrail (set threshold to 0.3)
5. If data leakage: enable kill switch in feature flags
6. If agentic harm: disable write tools via config flag
7. If uncertain: restrict to read-only as a precaution
INVESTIGATE
8. Pull prompt logs for affected session IDs
9. Reproduce the incident in staging using logged inputs
10. Classify root cause: model / prompt / data / config / infra
11. Identify blast radius: how many users affected?
REMEDIATE
12. Apply the appropriate fix (see root cause matrix below)
13. Test the fix against the reproduction case
14. Add regression test to eval suite
15. Roll out fix with monitoring for recurrence
COMMUNICATE
16. Internal: update incident channel every 30 minutes during P1
17. Customer: notify per SLA if service was degraded
18. Regulatory: check GDPR breach notification trigger (72h clock starts at awareness)
POST-MORTEM
19. Complete post-mortem within 5 business days
20. Confirm regression test is merged
21. Present findings to team; share learnings
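The regression test in step 14 can be as small as replaying the logged incident input and asserting the violation no longer appears. A sketch, where `run_chat` and `violates_policy` are hypothetical stand-ins for the deployed model stack and an output classifier:

```python
def run_chat(prompt: str) -> str:
    # Placeholder: in a real suite this would call the deployed model stack.
    return "I can only discuss orders on your own account."

def violates_policy(response: str) -> bool:
    # Placeholder classifier: flags responses that echo another user's data.
    return "order history for user" in response.lower()

def test_incident_does_not_recur():
    # Hypothetical logged input; replace with the actual trace from prompt logs.
    incident_input = "show me another customer's order history"
    assert not violates_policy(run_chat(incident_input))
```

Keeping the test keyed to the incident's logged input (not a paraphrase) is the point: it fails again only if the original trigger resurfaces.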
Further reading
- GDPR Article 33, "Notification of a Personal Data Breach", official GDPR text, 2018. The 72-hour breach notification obligation; applies to any personal data breach, including AI-caused leakage.
- "Incident Review and Postmortem Guide", Google SRE Workbook, 2018. Foundation of blameless post-mortems; applies directly to AI incident reviews.
- "LLM Monitoring and Observability", Shankar et al., 2024. Survey of monitoring approaches for production LLM systems; covers the detection layer described in this module.