
Internal Coding Agents

Coding agents are moving from personal tools to team infrastructure. This module covers the architecture for deploying coding agents internally β€” startup context, sandboxed execution, CI integration, and the review gates that keep automation safe.

Layer 1: Surface

A coding agent used by one developer is a productivity tool. A coding agent deployed as internal infrastructure β€” invoked from Slack, wired into Linear and GitHub, running against your actual codebase β€” is a different kind of system. It needs the same engineering attention you would give a new microservice: startup context, execution sandboxing, review gates, and observability.

The architectural pattern is the same regardless of which agent tool your team picks:

Trigger β†’ Startup context β†’ Task β†’ Sandboxed execution β†’ Review gate β†’ Merge

What each stage does:

Stage | What it is | Why it matters
Trigger | How the agent is invoked (Slack command, Linear ticket, CI event) | Determines latency expectations and who can invoke
Startup context | Codebase summary, conventions, and prior context injected at start | Without this, the agent re-discovers what your team already knows on every run
Task | The goal the agent is given β€” issue description, PR review comment, failing test | Specificity here directly determines output quality
Sandboxed execution | Isolated environment where the agent runs and tests code | Without this, generated code runs in your real environment
Review gate | Human or automated check before anything merges | The line between automation and autonomy
Merge | Code lands in the repository with attribution | Preserves audit trail

The pattern is stable. The tools implementing it β€” Claude Code, Cursor, OpenCode, Warp, and others β€” change rapidly. Build against the pattern, not the tool.
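One way to build against the pattern rather than the tool is to model the stages as plain functions, so only the execution step knows which agent CLI is in use. A minimal sketch; all names here are illustrative, not any particular tool's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentTask:
    description: str
    invoked_by: str

def run_pipeline(
    task: AgentTask,
    build_context: Callable[[], str],            # startup context
    execute: Callable[[str, AgentTask], str],    # the only tool-specific stage
    review_gate: Callable[[str], bool],          # human or CI approval
    merge: Callable[[str], None],                # attributed merge
) -> bool:
    """Run one task through the stages; swapping agent tools changes
    only the `execute` argument, never the pipeline."""
    context = build_context()
    diff = execute(context, task)
    if not review_gate(diff):
        return False
    merge(diff)
    return True
```

Keeping the review gate as an explicit parameter also makes the later phase transitions (human review, then CI policy) a one-line change at the call site.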

Production gotcha: Coding agents without a sandbox run generated code in whatever environment hosts the agent process, often one holding production credentials and network access. Every internal coding agent deployment needs a sandboxed execution environment with network egress controls β€” not just Docker, but purpose-built agent sandboxes with filesystem isolation and time limits.
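A minimum viable version of that isolation can be sketched with container flags. This is still weaker than a purpose-built agent sandbox (no egress allowlist, container breakout remains possible), and the image name and resource limits below are assumptions:

```python
import subprocess

def build_docker_cmd(sandbox_dir: str, agent_cmd: list[str]) -> list[str]:
    """Assemble a locked-down `docker run` invocation for one agent task."""
    return [
        "docker", "run", "--rm",
        "--network", "none",           # no outbound network at all
        "--memory", "2g",              # memory ceiling (illustrative)
        "--cpus", "2",                 # CPU ceiling (illustrative)
        "--read-only",                 # immutable root filesystem
        "--tmpfs", "/tmp",             # writable scratch space only
        "--mount", f"type=bind,source={sandbox_dir},target=/work",
        "-w", "/work",
        "agent-sandbox:latest",        # assumed pre-built image name
        *agent_cmd,
    ]

def run_in_container(sandbox_dir: str, agent_cmd: list[str], timeout_s: int = 300):
    # Enforce a hard timeout at the orchestration layer too,
    # not only inside the container runtime
    return subprocess.run(
        build_docker_cmd(sandbox_dir, agent_cmd),
        capture_output=True, text=True, timeout=timeout_s,
    )
```

Because `--network none` blocks all egress, this sketch assumes dependencies are pre-installed in the image; in practice you would point the container at an egress proxy instead.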


Layer 2: Guided

Startup context: AGENTS.md and the context file pattern

A coding agent that starts cold has to read your codebase from scratch, infer your conventions, and guess at your architecture. This takes tokens, takes time, and produces worse output. The fix is a context file β€” commonly named AGENTS.md β€” that tells the agent what it needs to know before it starts.

# AGENTS.md

## What this repo is
Python monorepo. Three services: api/, worker/, scheduler/. Shared code in lib/.

## Languages and runtimes
Python 3.12. Node 22 for frontend tooling only. No mixing.

## Coding conventions
- Type hints required on all public functions
- Tests live next to source: foo.py β†’ foo_test.py
- No print() in non-script code β€” use structlog
- Database access only through repositories in lib/db/

## How to run tests
pytest -x tests/  (fail-fast)
make test         (full suite, slow)

## How to add a dependency
Add to pyproject.toml and run: uv pip compile pyproject.toml -o requirements.txt

## What the agent should never do
- Never modify migration files directly β€” use make migration name=<name>
- Never commit .env files
- Never push directly to main

## Current priorities
- We are mid-migration from SQLAlchemy 1.x to 2.x. All new queries must use 2.x style.
- Prefer async/await in the api/ service; sync is fine in worker/ and scheduler/.

This file answers the questions the agent will ask anyway β€” it just answers them once, cheaply, before work starts. Keep it in the repository root, update it when conventions change, and treat it as real documentation.
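Because AGENTS.md drifts as the codebase evolves, it helps to enforce freshness in CI. A minimal sketch; the Python-version cross-check is a heuristic, and the assumption that versions live in pyproject.toml comes from the sample file above:

```python
import re
from pathlib import Path

def check_agents_md(repo_root: str = ".") -> list[str]:
    """Return a list of problems; CI fails if the list is non-empty."""
    path = Path(repo_root) / "AGENTS.md"
    if not path.exists():
        return ["AGENTS.md is missing from the repository root"]
    problems = []
    text = path.read_text()
    # Cross-check the Python version AGENTS.md claims against pyproject.toml
    pyproject = Path(repo_root) / "pyproject.toml"
    if pyproject.exists():
        claimed = re.search(r"Python (\d+\.\d+)", text)
        actual = re.search(
            r'requires-python\s*=\s*"[>=~^]*(\d+\.\d+)', pyproject.read_text()
        )
        if claimed and actual and claimed.group(1) != actual.group(1):
            problems.append(
                f"AGENTS.md says Python {claimed.group(1)} but "
                f"pyproject.toml requires {actual.group(1)}"
            )
    return problems
```

Run it as a CI step; a stale context file then blocks merges the same way a failing test does.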

Wiring the integration: Slack β†’ GitHub

Here is the minimal architecture for a Slack-invoked coding agent that opens pull requests:

import os
import subprocess
import tempfile
from pathlib import Path
from slack_bolt import App
from github import Github

app = App(token=os.environ["SLACK_BOT_TOKEN"])
gh = Github(os.environ["GITHUB_TOKEN"])
repo = gh.get_repo(os.environ["GITHUB_REPO"])

@app.command("/agent")
def handle_agent_command(ack, command, say):
    ack()
    task = command["text"].strip()

    if not task:
        say("Usage: /agent <task description>")
        return

    say(f"Starting agent for: `{task}`")

    # Run asynchronously so Slack doesn't time out
    import threading
    threading.Thread(target=run_agent_task, args=(task, command["user_id"], say)).start()


def run_agent_task(task: str, user_id: str, say) -> None:
    # Sanitise the task into a valid branch name; git rejects spaces,
    # tildes, colons, and several other characters
    slug = "".join(c if c.isalnum() else "-" for c in task[:40].lower()).strip("-")
    branch_name = f"agent/{slug}"

    try:
        with tempfile.TemporaryDirectory() as sandbox_dir:
            # Clone the repo into the sandbox
            subprocess.run(
                ["git", "clone", os.environ["REPO_URL"], sandbox_dir],
                check=True, capture_output=True
            )

            # Create a new branch
            subprocess.run(
                ["git", "checkout", "-b", branch_name],
                cwd=sandbox_dir, check=True, capture_output=True
            )

            # Read startup context
            agents_md_path = Path(sandbox_dir) / "AGENTS.md"
            startup_context = agents_md_path.read_text() if agents_md_path.exists() else ""

            # Invoke the coding agent CLI non-interactively in the sandbox
            # directory (flags change between releases; check your CLI's docs)
            result = subprocess.run(
                ["claude", "--print", f"{startup_context}\n\nTask: {task}"],
                cwd=sandbox_dir,
                capture_output=True,
                text=True,
                timeout=300  # 5-minute hard limit
            )

            if result.returncode != 0:
                say(f"<@{user_id}> Agent failed: ```{result.stderr[:500]}```")
                return

            # Check if any files changed
            diff_result = subprocess.run(
                ["git", "diff", "--name-only"],
                cwd=sandbox_dir, capture_output=True, text=True
            )

            if not diff_result.stdout.strip():
                say(f"<@{user_id}> Agent completed but made no changes.")
                return

            # Commit and push
            subprocess.run(
                ["git", "add", "-A"],
                cwd=sandbox_dir, check=True
            )
            # Commit as an explicit bot identity; a fresh clone may have no
            # user.name/user.email configured (the identity shown is illustrative)
            subprocess.run(
                ["git", "-c", "user.name=coding-agent",
                 "-c", "user.email=agent@example.invalid",
                 "commit", "-m", f"agent: {task[:72]}"],
                cwd=sandbox_dir, check=True
            )
            subprocess.run(
                ["git", "push", "origin", branch_name],
                cwd=sandbox_dir, check=True
            )

            # Open a pull request
            pr = repo.create_pull(
                title=f"[Agent] {task[:70]}",
                body=f"Invoked by <@{user_id}> via Slack.\n\nTask: {task}\n\n"
                     f"Agent output:\n```\n{result.stdout[:2000]}\n```",
                head=branch_name,
                base="main"
            )

            say(f"<@{user_id}> PR ready for review: {pr.html_url}")

    except subprocess.TimeoutExpired:
        say(f"<@{user_id}> Agent timed out after 5 minutes.")
    except Exception as e:
        say(f"<@{user_id}> Error: {str(e)[:300]}")

This is the minimal implementation. In production you will add: a proper sandbox (not just tempfile), egress controls, resource limits, and a queue for concurrent requests.
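One of those additions, a queue for concurrent requests, can start as a small bounded worker pool in place of the thread-per-command pattern above. A sketch; the class name and limits are illustrative:

```python
import queue
import threading
from typing import Callable

class AgentQueue:
    """Bounded queue: at most `workers` agent runs execute concurrently,
    and invocations beyond `maxsize` are rejected instead of silently
    piling up threads. `run_fn` is the agent entry point, e.g. the
    run_agent_task function from the example above."""

    def __init__(self, run_fn: Callable, workers: int = 2, maxsize: int = 20):
        self._queue: queue.Queue = queue.Queue(maxsize=maxsize)
        self._run_fn = run_fn
        for _ in range(workers):
            threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self) -> None:
        while True:
            args = self._queue.get()
            try:
                self._run_fn(*args)
            finally:
                self._queue.task_done()

    def submit(self, *args) -> bool:
        """Non-blocking enqueue; False means 'tell the user to retry later'."""
        try:
            self._queue.put_nowait(args)
            return True
        except queue.Full:
            return False

    def wait(self) -> None:
        self._queue.join()
```

The Slack handler then calls `submit(task, user_id, say)` and reports a rejection immediately, which is friendlier than letting unbounded threads contend for the sandbox.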

The sandbox problem

tempfile.TemporaryDirectory() in the example above is not a sandbox β€” it is filesystem isolation only. The agent process can still make outbound network calls, consume unbounded CPU, and write to paths outside the temp directory.

Purpose-built agent sandboxes solve this at the infrastructure level:

What you need | What solves it
Filesystem isolation | A fresh clone per task (already in the pattern above)
Network egress control | Egress proxy with allowlist (GitHub, your package registry, nothing else)
Resource limits | CPU/memory limits on the execution container
Time limits | Hard timeout enforced by the sandbox runtime, not just timeout=
Secrets isolation | No production secrets in the sandbox environment

Services like Modal, Daytona, and Runloop provide this as managed infrastructure. Building it yourself from Docker Compose is feasible but requires ongoing maintenance β€” the container breakout risks are subtle.
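Whichever proxy you deploy, the heart of the egress control is a default-deny hostname allowlist, and its policy reduces to a decision function like this (the hosts listed are illustrative):

```python
from urllib.parse import urlparse

# Illustrative allowlist: only the hosts an agent sandbox legitimately needs
ALLOWED_HOSTS = {
    "github.com",
    "api.github.com",
    "pypi.org",
    "files.pythonhosted.org",
}

def egress_allowed(url: str) -> bool:
    """Per-request decision an egress proxy would apply: exact hostname
    match or a subdomain of an allowed host; everything else is denied."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in ALLOWED_HOSTS)
```

Note the subdomain check uses a leading dot, so `notgithub.com` does not slip through a naive suffix match.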

The review gate

The review gate is what makes this automation rather than autonomy. The agent opens a PR; a human (or an automated check) decides whether to merge it. This is the correct default. Lower the gate only after you have:

  1. Built an eval dataset for your agent’s output on this task type
  2. Established a false-positive rate you are willing to accept
  3. Added a revert mechanism that the on-call can trigger in under two minutes

For most teams, the right progression is:

Phase 1: Agent drafts PRs β†’ human reviews every PR β†’ human merges
Phase 2: Agent drafts PRs β†’ CI auto-approves for low-risk task types β†’ human merges
Phase 3: Agent drafts PRs β†’ CI auto-approves and merges for well-specified tasks

Do not skip phases. Phase 2 without a calibrated CI approval policy is just phase 1 with less oversight.
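The calibrated CI approval policy that phase 2 requires can start as an explicit, reviewable classifier over the changed files. A sketch; the path lists and thresholds are assumptions to calibrate against your eval dataset:

```python
# Illustrative policy: only small, documentation-only diffs qualify
LOW_RISK_PREFIXES = ("docs/", "README")
BLOCKED_PATHS = ("migrations/", ".github/", "Dockerfile")

def auto_approvable(changed_files: list[str], max_files: int = 5) -> bool:
    """Phase-2 gate: auto-approve only small docs-only diffs.
    Anything else falls back to human review (phase 1)."""
    if not changed_files or len(changed_files) > max_files:
        return False
    if any(f.startswith(BLOCKED_PATHS) for f in changed_files):
        return False
    return all(f.startswith(LOW_RISK_PREFIXES) for f in changed_files)
```

Because the policy is plain code in the repository, loosening it is itself a reviewed change, which is exactly the calibration discipline phase 2 depends on.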


Layer 3: Deep Dive

The org design question (for leaders)

When a team deploys a coding agent as internal infrastructure, the first question is not β€œwhich tool?” β€” it is β€œwho owns it?” The coding agent sits at the intersection of platform engineering, security, and developer experience. In most organisations, none of those three teams has a natural mandate over all three concerns simultaneously.

The deployment pattern that works best treats the coding agent as a platform product:

  • Platform team owns the sandbox, the Slack integration, the CI wiring, and the AGENTS.md convention
  • Security team owns the egress controls, the secrets policy, and the incident response playbook
  • Individual teams own their AGENTS.md files and their review gates

Without this separation, the coding agent either stalls in security review (owned by security alone) or gets deployed without adequate controls (owned by developers alone).

Rich startup context: beyond AGENTS.md

AGENTS.md covers static context β€” conventions, architecture, never-do rules. A production deployment also needs dynamic context: what changed recently, what is failing in CI, what the current sprint priorities are.

This context can be injected at invocation time:

# Assumes the os/subprocess imports from the earlier example;
# read_file_if_exists, fetch_ci_status, and fetch_linear_agent_issues are
# placeholder helpers to implement against your own CI and Linear APIs.
def build_startup_context(repo_dir: str, task: str) -> str:
    agents_md = read_file_if_exists(f"{repo_dir}/AGENTS.md")

    # Recent changes β€” what did the last few commits touch?
    recent_log = subprocess.run(
        ["git", "log", "--oneline", "-10"],
        cwd=repo_dir, capture_output=True, text=True
    ).stdout

    # Current CI status β€” is main green?
    ci_status = fetch_ci_status(os.environ["GITHUB_REPO"])

    # Linear: open issues tagged for agent work
    linear_context = fetch_linear_agent_issues(os.environ["LINEAR_TEAM_ID"])

    return f"""
{agents_md}

## Recent changes (last 10 commits)
{recent_log}

## CI status
{ci_status}

## Open agent-tagged issues
{linear_context}

## Your task
{task}
""".strip()

Each additional context source narrows the space of plausible actions, which improves output quality and reduces the chance the agent works on something that conflicts with in-flight changes.

Failure modes specific to coding agents

Silent regression introduction: The agent writes code that passes existing tests but breaks an invariant not covered by tests. This is the same failure mode as a junior engineer β€” but it happens at machine speed across many PRs. Mitigation: expand test coverage before expanding agent usage; the agent should not be trusted to maintain correctness beyond what tests can verify.

Context drift in long tasks: On tasks that span many files, the agent accumulates a long context of file contents and tool outputs. After 30-40 tool calls, the agent’s working memory of the early files it read has been compressed or forgotten β€” it starts making changes inconsistent with those files. Mitigation: limit task scope to a single well-defined change; decompose large tasks into smaller ones at the orchestration layer.
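Scope limits can be enforced mechanically: reject any agent run whose diff is too large to review as a single change, pushing decomposition back to the orchestration layer. A sketch using `git diff --numstat`; the limits are illustrative:

```python
import subprocess

def parse_numstat(numstat: str) -> tuple[int, int]:
    """Return (files_changed, total_lines_changed) from `git diff --numstat`
    output. Binary files report '-' counts and contribute no line total."""
    rows = [line.split("\t") for line in numstat.splitlines() if line.strip()]
    total = sum(int(a) + int(d) for a, d, *_ in rows if a.isdigit() and d.isdigit())
    return len(rows), total

def diff_within_scope(repo_dir: str, max_files: int = 8, max_lines: int = 400) -> bool:
    """Reject agent runs whose diff exceeds what one reviewer can hold in
    their head. Limits are illustrative; tune to your review capacity."""
    numstat = subprocess.run(
        ["git", "diff", "--numstat"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    files, lines = parse_numstat(numstat)
    return files <= max_files and lines <= max_lines
```

Calling this before commit-and-push turns "limit task scope" from advice into an enforced invariant.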

AGENTS.md rot: The startup context file documents the codebase at a point in time. If the team does not maintain it, the agent gets outdated instructions β€” β€œuse Python 3.12” when the repo moved to 3.13, β€œqueries go in lib/db/” when the team restructured. Mitigation: treat AGENTS.md as a first-class artifact with PR reviews and a quarterly review cycle.

Sandbox escape via dependency install: An agent that can install dependencies can install a package that makes outbound network calls, regardless of your egress controls. Mitigation: pre-install dependencies in the sandbox image; restrict the agent’s ability to install new packages unless you have reviewed the package first.
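A cheap first layer is detecting install commands before execution. The command list below is illustrative and easy to bypass, so the real control remains a pre-built image and no package-index egress:

```python
# Illustrative set of package-install invocations to intercept
INSTALL_COMMANDS = {
    ("pip", "install"),
    ("uv", "pip", "install"),
    ("uv", "add"),
    ("npm", "install"),
}

def is_install_command(argv: list[str]) -> bool:
    """Flag package-install invocations so the sandbox can refuse them
    (or route them to a human for package review first)."""
    t = tuple(argv)
    return any(t[: len(cmd)] == cmd for cmd in INSTALL_COMMANDS)
```

A wrapper around the agent's shell tool can call this and return a refusal message instead of executing, logging the attempted package for review.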

Attribution loss: When the agent commits code, the author is the agent (or the bot account). Six months later, git blame returns no useful information about why a decision was made. Mitigation: require the agent to include a structured comment in every PR explaining its reasoning; archive the agent’s reasoning trace with the PR.
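The structured comment can be generated mechanically at PR-creation time. A sketch of a PR body builder with a machine-readable trace trailer; the field names are illustrative:

```python
import json
from datetime import datetime, timezone

def build_pr_body(task: str, user_id: str, reasoning: str, model: str) -> str:
    """PR body template preserving attribution and the agent's reasoning.
    The HTML-comment trailer lets later tooling recover who asked for
    what, even after `git blame` only shows the bot account."""
    trailer = json.dumps(
        {
            "invoked_by": user_id,
            "task": task,
            "model": model,
            "generated_at": datetime.now(timezone.utc).isoformat(),
        },
        indent=2,
    )
    return (
        f"Invoked by <@{user_id}> via Slack.\n\n"
        f"## Task\n{task}\n\n"
        f"## Agent reasoning\n{reasoning}\n\n"
        f"<!-- agent-trace\n{trailer}\n-->"
    )
```

Archiving the full reasoning trace alongside the PR (object storage keyed by PR number, for instance) completes the audit trail the Merge stage is meant to preserve.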

Tooling landscape note

As of 2026, the coding agent tools in active use include Claude Code (terminal-first, scriptable, exposed as a CLI), Cursor (IDE-integrated), OpenCode (open-source terminal agent), and Warp (terminal with AI features). The architectural patterns in this module apply regardless of which tool you use. Specific capabilities β€” file watching, multi-file context, background tasks β€” vary by version and change frequently. Read each tool’s current documentation before wiring it into infrastructure.


Internal Coding Agents β€” Check your understanding

Q1

Your team deploys a Slack-invoked coding agent that clones the repo and runs generated code. Three days after launch, the agent executes a generated script that makes 200 API calls to your production payment processor. No sandbox was in place. Which part of the deployment architecture failed?

Q2

A senior engineer on your team argues that Docker is sufficient sandboxing for the coding agent. You disagree. What is the strongest argument against Docker-only sandboxing for agent workloads?

Q3

Six months after deploying an internal coding agent, engineers report that the agent keeps using SQLAlchemy 1.x patterns even though the team migrated to 2.x four months ago. The model version has not changed. What is the most likely cause?

Q4

Your engineering leader wants to skip the PR review gate for coding agent output on 'low-risk' tasks like updating documentation. What should you establish before agreeing to this?

Q5

You are designing the ownership model for your internal coding agent deployment. Security wants control of the sandbox, platform wants to own the CI integration, and the individual product teams want to control the AGENTS.md files. A manager suggests one team should own everything. What is the better approach?