Coding Agents

An agent that drafts a marketing email has a problem: it cannot tell whether the result is good. An agent that writes code does not have that problem. It runs the tests, and the tests either pass or they do not. That binary signal — unambiguous, deterministic, instant — turns the ReAct loop from a general-purpose reasoning engine into a tight feedback cycle that converges on working software.

This property — tight, deterministic feedback — is why coding agents have advanced faster than agents in most other domains. The tool set is specialized, the planning strategies are concrete, and the orchestration patterns map cleanly to the structure of real programming tasks. But the domain also introduces its own challenges — large codebases that overflow the context window, multi-file changes that need coordination, and safety concerns that come with giving an agent a shell. We will walk through the architecture of a coding agent, the tools it needs, the feedback loops that drive quality, and the design decisions that separate a demo from something you would trust with a real codebase.

Anatomy of a Coding Agent #

A coding agent is a ReAct-style loop equipped with tools for reading files, writing code, running commands, and observing results. The core loop is the same think-act-observe cycle, but the tools and the nature of the observations are specific to software development.

  Task (bug report, feature request, refactor)
      │
      ▼
┌───────────┐
│  Reason   │◀─────────────┐
│ "What do  │              │
│ I change?"│              │
└─────┬─────┘              │
      │                    │
      ▼                    │
┌───────────┐       ┌──────┴───────┐
│  Edit     │       │  Observe     │
│  code     │       │  output,     │
└─────┬─────┘       │  errors,     │
      │             │  test results│
      ▼             └──────▲───────┘
┌───────────┐              │
│  Run      │──────────────┘
│  tests /  │
│  linter   │
└───────────┘

The reasoning step is where the model decides what to change. It is constrained by the codebase context the agent has gathered. A good coding agent reads the relevant files before it writes anything, the same way a human developer would.

The edit step produces code changes. These can be full file rewrites, targeted insertions, or search-and-replace operations. The format matters more than you might expect — we will cover that in the tooling section.

The run step executes something deterministic: a test suite, a linter, a type checker, a build command. The output of this step is the observation that drives the next iteration.

This loop is essentially the evaluator-optimizer pattern from workflow orchestration, but with a critical difference: the evaluator is the real execution environment — the compiler, the test runner, the runtime. That makes the feedback objective rather than subjective, which is why coding agents can self-correct more reliably than agents working on open-ended tasks.

The Tool Set #

A coding agent needs a small but carefully designed set of tools. The temptation is to give it everything — a full shell, unrestricted file access, a browser. But more tools means more decisions for the model, and more decisions means more chances to pick the wrong one. The most effective coding agents use a focused tool set where each tool has a clear purpose.

Reading Code #

Before the agent can change anything, it needs to understand what exists. Reading tools let the agent explore the codebase the way a developer would: browse the file tree, open files, search for symbols.

TOOLS = [
    {
        "name": "list_directory",
        "description": "List files and subdirectories in a directory. "
                       "Use to understand project structure before diving into files.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "Relative path from project root. Use '.' for root."
                }
            },
            "required": ["path"],
            "additionalProperties": False
        }
    },
    {
        "name": "read_file",
        "description": "Read the contents of a file. Returns the full text. "
                       "Use to examine existing code before making changes.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "Relative path to the file from project root."
                },
                "start_line": {
                    "type": "integer",
                    "description": "First line to read (1-indexed). Omit to start from the beginning."
                },
                "end_line": {
                    "type": "integer",
                    "description": "Last line to read (1-indexed, inclusive). Omit to read to end."
                }
            },
            "required": ["path"],
            "additionalProperties": False
        }
    },
    {
        "name": "search_code",
        "description": "Search for a text pattern across the codebase. "
                       "Returns matching lines with file paths and line numbers. "
                       "Use to find where a function, class, or variable is defined or used.",
        "parameters": {
            "type": "object",
            "properties": {
                "pattern": {
                    "type": "string",
                    "description": "Text or regex pattern to search for."
                },
                "file_glob": {
                    "type": "string",
                    "description": "Glob to limit search scope (e.g., '*.py'). Omit to search all files."
                }
            },
            "required": ["pattern"],
            "additionalProperties": False
        }
    },
]

A few design decisions worth calling out. The read_file tool supports line ranges so the agent can read a specific function instead of loading a 2,000-line file into context. The search_code tool supports globs so the agent can narrow the search when it knows the language. Those are context management tools. Every token the agent spends reading irrelevant code is a token it cannot spend reasoning about the change.
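
As an illustration of the line-range behavior, here is a minimal sketch of what the handler behind read_file might look like. The function signature, the project_root argument, and the line-number prefix on each returned line are assumptions made for this example, not part of the schema above.

```python
from pathlib import Path
from typing import Optional


def read_file(project_root: str, path: str,
              start_line: Optional[int] = None,
              end_line: Optional[int] = None) -> str:
    """Return the requested lines of a file, each prefixed with its line number."""
    root = Path(project_root).resolve()
    target = (root / path).resolve()
    # Refuse paths that escape the project root (for example '../../etc/passwd').
    if not target.is_relative_to(root):
        raise ValueError(f"path escapes project root: {path}")

    lines = target.read_text(encoding="utf-8").splitlines()
    first = 1 if start_line is None else max(start_line, 1)
    last = len(lines) if end_line is None else min(end_line, len(lines))
    # Line numbers are 1-indexed and the range is inclusive, matching the schema.
    return "\n".join(f"{n}: {lines[n - 1]}" for n in range(first, last + 1))
```

Returning line numbers alongside the text is a design choice: it lets the agent ask for a narrower range on the next call instead of re-reading the whole file.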

Writing Code #

The edit tool is the most consequential tool in the set, and the hardest to get right. There are three common approaches, each with different trade-offs.

Full file rewrite. The agent outputs the entire file contents. Simple to implement, but expensive for large files and prone to accidentally dropping code the model forgot to include. Works well for small files and new file creation.

Search-and-replace. The agent specifies an exact string to find and the string to replace it with. Precise and cheap, but brittle — if the search string matches multiple locations or matches nothing because of whitespace differences, the edit fails.

Diff-based. The agent outputs a unified diff. Compact and explicit, but models struggle to get the line counts in diff headers right. The format requires counting lines before writing, which is exactly the kind of bookkeeping that language models are bad at.

In practice, search-and-replace tends to strike the best balance. It forces the agent to be precise about what it is changing, the output is short enough to keep token costs down, and failures are easy to detect and report back.

{
    "name": "edit_file",
    "description": "Replace an exact string in a file with new content. "
                   "The old_string must match exactly one location in the file. "
                   "Include 2-3 lines of surrounding context to ensure a unique match.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "description": "Relative path to the file."
            },
            "old_string": {
                "type": "string",
                "description": "Exact text to find. Must appear exactly once in the file."
            },
            "new_string": {
                "type": "string",
                "description": "Text to replace old_string with."
            }
        },
        "required": ["path", "old_string", "new_string"],
        "additionalProperties": False
    }
}

The description does real work here. "Must appear exactly once" tells the model to include enough surrounding context to disambiguate. Without that instruction, the model tends to match on a single line, which often appears more than once. The error handling is straightforward: if the string matches zero times, tell the agent "no match found"; if it matches more than once, tell the agent "ambiguous match, include more context." Both are observations the model can act on in the next turn.
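
The error handling described above fits in a few lines. This is a sketch under assumptions: the function name apply_edit and the (content, message) return shape are invented for illustration, and a real handler would also read and write the file.

```python
def apply_edit(content: str, old_string: str, new_string: str) -> tuple[str, str]:
    """Return (new_content, message). Content changes only on exactly one match."""
    # An empty old_string would "match" everywhere, so treat it as no match.
    count = content.count(old_string) if old_string else 0
    if count == 0:
        return content, "no match found"
    if count > 1:
        return content, f"ambiguous match ({count} occurrences), include more context"
    return content.replace(old_string, new_string, 1), "ok"
```

Because a failed edit leaves the content untouched and returns a message, the caller can pass that message straight back to the model as the observation for its next turn.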

Running Commands #

The agent needs to execute commands in the project's environment — running tests, linters, type checkers, build tools. A single run_command tool with a sandboxed shell is the most flexible approach.

{
    "name": "run_command",
    "description": "Execute a shell command in the project directory. "
                   "Use to run tests, linters, type checkers, or build commands. "
                   "Returns stdout, stderr, and the exit code. "
                   "Commands time out after 60 seconds.",
    "parameters": {
        "type": "object",
        "properties": {
            "command": {
                "type": "string",
                "description": "The shell command to run."
            }
        },
        "required": ["command"],
        "additionalProperties": False
    }
}

The timeout is important. Without it, generated code that loops forever or a command that hangs will block the agent indefinitely. Sixty seconds is generous for most test suites; adjust based on your codebase.

The sandbox is even more important. A coding agent with unrestricted shell access can rm -rf /, install backdoors, or exfiltrate source code. At minimum, the sandbox should restrict filesystem access to the project directory, block network access to anything outside the test environment, and run commands as an unprivileged user. Container-based sandboxes — a fresh container per task, destroyed after completion — are the gold standard for isolation.
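
To make the timeout behavior concrete, here is a minimal sketch of a run_command handler built on Python's subprocess module. The return shape and the timed_out flag are assumptions for this example. The function provides no isolation of its own; it is meant to run inside a sandbox like the one described above, on a POSIX-style shell.

```python
import subprocess


def run_command(command: str, cwd: str = ".", timeout: float = 60.0) -> dict:
    """Run a shell command and return stdout, stderr, and the exit code."""
    try:
        proc = subprocess.run(
            command, shell=True, cwd=cwd,
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Report the timeout as an observation instead of raising,
        # so the agent can react to it on its next turn.
        return {"stdout": "", "stderr": f"timed out after {timeout}s",
                "exit_code": -1, "timed_out": True}
    return {"stdout": proc.stdout, "stderr": proc.stderr,
            "exit_code": proc.returncode, "timed_out": False}
```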

The Feedback Loop #

The feedback loop is what makes coding agents work. It is also what makes them different from agents in other domains. The loop follows a pattern: edit, run, observe, fix. The tighter this loop, the more reliably the agent converges on a working solution.

Edit-Test-Fix #

The simplest and most common feedback loop is edit-test-fix. The agent makes a change, runs the tests, and if they fail, reads the error output and tries again.

MAX_ATTEMPTS = 5

def edit_test_fix(task: str, test_command: str) -> dict:
    # call_model, run_agent_loop, execute_command, truncate, and CODING_TOOLS
    # are assumed to be defined elsewhere.
    plan = call_model(
        prompt=f"Analyze this task and plan your changes:\n{task}",
        tools=CODING_TOOLS,
    )

    # Initial implementation: the agent edits files using tools
    # (may take multiple tool calls). This runs once, before the test loop,
    # so later fixes are not overwritten by a fresh implementation pass.
    run_agent_loop(
        goal=f"Implement the changes:\n{plan}",
        tools=CODING_TOOLS,
        max_steps=10,
    )

    for attempt in range(1, MAX_ATTEMPTS + 1):
        # Run the test suite
        result = execute_command(test_command, timeout=60)

        if result.exit_code == 0:
            return {"status": "success", "attempts": attempt}

        if attempt == MAX_ATTEMPTS:
            break  # out of attempts; skip a fix that would never be tested

        # Feed errors back to the agent
        run_agent_loop(
            goal=f"""The tests failed. Fix the code.

Test command: {test_command}
Exit code: {result.exit_code}
stdout:
{truncate(result.stdout, max_tokens=2000)}
stderr:
{truncate(result.stderr, max_tokens=2000)}""",
            tools=CODING_TOOLS,
            max_steps=10,
        )

    return {"status": "failed", "attempts": MAX_ATTEMPTS}

A few things matter here. The error output is truncated before it goes back into the prompt. A test suite with 200 failing tests produces output that can overwhelm the context window and drown the signal in noise. Truncating to the first few failures keeps the agent focused on the immediate problem.
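
The truncate helper used in the loop is not defined in this article. One possible sketch follows, assuming a rough ratio of four characters per token; that ratio is an approximation chosen for illustration, and a real implementation would count with the model's own tokenizer.

```python
def truncate(text: str, max_tokens: int, chars_per_token: int = 4) -> str:
    """Keep the start of the text within an approximate token budget."""
    limit = max_tokens * chars_per_token
    if len(text) <= limit:
        return text
    # Cut at the last complete line inside the limit so no line is split.
    cut = text.rfind("\n", 0, limit)
    kept = text[:cut] if cut > 0 else text[:limit]
    omitted = len(text) - len(kept)
    # Tell the model that output was dropped, and how much.
    return f"{kept}\n... [truncated {omitted} characters]"
```

Keeping the head of the output matches the "first few failures" strategy. Some test runners print their summary at the end, in which case keeping the tail, or both ends, may serve better.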

The attempt limit prevents the agent from looping forever on a problem it cannot solve. Five attempts is a reasonable default — most fixable issues resolve in two or three rounds. If the agent has not converged by attempt five, the problem usually requires a different approach, not more iterations.

Layered Feedback #

Tests are the strongest signal, but they are not the only one. A production coding agent typically runs multiple feedback tools in sequence, from cheapest to most expensive:

  1. Syntax check. Does the code parse? This catches missing brackets, bad indentation, and other surface errors. It runs in milliseconds and costs nothing.
  2. Linter / static analysis. Does the code follow the project's style rules? Are there unused imports, unreachable code, or common mistakes? This catches a class of errors that tests might miss.
  3. Type checker. Do the types align? In typed languages, the type checker catches interface mismatches before the tests even run.
  4. Tests. Does the code do what it should? This is the definitive check, but also the slowest.

  Edit
   │
   ▼
┌────────────┐  fail   ┌──────────────────┐
│  Syntax    │────────▶│  Fix and retry   │
│  check     │         └──────────────────┘
└─────┬──────┘
      │ pass
      ▼
┌────────────┐  fail   ┌──────────────────┐
│  Linter    │────────▶│  Fix and retry   │
└─────┬──────┘         └──────────────────┘
      │ pass
      ▼
┌────────────┐  fail   ┌──────────────────┐
│  Type      │────────▶│  Fix and retry   │
│  check     │         └──────────────────┘
└─────┬──────┘
      │ pass
      ▼
┌────────────┐  fail   ┌──────────────────┐
│  Tests     │────────▶│  Fix and retry   │
└─────┬──────┘         └──────────────────┘
      │ pass
      ▼
   Done

Running checks in this order catches cheap errors early. There is no point burning a 60-second test run when the file has a syntax error the parser can find instantly. Each layer acts as a gate — the same concept from prompt chaining — stopping the flow before more expensive work runs on broken code.
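
The gating logic can be sketched as a loop that stops at the first failing check. The CHECKS list and the first_failure function are illustrative names; the commands assume a Python project laid out like the examples in this article, and run is any callable that returns a dict with an exit_code key.

```python
from typing import Callable, Optional

# Ordered from cheapest to most expensive. The commands are illustrative.
CHECKS = [
    ("syntax", "python -m compileall -q src/"),
    ("lint", "ruff check src/"),
    ("types", "mypy src/"),
    ("tests", "pytest tests/"),
]


def first_failure(checks, run: Callable[[str], dict]) -> Optional[dict]:
    """Run checks in order; return the first failing one, or None if all pass."""
    for name, command in checks:
        result = run(command)
        if result["exit_code"] != 0:
            # Stop here: the slower checks would only run on broken code.
            return {"check": name, "command": command, **result}
    return None
```

The returned dict names the layer that failed, which lets the fix prompt say "the linter failed" rather than handing the model an undifferentiated error dump.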

Error Compaction #

As the agent iterates, the conversation history grows. Every failed test, every error traceback, every intermediate edit — it all accumulates in the context window. By round three, the context might be full of obsolete errors from earlier attempts that the agent already fixed. This is noise, and noise degrades the model's reasoning.

Error compaction is the practice of summarizing or pruning older feedback before injecting it back into the prompt. The simplest approach is to keep only the most recent error output and drop everything older. A more sophisticated approach is to summarize what was tried and why it failed, giving the agent a compressed history.

import json

def compact_history(attempts: list[dict], keep_last_n: int = 2) -> str:
    # call_model and format_full_history are assumed to be defined elsewhere.
    # With few attempts there is nothing worth compressing.
    if len(attempts) <= keep_last_n:
        return format_full_history(attempts)

    # Older attempts get summarized; the most recent ones are kept verbatim.
    old = attempts[:-keep_last_n]
    recent = attempts[-keep_last_n:]

    summary = call_model(
        prompt=f"""Summarize these previous failed attempts in 2-3 sentences.
Focus on what was tried and why it failed.

{json.dumps(old, indent=2)}""",
        max_tokens=200,
    )

    return f"Previous attempts summary: {summary}\n\n{format_full_history(recent)}"

This applies the principle from the context engineering article directly: the context window is a fixed budget, and you get the best results by filling it with the most relevant information, not with everything that has accumulated.

Sandboxing and Safety #

A coding agent that can run arbitrary shell commands is a powerful tool and a serious risk. The same capabilities that let it run tests and install dependencies also let it execute anything. Sandboxing is not optional — it is a load-bearing part of the architecture.

Isolation Boundaries #

The minimal isolation model uses containers. Each task gets a fresh container with the project code mounted in. The agent can do whatever it wants inside the container, and when the task finishes, the container is destroyed. Only the file changes are extracted.

┌──────────────────────────────────┐
│          Host System             │
│                                  │
│  ┌────────────────────────────┐  │
│  │       Container            │  │
│  │                            │  │
│  │  ┌──────────┐  ┌────────┐  │  │
│  │  │  Agent   │  │Project │  │  │
│  │  │  process │  │  code  │  │  │
│  │  └──────────┘  └────────┘  │  │
│  │                            │  │
│  │  No network (or allowlist) │  │
│  │  No host filesystem access │  │
│  │  Resource limits (CPU/RAM) │  │
│  └────────────────────────────┘  │
│                                  │
└──────────────────────────────────┘

The container provides three things. Filesystem isolation: the agent cannot read or modify anything outside the project. Network isolation: the agent cannot make outbound calls (or only to an allowlist of package registries). Resource limits: CPU and memory caps prevent runaway processes from affecting the host.

Some systems add a second layer: a command allowlist. Instead of giving the agent a raw shell, they define which commands the agent can run — pytest, npm test, cargo build — and reject anything else. This trades flexibility for safety. An agent that can only run pre-approved commands cannot curl its way to trouble, but it also cannot troubleshoot unexpected environment issues.
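
A command allowlist can be sketched as a prefix check on the tokenized command. Everything here is an assumption for illustration: the ALLOWED_PREFIXES entries, the set of rejected shell metacharacters, and the is_allowed name. It is not a complete defense on its own, since an allowed test runner still executes whatever code is in the project.

```python
import shlex

# Hypothetical allowlist: each entry is a command prefix, token by token.
ALLOWED_PREFIXES = [["pytest"], ["npm", "test"], ["cargo", "build"]]
# Characters that let a shell chain, redirect, group, or substitute commands.
SHELL_METACHARS = set(";|&$<>`()\n")


def is_allowed(command: str) -> bool:
    """Accept a command only if it is plain and starts with an allowed prefix."""
    if any(ch in SHELL_METACHARS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # for example, unbalanced quotes
        return False
    return any(tokens[:len(prefix)] == prefix for prefix in ALLOWED_PREFIXES)
```

Rejecting metacharacters before tokenizing matters: without that step, "pytest tests/ ; curl ..." would pass the prefix check and still run a second command in a shell.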

The Least Privilege Principle #

Even inside a sandbox, the agent should operate with the minimum permissions it needs. If the task is fixing a bug in one module, the agent does not need write access to the entire codebase. Scoping access narrows the blast radius when things go wrong.

In practice, this means:

  • Read-only by default. The agent can read any file, but only write to files in the directories relevant to the task.
  • No credential access. API keys, database passwords, and secrets should be stripped from the environment before the agent starts.
  • Audit logging. Every file write and command execution is logged. If the agent does something unexpected, you have a full trace to review.
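
The first two points can be sketched as small guard functions that the tool layer calls before acting. The names can_write and scrub_env, and the list of credential markers, are assumptions for this example; a real deployment would more likely start the agent from an explicit allowlist of environment variables.

```python
from pathlib import Path


def can_write(project_root: str, path: str, writable_dirs: list[str]) -> bool:
    """True only if path resolves inside one of the task's writable directories."""
    root = Path(project_root).resolve()
    # Resolving first collapses '..' segments, so traversal cannot escape the scope.
    target = (root / path).resolve()
    return any(
        target.is_relative_to((root / d).resolve()) for d in writable_dirs
    )


def scrub_env(env: dict[str, str],
              markers: tuple[str, ...] = ("KEY", "TOKEN", "SECRET", "PASSWORD"),
              ) -> dict[str, str]:
    """Drop environment variables whose names look like credentials."""
    return {name: value for name, value in env.items()
            if not any(marker in name.upper() for marker in markers)}
```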

Codebase-Scale Context #

A major challenge for coding agents is context. Real codebases are large — thousands of files, hundreds of thousands of lines. The agent cannot read everything, so it needs strategies for finding the right code to read.

Search Before Edit #

The most important habit for a coding agent is search before edit. Before changing anything, the agent should find all the places a function is called, find where a type is defined, and find the tests that cover the code it is about to change. This is the same thing experienced developers do, and it prevents a whole class of errors where the agent fixes one call site but breaks three others it did not know about.

This is why the search_code tool is as important as the edit_file tool. Without it, the agent is editing blind.
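
For reference, a search_code handler does not need to be elaborate. The sketch below uses only the standard library; the project_root argument and the "path:line: text" output format are assumptions, and a production version would likely shell out to a dedicated search tool and skip ignored directories.

```python
import re
from pathlib import Path
from typing import Optional


def search_code(project_root: str, pattern: str,
                file_glob: Optional[str] = None) -> list[str]:
    """Return 'path:line_number: line' for every line matching the pattern."""
    root = Path(project_root)
    regex = re.compile(pattern)
    hits = []
    for path in sorted(root.rglob(file_glob or "*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append(f"{path.relative_to(root)}:{number}: {line.strip()}")
    return hits
```

Searching for a function name followed by an opening parenthesis returns the definition and every call site in one observation, which is the information the agent needs before it changes a signature.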

Providing Context Upfront #

The agent's system prompt should include structural context about the project: the language, the framework, the test runner, the directory layout. This is the same orientation a new developer would get on their first day.

SYSTEM_PROMPT = """You are a coding agent working on a Python project.

Project structure:
- src/         Application source code
- tests/       Test files (pytest)
- pyproject.toml  Project configuration

Conventions:
- Tests mirror source structure: src/auth/login.py -> tests/auth/test_login.py
- Run tests with: pytest tests/
- Run linter with: ruff check src/
- Type checking: mypy src/

When making changes:
1. Read the relevant source files first
2. Search for all usages of any function or class you plan to change
3. Edit the code
4. Run the linter, then the tests
5. Fix any failures before finishing
"""

This prompt does not tell the model how to code. It tells the model how this project works — which is exactly the information it cannot infer from the code alone.

Retrieval-Augmented Context #

For large codebases, static system prompt context is not enough. The agent needs dynamic retrieval — the ability to find relevant code based on the task. This is where the RAG patterns come in.

A code-aware retrieval system indexes the codebase by file, by function, and by symbol. When the agent receives a task, the system retrieves the most relevant files and injects them into the context before the agent starts reasoning. This pre-fetching gives the agent a head start: it can spend its first tool calls on the change itself rather than on locating the code.

def prepare_context(task: str, codebase_index) -> str:
    # Retrieve relevant files based on the task description
    relevant_files = codebase_index.search(task, top_k=10)

    context = "Relevant files for this task:\n\n"
    for file_info in relevant_files:
        context += f"--- {file_info.path} ---\n"
        context += file_info.content + "\n\n"

    return context

The risk with pre-fetching is retrieving the wrong files. If the retrieval is wrong, the agent starts with a misleading picture of the codebase and may make changes that look correct in isolation but conflict with code it never saw. Evaluation of retrieval quality — tracking how often the retrieved files include the files the agent actually ends up editing — is essential for tuning.
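
The metric described above is a recall computed per task. A minimal sketch, with retrieval_recall as a hypothetical name and file paths as plain strings:

```python
def retrieval_recall(retrieved: list[str], edited: list[str]) -> float:
    """Fraction of the files the agent edited that retrieval had surfaced."""
    edited_set = set(edited)
    if not edited_set:
        return 1.0  # nothing was edited, so nothing was missed
    return len(set(retrieved) & edited_set) / len(edited_set)
```

For example, if retrieval returned a.py, b.py, and c.py and the agent ended up editing b.py and d.py, the recall is 0.5. Averaging this over many tasks shows whether changes to the index or to top_k are helping.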

Multi-File Changes and the Orchestrator Pattern #

Simple bug fixes touch one file. Real features touch ten. When a task requires changes across multiple files — a new API endpoint that needs a route, a handler, a model, a migration, and tests — the single-loop agent pattern breaks down. The agent loses track of what it has already done, misses files that need updating, or makes changes that are internally consistent but conflict with each other.

This is where the orchestrator-workers pattern earns its place. A central agent — the orchestrator — analyzes the task, identifies which files need to change and what each change should accomplish, then delegates each file change to a worker. Each worker operates in its own focused context with just the relevant file and its dependencies.

  Task: "Add user profile endpoint"
              │
              ▼
       ┌──────────────┐
       │ Orchestrator │
       │ (plan files  │
       │  to change)  │
       └──────┬───────┘
              │
  ┌───────────┼────────┬────────┬──────────┐
  ▼           ▼        ▼        ▼          ▼
┌───────┐┌────────┐┌──────┐┌──────────┐┌──────┐
│route  ││handler ││model ││migration ││tests │
│       ││        ││      ││          ││      │
└──┬────┘└──┬─────┘└──┬───┘└───┬──────┘└──┬───┘
   │        │         │        │          │
   └────────┴─────────┴────────┴──────────┘
                 │
                 ▼
          ┌──────────────┐
          │ Orchestrator │
          │ (verify all  │
          │  changes)    │
          └──────────────┘

The orchestrator's plan is the critical output. It needs to specify not just which files to change, but how each change relates to the others — the route handler expects a response shape that the model class must produce, the migration must create the columns the model references, the tests must cover the new endpoint. Without this coordination, the workers produce changes that do not fit together.

After all workers complete, the orchestrator runs the full test suite against all changes together. If something breaks, the orchestrator identifies which file change caused the failure and delegates the fix to a targeted worker. This synthesis step is what distinguishes orchestrator-workers from just running five independent edits — it ensures the combined result is coherent.
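
One way to make the plan's cross-file relationships explicit is to give each planned change a list of the files it must agree with, and to include those files' intents in the worker's instructions. The FileChange structure and worker_brief function below are hypothetical, a sketch of the idea rather than a prescribed format.

```python
from dataclasses import dataclass, field


@dataclass
class FileChange:
    path: str
    intent: str  # what this change should accomplish
    depends_on: list[str] = field(default_factory=list)  # paths it must agree with


def worker_brief(change: FileChange, plan: list[FileChange]) -> str:
    """Build a worker's instructions, including the intents of related files."""
    by_path = {c.path: c for c in plan}
    lines = [f"Edit {change.path}: {change.intent}"]
    for dep in change.depends_on:
        # Each worker sees what its neighbors are supposed to do, not their full context.
        lines.append(f"Must stay consistent with {dep}: {by_path[dep].intent}")
    return "\n".join(lines)
```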

Cost and Iteration Budgets #

Coding agents burn tokens fast. Every file read, every test run's output, every error traceback — it all flows through the context window. A complex bug fix might take ten tool calls across three edit-test-fix iterations, each one consuming input and output tokens.

The budgeting strategies from workflow orchestration apply directly:

Token budgets. Set a maximum total token spend per task. When the budget runs out, the agent returns whatever it has — partial fix, work-in-progress branch — rather than continuing. This prevents runaway costs on tasks the agent is struggling with.

Iteration limits. Cap the number of edit-test-fix cycles. Three to five iterations covers the vast majority of fixable issues. If the agent has not solved it by iteration five, it probably needs a different approach, not more attempts at the same one.

Model tiering. Use a capable model for planning and the initial implementation, then drop to a cheaper model for fixing linter warnings and formatting issues. The creative work needs the big model; the mechanical cleanup does not.

BUDGET = {
    "max_tokens": 200_000,
    "max_iterations": 5,
    "max_tool_calls": 30,
    "models": {
        "plan": "large-model",
        "implement": "large-model",
        "fix_lint": "small-model",
        "fix_types": "small-model",
    }
}

Track actual token usage per task over time. A common pattern is that most tasks finish within a couple of iterations while a small tail of hard tasks accounts for most of the cost. Setting tighter budgets on that long tail, or routing those tasks to a human, usually saves more money than optimizing the average case.
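
Enforcing limits like these takes little code. The BudgetTracker class below is a hypothetical sketch: the agent loop would call record after each model call or tool call and stop as soon as exhausted returns a limit name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BudgetTracker:
    max_tokens: int
    max_iterations: int
    max_tool_calls: int
    tokens: int = 0
    iterations: int = 0
    tool_calls: int = 0

    def record(self, tokens: int = 0, iterations: int = 0, tool_calls: int = 0) -> None:
        """Add usage from one step to the running totals."""
        self.tokens += tokens
        self.iterations += iterations
        self.tool_calls += tool_calls

    def exhausted(self) -> Optional[str]:
        """Return the name of the first limit reached, or None if within budget."""
        if self.tokens >= self.max_tokens:
            return "max_tokens"
        if self.iterations >= self.max_iterations:
            return "max_iterations"
        if self.tool_calls >= self.max_tool_calls:
            return "max_tool_calls"
        return None
```

Returning the name of the limit, rather than a bare boolean, makes it easy to log why a task stopped, which is the data needed to tune the budgets later.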

When Coding Agents Fail #

Coding agents are impressive on focused tasks: fix this bug, add this test, refactor this function. They struggle in predictable ways on others.

Ambiguous requirements. "Make the auth module more robust" is not a task an agent can verify against. Without clear acceptance criteria — specific tests, specific behavior changes — the agent has no signal to guide its edits. It will make changes, but it cannot tell whether those changes are right.

Deep architectural changes. Moving a codebase from a monolith to microservices requires understanding that spans the entire system. The agent can see individual files, but it struggles to hold the full architectural picture in context. This is a context window limitation, not a reasoning limitation — the information simply does not fit.

Unfamiliar patterns. If the codebase uses a framework the model has rarely seen in training data, the agent's code will look plausible but miss idiomatic patterns. It will write code that compiles but does not follow the conventions the team expects.

Flaky tests. If the test suite is unreliable — tests that pass sometimes and fail other times for reasons unrelated to the code — the feedback loop is poisoned. The agent will chase phantom failures, burning iterations on problems that do not exist. A reliable test suite is a prerequisite, not a nice-to-have.
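
A cheap way to check this prerequisite before handing a task to the agent is to run the suite several times on unchanged code and see whether the outcome varies. The looks_flaky function is a hypothetical sketch; run_tests is any callable that runs the suite and returns its exit code.

```python
from typing import Callable


def looks_flaky(run_tests: Callable[[], int], runs: int = 3) -> bool:
    """Run the suite repeatedly on unchanged code; mixed outcomes suggest flakiness."""
    outcomes = {run_tests() == 0 for _ in range(runs)}
    # Both True (pass) and False (fail) observed with no code change in between.
    return len(outcomes) > 1
```

A consistent result over a few runs does not prove the suite is reliable, but a mixed result is a clear sign that its failures should not be fed to the agent as ground truth.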

The right response to these failure modes is not "make the agent smarter." Scope the agent's tasks to things it can verify, give it a codebase with good tests, and route the hard stuff to humans.

Conclusion #

Coding agents are the clearest example of what happens when agent fundamentals (the ReAct loop, tool use, feedback, planning) land in a domain with objective verification. The tight edit-test-fix cycle gives the agent something most domains cannot: a real signal to act on in every iteration.

Key takeaways:

  • Coding agents are ReAct loops with specialized tools for reading, editing, and executing code — the architecture is not new, but the domain makes the feedback loop uniquely effective
  • The edit tool format matters: search-and-replace gives the best balance of precision, cost, and failure transparency compared to full rewrites or diffs
  • Layered feedback — syntax, lint, types, tests — catches cheap errors early and reserves expensive test runs for code that already passes basic checks
  • Error compaction prevents the context window from filling with obsolete failures from earlier iterations
  • Sandboxing with containers, network isolation, and least-privilege access is a structural requirement, not an afterthought
  • The orchestrator-workers pattern handles multi-file changes by planning which files to touch, delegating each change to a focused worker, and verifying the combined result
  • Budget every task with token caps, iteration limits, and model tiering to prevent runaway cost on hard problems
  • Scope agent tasks to things the test suite can verify — ambiguous requirements and flaky tests undermine the feedback loop that makes coding agents work