Multi-Agent Systems

Everything up to this point has been about a single agent: one model, one tool set, one system prompt, one ReAct loop. That architecture handles a surprisingly wide range of tasks — from coding to computer use — but it starts to buckle when the task grows beyond what one agent can hold in its head. A legal review that requires checking regulatory compliance, verifying financial figures, and assessing reputational risk is not three steps in one plan — it is three different jobs, each needing different expertise, different tools, and potentially different models. Cramming all of that into one system prompt creates a bloated, confused agent that does everything poorly.

A multi-agent system splits work across multiple agents, each with its own system prompt, tool set, and responsibilities. The agents communicate — passing messages, sharing artifacts, requesting sub-tasks — and something coordinates the flow between them. That "something" is the central design question. Is it hardcoded in the developer's code? Is it a lead agent that delegates? Do the agents negotiate among themselves? Each answer leads to a different architecture with different trade-offs.

What follows covers why you would split work across agents and when it makes sense. The specific patterns — sequential, parallel, coordinator, hierarchical, swarm — each deserve their own deep dive. Here, we lay the conceptual foundation: what a multi-agent system actually is, the design forces that push you toward one, the anatomy of inter-agent communication, and the costs you take on the moment you go multi-agent.

Why Multiple Agents

A single agent hits practical limits in several dimensions. Understanding these limits is the first step toward knowing when multi-agent systems earn their complexity.

Context Window Pressure

Every agent runs inside a finite context window. The system prompt, tool definitions, conversation history, tool results — they all compete for the same token budget. A single agent responsible for regulatory compliance, financial analysis, and risk assessment needs tool definitions for all three domains, background knowledge for all three, and enough working memory to switch between them. The context fills up fast, and as it fills, performance degrades — the model loses track of earlier information, follows instructions less precisely, and makes more errors.

Splitting into three specialized agents gives each one a focused context. The compliance agent's context window contains only compliance-relevant tools, rules, and documents. Its system prompt is short and specific. It does not need to know anything about financial modeling, so it never has to hold that information. Each agent uses its context budget for what it actually needs, which means better recall, more precise instruction following, and fewer errors.

Specialization

A single system prompt that tries to cover multiple domains ends up being a long, contradictory document. "You are an expert in regulatory compliance, financial analysis, and reputational risk assessment" is not a meaningful instruction; it is three different jobs stapled together. In practice the model struggles to act as an expert in all three at once: it tends to favor whichever domain the most recent messages emphasize and lets the others drift.

Separate agents allow separate system prompts, each tuned for a specific role. The compliance agent has a prompt that says "You are a regulatory compliance specialist. Your job is to identify violations of the following regulations..." — clear, focused, and testable. The financial agent has a completely different prompt with different constraints and different examples. This is the same principle behind prompt engineering: a focused, specific prompt outperforms a general one. Multi-agent systems are, at one level, just a way to use multiple focused prompts on different parts of the same problem.

Tool Isolation

Agents get better at tool selection when they have fewer tools to choose from. A single agent with thirty tools — database queries, web search, document parsers, calculators, email senders, code executors — has to pick the right one at every step from a large menu. Models are good at this, but not perfect. The more tools in the list, the higher the chance of picking the wrong one or hallucinating a tool that does not exist.

Splitting into agents that each own a small, focused tool set reduces this problem. The research agent has search tools. The coding agent has file operations and a shell. The communication agent has email and messaging tools. Each agent's tool selection problem is simpler, which makes it more reliable.

Model Routing

Not every sub-task needs the same model. A classification step that routes incoming requests can use a small, fast, cheap model. A complex reasoning step that synthesizes findings from multiple sources needs a large, capable, expensive model. A single-agent architecture forces every step through the same model. A multi-agent architecture lets each agent use the model that matches its task — small models for simple tasks, large models for hard ones. This is a direct cost optimization: you pay for capability only where you need it.

                    ┌──────────────────────────────┐
                    │       Incoming Task          │
                    └──────────────┬───────────────┘
                                   │
                                   ▼
                    ┌──────────────────────────────┐
                    │     Router / Coordinator     │
                    │     (small, fast model)      │
                    └──────┬──────────┬────────────┘
                           │          │
              ┌────────────┘          └─────────────┐
              ▼                                     ▼
  ┌───────────────────────┐           ┌───────────────────────┐
  │   Specialist Agent A  │           │   Specialist Agent B  │
  │  (large model, tools  │           │  (medium model, tools │
  │   for domain A)       │           │   for domain B)       │
  └───────────────────────┘           └───────────────────────┘
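
The routing idea can be sketched in a few lines. This is a minimal illustration, not a recommended implementation: the model names (`small-model`, `medium-model`, `large-model`) and task types are placeholders, and a crude keyword classifier stands in for the small router model.

```python
from dataclasses import dataclass

# Hypothetical model tiers; the names are placeholders, not real model IDs.
MODEL_FOR_TASK = {
    "classify": "small-model",
    "extract": "medium-model",
    "synthesize": "large-model",
}


@dataclass
class Route:
    task_type: str
    model: str


def classify_task(task: str) -> str:
    """Stand-in for the small router model: a crude keyword classifier."""
    text = task.lower()
    if "summarize" in text or "compare" in text:
        return "synthesize"
    if "extract" in text or "find" in text:
        return "extract"
    return "classify"


def route(task: str) -> Route:
    """Pick the cheapest model tier assumed to be adequate for the task."""
    task_type = classify_task(task)
    return Route(task_type=task_type, model=MODEL_FOR_TASK[task_type])
```

In a real system the classifier would itself be a cheap model call, and the mapping from task type to model would be tuned against measured quality and cost.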

Independent Evaluation

When a single agent produces output and evaluates it, the evaluator is biased — it generated the output in the first place. This is why the evaluator-optimizer pattern from workflow orchestration works better than self-evaluation: the evaluator is a separate model call with a separate prompt. Multi-agent systems take this further. A dedicated reviewer agent, with its own system prompt and its own criteria, can evaluate the output of a generator agent without the conflict of interest that comes from self-evaluation. The reviewer never saw the generation process, so it judges the output on its own terms.
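
A generator and reviewer pair might be wired together as below. This is a sketch under stated assumptions: `generate_with_review` and both stub functions are hypothetical, and the stubs stand in for real model calls with separate system prompts.

```python
from typing import Callable, Tuple

Generator = Callable[[str, str], str]          # (task, feedback) -> draft
Reviewer = Callable[[str], Tuple[bool, str]]   # draft -> (approved, feedback)


def generate_with_review(task: str, generate: Generator, review: Reviewer,
                         max_rounds: int = 3) -> Tuple[str, bool]:
    """Alternate generation and independent review until approval or the cap."""
    feedback = ""
    draft = ""
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        # The reviewer sees only the draft, never the generator's history.
        approved, feedback = review(draft)
        if approved:
            return draft, True
    return draft, False


# Stub agents standing in for real model calls (illustrative only).
def stub_generate(task: str, feedback: str) -> str:
    return f"{task} [cited]" if "citation" in feedback else task


def stub_review(draft: str) -> Tuple[bool, str]:
    if "[cited]" in draft:
        return True, ""
    return False, "Add a citation."
```

The key property is that `review` receives nothing but the draft, which mirrors the point that the reviewer never saw the generation process.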

When to Stay Single-Agent

Multi-agent systems are not an upgrade you apply to every project. They are overhead, and like all overhead, they need to justify themselves. Here is when a single agent is the better choice.

The task fits one domain. If the agent only needs one set of tools and one area of expertise — answering customer questions from a knowledge base, generating code from a spec, summarizing documents — a single agent with a well-crafted system prompt is simpler and cheaper.

The context window is not under pressure. If the system prompt, tool definitions, and typical conversation fit comfortably within the model's context window without degradation, there is no reason to split. Splitting introduces communication overhead that usually costs more than a slightly crowded context.

Latency matters. Every inter-agent message adds at least one model call. A multi-agent system with three sequential agents turns one model call into at least three, plus the overhead of message passing and coordination. For real-time applications where every second counts, the simplest single-agent architecture is often the only one that meets the latency budget.

You cannot define clear boundaries. If you cannot draw a clean line between responsibilities — if every sub-task depends heavily on what every other sub-task is doing — then splitting into agents just moves the complexity into the communication layer. Agents work best when they have cohesive responsibilities and loose coupling between them, the same principles that govern good software architecture.

The decision framework is pragmatic: start with a single agent. Measure where it fails. If it fails because the context is overloaded, the system prompt is incoherent, or tool selection is unreliable — those are signals that splitting might help. Do not start multi-agent because it sounds more sophisticated. Start multi-agent because the single agent is measurably failing.

Anatomy of a Multi-Agent System

Every multi-agent system has three structural elements: the agents themselves, the communication mechanism, and the coordination logic. Understanding these pieces gives you a vocabulary for comparing different architectures.

Agents as Components

Each agent in a multi-agent system is a self-contained unit with its own:

  • System prompt — defining the agent's role, expertise, and constraints
  • Tool set — the specific tools this agent can use
  • Model — potentially different from other agents (size, capability, cost)
  • Memory — its own conversation history and, optionally, shared memory with other agents

An agent does not need to know it is part of a multi-agent system. From its perspective, it receives a message, processes it using its tools and reasoning, and produces a response. Whether the message came from another agent or from a user is often invisible to the agent, because the communication protocol handles the framing.

class Agent:
    """A single agent within a multi-agent system."""

    def __init__(self, name, system_prompt, tools, model, max_steps=20):
        self.name = name
        self.system_prompt = system_prompt
        self.tools = tools
        self.model = model
        # Cap on loop iterations so a confused agent cannot spin forever.
        self.max_steps = max_steps
        self.conversation_history = []

    def run(self, message):
        """Run the agent's ReAct loop on a message and return a response."""
        # call_model, execute_tools, and tool_results_message are helpers
        # assumed to exist from the single-agent loop; they are not defined here.
        self.conversation_history.append({"role": "user", "content": message})

        for _ in range(self.max_steps):
            response = call_model(
                model=self.model,
                system=self.system_prompt,
                messages=self.conversation_history,
                tools=self.tools,
            )
            # Record the model's turn before acting on it.
            self.conversation_history.append(response.to_message())

            if not response.has_tool_calls:
                return response.text  # No tool calls means a final answer.

            results = execute_tools(response.tool_calls, self.tools)
            self.conversation_history.append(tool_results_message(results))

        raise RuntimeError(f"{self.name} exceeded {self.max_steps} steps")

This is the same ReAct loop we covered earlier, wrapped in a class. Each agent is an instance with different parameters. The multi-agent architecture lives outside this class — in the code that creates agents, routes messages between them, and assembles their outputs.

Communication

Agents need to exchange information. The communication mechanism determines how messages flow, what format they take, and who can talk to whom. There are four common structures.

Direct messaging. Agent A sends a message directly to Agent B and waits for a response. This is the simplest pattern — it looks like a function call. The coordinator (or the developer's code) knows which agent to call and passes the message explicitly. It works well for sequential workflows where the flow is predetermined.

Shared message pool. All agents publish messages to a shared space. Each agent subscribes to messages relevant to its role, based on tags or topics. This decouples the agents — the sender does not need to know the receiver. It works well when multiple agents might need the same information and you want to avoid redundant messages.

Hierarchical messaging. Messages flow through a tree structure. A coordinator sends tasks to sub-agents, sub-agents might delegate further to their own sub-agents, and results flow back up. Each agent only communicates with its parent and children. This gives clean separation of concerns but adds latency proportional to the tree depth.

Peer-to-peer. Every agent can message any other agent directly, without going through a coordinator. This is the most flexible but hardest to debug — message flows can be complex and hard to trace. Swarm architectures use this pattern.

Direct Messaging          Shared Message Pool

  A ──────▶ B              A ──▶ ┌───────┐ ◀── C
  │                              │ Pool  │
  └──────▶ C               B ──▶ └───────┘ ◀── D
  Fixed routes.             Publish/subscribe.

Hierarchical              Peer-to-Peer

       A                    A ◀──▶ B
      / \                   │ ╲   ╱ │
     B   C                  │  ╲ ╱  │
    / \                     ▼   ╳   ▼
   D   E                    C ◀──▶ D
  Tree routing.             Any-to-any.

The choice of communication structure has direct consequences for debuggability, latency, and cost. Direct messaging and hierarchical messaging produce linear or tree-shaped traces that are easy to follow. Shared message pools and peer-to-peer create complex graphs that are harder to trace and reason about.
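
Of the four structures, the shared message pool is the least like an ordinary function call, so a small sketch may help. This is an in-memory illustration only; the `MessagePool` class and its method names are hypothetical, and a production system would more likely sit on a real message broker.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class MessagePool:
    """In-memory publish/subscribe pool keyed by topic."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = {}
        self.log: List[dict] = []  # every message, kept for tracing

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler that receives every message on a topic."""
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, sender: str, topic: str, content: str) -> None:
        """Record a message and deliver it to the topic's subscribers."""
        message = {"sender": sender, "topic": topic, "content": content}
        self.log.append(message)
        # The sender does not know who, if anyone, receives the message.
        for handler in self._subscribers.get(topic, []):
            handler(message)
```

A writer agent that calls `pool.subscribe("findings", handler)` receives everything the researcher publishes under that topic without either agent referring to the other by name, which is the decoupling described above.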

Message Format

What agents send each other matters as much as how they send it. The simplest format is plain text — one agent's natural language output becomes another agent's input. This works, but it has a problem: the receiving agent has to parse the meaning from unstructured text, which is ambiguous and lossy.

A better approach is structured handoffs: the message between agents includes metadata alongside the content.

handoff = {
    "from_agent": "researcher",
    "to_agent": "writer",
    "task": "Write a summary of the following findings.",
    "artifacts": [
        {
            "type": "research_findings",
            "content": "The study found that...",
            "confidence": 0.85,
            "sources": ["doc_a.pdf", "doc_b.pdf"],
        }
    ],
    "constraints": {
        "max_length": 500,
        "tone": "formal",
        "audience": "executive",
    },
}

The receiving agent's system prompt tells it how to interpret these fields. This structure makes handoffs predictable and testable — you can validate the schema, check that required fields are present, and log the exact payload for debugging.
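
Validating such a payload could look like the sketch below. The set of required fields and the rule that `confidence` lies in [0, 1] are assumptions made for illustration; the handoff format above does not prescribe them.

```python
# Assumed required fields and their types; adjust to your own schema.
REQUIRED_FIELDS = {
    "from_agent": str,
    "to_agent": str,
    "task": str,
    "artifacts": list,
}


def validate_handoff(handoff: dict) -> list:
    """Return a list of problems; an empty list means the handoff is valid."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in handoff:
            problems.append(f"missing field: {field}")
        elif not isinstance(handoff[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")

    artifacts = handoff.get("artifacts")
    if isinstance(artifacts, list):
        for i, artifact in enumerate(artifacts):
            if not isinstance(artifact, dict):
                problems.append(f"artifacts[{i}] should be dict")
                continue
            confidence = artifact.get("confidence")
            if confidence is not None and not 0.0 <= confidence <= 1.0:
                problems.append(f"artifacts[{i}].confidence out of range")
    return problems
```

Returning a list of problems rather than raising on the first one makes the log entry for a rejected handoff more useful, since every issue is visible at once.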

Coordination Logic

Someone has to decide which agent runs when, what inputs it gets, and what to do with its output. This coordination logic is the backbone of a multi-agent system, and it takes one of two forms.

Developer-defined coordination means the developer writes code that controls the flow. Agent A runs, then its output goes to Agent B, then both outputs go to Agent C. This is the workflow approach from the orchestration article — predictable, debuggable, and easy to reason about.

Agent-driven coordination means a lead agent (sometimes called a coordinator, orchestrator, or manager) decides at runtime which agents to call, in what order, and how to combine their results. This is more flexible — the coordinator can adapt to unexpected inputs — but harder to predict and debug. The coordinator itself is a full agent with its own ReAct loop, and its decisions are non-deterministic.

# Developer-defined coordination: fixed topology
research = research_agent.run(task)
analysis = analysis_agent.run(research)
report = writing_agent.run(analysis)

# Agent-driven coordination: dynamic routing
coordinator = Agent(
    name="coordinator",
    system_prompt="""You are a project coordinator. Break the task into
    sub-tasks and delegate to the appropriate specialist agent.
    Available agents: researcher, analyst, writer.
    Call transfer_to_agent(agent_name, task_description) to delegate.""",
    tools=[transfer_to_agent],
    model="large-model",
)
result = coordinator.run(task)

The first approach is a pipeline — you can read the code and know exactly what will happen. The second approach is emergent — the coordinator decides the flow at runtime based on the task. In practice, most systems use developer-defined coordination for the overall structure and agent-driven coordination within specific nodes where flexibility is needed.
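
To make the contrast concrete without a model in the loop, here is a runnable sketch with stub agents. The `AGENTS` registry, `run_pipeline`, and `run_plan` are hypothetical names; in a real agent-driven system the plan would come from the coordinator model's tool calls rather than being passed in.

```python
from typing import Callable, Dict, List

AgentFn = Callable[[str], str]

# Stub specialists standing in for real agents (illustrative only).
AGENTS: Dict[str, AgentFn] = {
    "researcher": lambda task: f"findings({task})",
    "analyst": lambda task: f"analysis({task})",
    "writer": lambda task: f"report({task})",
}


def transfer_to_agent(agent_name: str, task_description: str) -> str:
    """Delegation tool: run the named agent, rejecting unknown names."""
    if agent_name not in AGENTS:
        raise ValueError(f"unknown agent: {agent_name}")
    return AGENTS[agent_name](task_description)


def run_pipeline(task: str) -> str:
    """Developer-defined coordination: the order is fixed in code."""
    result = task
    for name in ("researcher", "analyst", "writer"):
        result = transfer_to_agent(name, result)
    return result


def run_plan(task: str, plan: List[str]) -> str:
    """Agent-driven coordination: the order is decided at runtime.

    Here the plan is passed in; a coordinator model would normally emit it.
    """
    result = task
    for name in plan:
        result = transfer_to_agent(name, result)
    return result
```

Rejecting unknown agent names inside `transfer_to_agent` matters more in the agent-driven case, because the coordinator can name an agent that does not exist.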

Agent Profiling

How an agent is characterized — its role, personality, expertise, and constraints — is called profiling. Profiling determines what an agent pays attention to, how it reasons, and what it prioritizes. In a single-agent system, profiling is just the system prompt. In a multi-agent system, it becomes a design tool for creating differentiated specialists.

There are three approaches to profiling:

Pre-defined profiles. The developer writes each agent's persona by hand. "You are a senior regulatory compliance analyst with expertise in financial regulations. You are detail-oriented, cautious, and always cite specific regulation numbers." This is the most common approach and the easiest to control. You know exactly what each agent will prioritize because you wrote the instructions.

Model-generated profiles. A setup agent generates profiles for other agents based on the task. Given a task like "Review this legal contract," the setup agent might create a compliance agent, a liability agent, and a financial terms agent, writing system prompts for each. This is useful when the decomposition depends on the specific input, but it adds a layer of unpredictability — the generated prompts might be vague or overlapping.

Data-derived profiles. Agent profiles are constructed from real data — for example, creating simulated users from a dataset of user preferences, or deriving expert personas from a corpus of expert writings. This is primarily used in research and simulation, not in typical production agent systems.

For production systems, pre-defined profiles are the right default. They are predictable, testable, and debuggable. Model-generated profiles are a power tool for situations where the task structure is not known in advance — but they require validation and monitoring to ensure the generated agents are actually differentiated and useful.
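
One way to keep pre-defined profiles testable is to store them as data and render the system prompt from that data. The `AgentProfile` class below is a hypothetical sketch, and the prompt template is just one possible wording.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AgentProfile:
    """A hand-written profile: role, expertise, traits, and constraints."""
    role: str
    expertise: str
    traits: Tuple[str, ...] = ()
    constraints: Tuple[str, ...] = ()

    def to_system_prompt(self) -> str:
        """Render the profile as a system prompt string."""
        parts = [f"You are a {self.role} with expertise in {self.expertise}."]
        if self.traits:
            parts.append("You are " + ", ".join(self.traits) + ".")
        parts.extend(self.constraints)
        return " ".join(parts)


compliance = AgentProfile(
    role="senior regulatory compliance analyst",
    expertise="financial regulations",
    traits=("detail-oriented", "cautious"),
    constraints=("Always cite specific regulation numbers.",),
)
```

Because the profile is plain data, the same structure can hold model-generated profiles, which makes it easier to validate them (for example, checking that two generated roles are not identical) before any agent runs.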

Communication Paradigms

The way agents interact falls into three broad paradigms, and the choice shapes the entire system's behavior.

Cooperative

Agents work toward a shared goal, dividing labor and combining results. This is the most common paradigm in production systems. A research agent gathers information, an analysis agent processes it, a writing agent produces the final output. The agents are not competing or arguing — they are each doing their part.

Cooperative systems work best when:

  • The task decomposes cleanly into independent or sequential sub-tasks
  • Each agent's output becomes another agent's input
  • There is a clear definition of "done"

Debate

Agents present and defend competing viewpoints. A proposer generates an answer, an opponent critiques it, and they go back and forth until they converge or a judge decides. This paradigm relies on the observation that models are often better at evaluating an answer than at generating one, so the critic can catch errors the generator missed.

Debate is useful for:

  • Tasks where correctness matters more than speed (medical diagnosis, legal analysis)
  • Situations where a single perspective is likely to miss important considerations
  • Quality improvement through iterative refinement — the critique-revision loop

The trade-off is cost and latency. Every round of debate is multiple model calls, and convergence is not guaranteed. Without a stopping condition (maximum rounds, judge agent, convergence threshold), debate can loop indefinitely.
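
A stopping condition can be as simple as a round cap combined with a "no objection" signal from the critic. The sketch below assumes a hypothetical `debate` function and stub proposer and critic; a judge agent or a convergence threshold would slot in where the `None` check is.

```python
from typing import Callable, List, Optional, Tuple

Proposer = Callable[[str, Optional[str]], str]   # (question, critique) -> answer
Critic = Callable[[str, str], Optional[str]]     # (question, answer) -> critique or None


def debate(question: str, propose: Proposer, critique: Critic,
           max_rounds: int = 3) -> Tuple[str, List[Tuple[str, Optional[str]]]]:
    """Run propose/critique rounds; stop on no objection or at the cap."""
    objection: Optional[str] = None
    transcript: List[Tuple[str, Optional[str]]] = []
    answer = ""
    for _ in range(max_rounds):
        answer = propose(question, objection)
        objection = critique(question, answer)
        transcript.append((answer, objection))
        if objection is None:  # the critic has nothing left to object to
            break
    return answer, transcript


# Stubs standing in for two differently prompted agents (illustrative only).
def stub_propose(question: str, critique: Optional[str]) -> str:
    return "4" if critique else "5"


def stub_critique(question: str, answer: str) -> Optional[str]:
    return None if answer == "4" else "2 + 2 is not 5"
```

The cap guarantees termination even when the two agents never agree; the returned transcript shows how many rounds were actually spent.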

Competitive

Agents pursue their own goals, which may conflict with other agents' goals. This is primarily a research and simulation paradigm — game theory experiments, market simulations, negotiation modeling. It is rare in production systems because adversarial dynamics between your own agents are usually not what you want.

The Cost of Going Multi-Agent

Multi-agent systems are not free. Every benefit comes with a cost, and an honest assessment of these costs is essential before committing to the architecture.

Latency multiplication. Every inter-agent message involves at least one model call. A three-agent sequential pipeline turns one task into at least three model calls, plus tool calls within each agent. If the agents run in parallel, wall-clock time might be acceptable, but with sequential dependencies latency grows linearly with the number of agents in the chain.

Token cost. Each agent has its own system prompt, its own conversation history, and its own tool calls. The total token count of a multi-agent system is almost always higher than that of a single-agent system doing the same task, because each agent re-reads its own context independently. Shared context (the original task, background documents) is duplicated across agents.
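
A back-of-the-envelope estimate makes the latency and token costs tangible. The figures used in the usage note are invented for illustration, not measurements, and `AgentCost` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AgentCost:
    system_prompt_tokens: int
    shared_context_tokens: int  # task and background docs copied to this agent
    working_tokens: int         # history, tool results, and output
    latency_s: float

    @property
    def total_tokens(self) -> int:
        return (self.system_prompt_tokens
                + self.shared_context_tokens
                + self.working_tokens)


def sequential_cost(agents: List[AgentCost]) -> Tuple[int, float]:
    """Tokens and wall-clock latency when agents run one after another."""
    return sum(a.total_tokens for a in agents), sum(a.latency_s for a in agents)


def parallel_cost(agents: List[AgentCost]) -> Tuple[int, float]:
    """Same token bill, but latency is set by the slowest agent."""
    return sum(a.total_tokens for a in agents), max(a.latency_s for a in agents)


def duplicated_shared_tokens(agents: List[AgentCost]) -> int:
    """Shared-context tokens paid beyond the single largest copy."""
    shared = [a.shared_context_tokens for a in agents]
    return sum(shared) - max(shared)
```

With three identical agents of 500 system-prompt tokens, 2,000 shared-context tokens, 1,500 working tokens, and 4 seconds each, the sequential pipeline uses 12,000 tokens and 12 seconds; running in parallel cuts latency to 4 seconds but leaves the token bill unchanged, and 4,000 of those tokens are duplicated shared context.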

Debugging complexity. A single agent has one conversation trace. A multi-agent system has multiple interleaved traces, and the interaction between agents can produce emergent behavior that is hard to reproduce and reason about. When something goes wrong, you have to figure out not just which agent made the mistake, but how the other agents' outputs contributed to the failure. Structured logging of every inter-agent message, with correlation IDs that link related calls, is essential.

Coordination overhead. Someone has to manage the flow: deciding which agent runs, passing context, handling failures, merging results. In developer-defined coordination, this is code you have to write and maintain. In agent-driven coordination, this is another model call that can itself fail.

Cascading errors. When Agent A's output feeds into Agent B, an error in A's output propagates into B and potentially compounds. In a single-agent system, the agent can sometimes catch and correct its own mistakes in the next iteration. In a multi-agent system, Agent B has no visibility into Agent A's reasoning — it takes Agent A's output at face value. Bad input cascades through the system like a game of telephone.

Single Agent Error Path:

  Task → [Agent] → error → self-correct → good output

Multi-Agent Error Path:

  Task → [Agent A] → bad output → [Agent B] → worse output → [Agent C] → ???

Mitigating cascading errors requires validation gates between agents — programmatic checks that verify the output of one agent before passing it to the next. This is the same gating concept from prompt chaining, applied at the agent boundary.
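
A validation gate can be a plain function that sits between two agents and refuses to pass bad output along. The sketch below is one possible shape; `gate`, `run_with_gates`, `GateError`, and the example checks are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[dict], bool]
AgentFn = Callable[[dict], dict]


class GateError(Exception):
    """Raised when an agent's output fails a check at the boundary."""


def gate(output: dict, checks: Dict[str, Check]) -> dict:
    """Return the output unchanged if every check passes, else raise."""
    failed = [name for name, check in checks.items() if not check(output)]
    if failed:
        raise GateError("failed checks: " + ", ".join(failed))
    return output


def run_with_gates(payload: dict,
                   stages: List[Tuple[AgentFn, Dict[str, Check]]]) -> dict:
    """Run agents in order, validating each output before the next agent sees it."""
    for agent, checks in stages:
        payload = gate(agent(payload), checks)
    return payload


# Example checks for a research agent's output (assumed fields).
RESEARCH_CHECKS: Dict[str, Check] = {
    "has_findings": lambda out: bool(out.get("findings")),
    "has_sources": lambda out: len(out.get("sources", [])) > 0,
}
```

Because the gate raises before the next stage runs, a research result with no sources never reaches the writer, and the failure is attributed to the agent that caused it instead of surfacing two steps later.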

Shared State and Memory

Agents in a multi-agent system need to share information, but how much they share — and through what mechanism — is a design decision with real consequences.

No shared state. Each agent operates independently, receiving its input as a message and producing its output as a message. The coordination layer handles all routing. This is the simplest model and the easiest to reason about. Each agent is a pure function: input in, output out. The downside is that context must be passed explicitly in every message, which bloats the input and wastes tokens on repeated information.

Shared artifact store. Agents read from and write to a shared storage layer — a document store, a database, a file system. The research agent writes its findings to a shared location, and the analysis agent reads them. This avoids duplicating large artifacts in message payloads and gives all agents access to the same ground truth. The risk is coordination bugs: Agent B reads the store before Agent A has finished writing, or two agents write conflicting updates.

Shared message pool. All inter-agent messages go into a shared pool. Each agent subscribes to messages tagged with its role or interest. This is the approach used by systems like MetaGPT, where agents publish structured messages and other agents consume the ones relevant to them. It reduces redundant communication but requires a well-designed tagging and subscription scheme.

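from datetime import datetime, timezone


def format_as_context(entries):
    """Render artifacts as plain text for a prompt.

    This is one simple format, shown for illustration; adapt it as needed.
    """
    return "\n\n".join(
        f"[{entry['type']} from {entry['agent']}]\n{entry['content']}"
        for entry in entries
    )


class SharedMemory:
    """A shared artifact store for multi-agent systems."""

    def __init__(self):
        self.artifacts = {}    # artifact_type -> list of entries
        self.message_log = []  # every entry, in publication order

    def publish(self, agent_name, artifact_type, content):
        """An agent publishes a result to the shared store."""
        entry = {
            "agent": agent_name,
            "type": artifact_type,
            "content": content,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        self.artifacts.setdefault(artifact_type, []).append(entry)
        self.message_log.append(entry)

    def retrieve(self, artifact_type):
        """Retrieve all artifacts of a given type."""
        return self.artifacts.get(artifact_type, [])

    def summary_for(self, agent_name, relevant_types):
        """Build a context summary for a specific agent."""
        relevant = []
        for artifact_type in relevant_types:
            for entry in self.artifacts.get(artifact_type, []):
                # Skip the agent's own artifacts; it already has them.
                if entry["agent"] != agent_name:
                    relevant.append(entry)
        return format_as_context(relevant)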

The summary_for method is important: it builds a filtered, formatted context for a specific agent, containing only artifacts of the types relevant to its role and skipping entries the agent itself published. This keeps each agent's context focused rather than dumping the entire shared state into every agent's prompt.
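
The coordination bugs mentioned for shared stores, such as two agents writing conflicting updates, are commonly handled with optimistic versioning. The sketch below is a general technique rather than something the store above implements; `VersionedStore` and `VersionConflict` are hypothetical names.

```python
import threading
from typing import Any, Dict, Tuple


class VersionConflict(Exception):
    """Raised when a write is based on a stale read."""


class VersionedStore:
    """Key-value store where each write must name the version it read."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: Dict[str, Tuple[int, Any]] = {}

    def read(self, key: str) -> Tuple[int, Any]:
        """Return (version, value); a missing key reads as version 0."""
        with self._lock:
            return self._data.get(key, (0, None))

    def write(self, key: str, value: Any, expected_version: int) -> int:
        """Write only if nobody else has written since expected_version."""
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                raise VersionConflict(
                    f"{key}: expected v{expected_version}, found v{current_version}"
                )
            new_version = current_version + 1
            self._data[key] = (new_version, value)
            return new_version
```

An agent that loses the race gets a `VersionConflict` and can re-read and retry, so a conflicting update is surfaced explicitly instead of silently overwriting the other agent's work.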

Design Principles

A few principles consistently separate multi-agent systems that work from ones that do not.

Start with one, split when needed. Build the simplest single-agent version first. Measure where it fails — context overflow, poor tool selection, inconsistent quality. Split only along the lines where the single agent is measurably struggling. This prevents over-engineering and ensures every agent in the system earns its existence.

Cohesive agents, loose coupling. Each agent should have a clear, focused responsibility. If you cannot describe what an agent does in one sentence, it is doing too much. Agents should communicate through well-defined interfaces (message schemas), not by sharing internal state. This makes agents independently testable and replaceable.

Validate at boundaries. Every time one agent's output becomes another agent's input, validate it. Check that required fields are present, that values are in expected ranges, that the format matches the schema. This catches cascading errors early, before they propagate through the system.

Make communication observable. Log every inter-agent message with a correlation ID that ties the entire task together. When debugging, you should be able to reconstruct the full message flow from the logs: which agent sent what to whom, in what order, and what the result was.

Prefer developer-defined coordination. Use hardcoded control flow (pipelines, fan-out/fan-in, conditional branches) for the overall structure. Reserve agent-driven coordination for the specific points where the task requires runtime flexibility. This gives you the predictability of workflows with the adaptability of agents, without fully committing to either.
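
The "make communication observable" principle can be sketched with the standard library alone. The record fields (`correlation_id`, `seq`, `from`, `to`) are assumptions chosen for illustration, and the in-memory `trace` list stands in for whatever log storage a real system uses.

```python
import json
import logging
import uuid
from typing import List

logger = logging.getLogger("multi_agent")


def new_correlation_id() -> str:
    """One ID per top-level task, shared by every message it triggers."""
    return uuid.uuid4().hex


def log_message(correlation_id: str, sender: str, receiver: str,
                content: str, trace: List[dict]) -> dict:
    """Append a structured record to the trace and emit it as one JSON line."""
    record = {
        "correlation_id": correlation_id,
        "seq": len(trace),  # position in the shared trace, used for ordering
        "from": sender,
        "to": receiver,
        "content": content,
    }
    trace.append(record)
    logger.info(json.dumps(record))
    return record


def reconstruct(trace: List[dict], correlation_id: str) -> List[str]:
    """Rebuild the ordered message flow for a single task."""
    related = [r for r in trace if r["correlation_id"] == correlation_id]
    return [f'{r["from"]} -> {r["to"]}' for r in sorted(related, key=lambda r: r["seq"])]
```

Filtering by correlation ID is what lets you pull one task's flow out of a log where many tasks are interleaved.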

Conclusion

Multi-agent systems split work across specialized agents to overcome the limits of a single agent — context window pressure, tool selection complexity, the need for different expertise, and the benefits of independent evaluation. But the architecture is not free: it adds latency, token cost, debugging complexity, and the risk of cascading errors.

Key takeaways:

  • Split into multiple agents when a single agent's context is overloaded, its system prompt is incoherent, or its tool set is too large for reliable selection — not because multi-agent sounds more sophisticated
  • Each agent is a self-contained unit with its own system prompt, tools, model, and memory — the same ReAct loop from a single-agent system, wrapped in a component boundary
  • Communication structure (direct, shared pool, hierarchical, peer-to-peer) determines debuggability, latency, and flexibility — prefer simpler structures unless the task demands otherwise
  • Coordination logic is either developer-defined (predictable, debuggable) or agent-driven (flexible, less predictable) — most production systems use developer-defined coordination for the overall flow
  • Cascading errors are the primary risk of multi-agent systems — validate at every agent boundary and log every inter-agent message
  • Agent profiling (pre-defined, model-generated, or data-derived) determines specialization — pre-defined profiles are the right default for production systems
  • Model routing across agents is a direct cost optimization: use small models for simple tasks, large models for hard ones
  • Start single-agent, measure failures, split along the failure lines — this prevents over-engineering and ensures every agent earns its place