Workflow Orchestration

Published: 24 May 2026

Many real tasks do not fit neatly into a single agent loop. Generating a marketing email might require a draft step, a compliance check, and a translation step, each with different prompts and different quality criteria. Answering a customer support ticket might require classifying the issue before routing it to specialized logic. Evaluating a legal contract might benefit from three independent reviews whose results are merged.

These are workflow problems. The individual steps still use language models and tools, but the structure that connects them — what runs first, what runs in parallel, what gets checked before the next step — is defined in code, not improvised by the model at runtime. That is the key distinction: in a workflow, the developer controls the topology. The model controls the reasoning within each node.

We will walk through the five workflow patterns that show up most in production systems: prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer. For each, we will look at the mechanics, the implementation, the trade-offs, and when it earns its complexity. At the end, we will cover the practical side — choosing between patterns, controlling costs, and combining them.

Workflows vs. Autonomous Agents #

Before diving into patterns, it is worth drawing the line between workflows and autonomous agents clearly. Both are "agentic" — they use language models to reason and act. The difference is who decides the execution path.

In a workflow, the developer defines the control flow in advance. Step A runs, then step B, then step C. Or: classify the input, then route to one of three branches. The model does the work within each step, but the structure — the sequence, the branches, the fan-out — is hardcoded. You can read the orchestration logic and know what will happen, in the same way you can read a pipeline definition in a CI system.

In an autonomous agent, the model decides what to do next. A ReAct loop is the simplest form: the model reasons, picks a tool, observes the result, and decides the next move. The developer provides tools and guardrails, but the execution path is emergent — it depends on what the model encounters at runtime.

Workflow (developer-controlled)     Autonomous agent (model-controlled)

  ┌───────┐                             ┌───────────┐
  │ Step A│                             │  Thought  │◀─────┐
  └───┬───┘                             └─────┬─────┘      │
      │                                       │            │
      ▼                                       ▼            │
  ┌───────┐                             ┌───────────┐      │
  │ Step B│                             │  Action   │      │
  └───┬───┘                             └─────┬─────┘      │
      │                                       │            │
      ▼                                       ▼            │
  ┌───────┐                             ┌───────────┐      │
  │ Step C│                             │ Observe   │──────┘
  └───────┘                             └───────────┘
  Fixed path.                           Path emerges at runtime.
  Developer decides structure.          Model decides structure.

Neither is inherently better. Workflows give you predictability, debuggability, and cost control. Autonomous agents give you flexibility for tasks where the path cannot be known in advance. In practice, most production systems use both: workflows for the overall structure, with autonomous agent loops inside individual steps when reasoning flexibility is needed.

The five patterns that follow are all workflows. They trade the model's freedom for the developer's control — and for tasks with known structure, that is usually the right trade. Understanding the distinction matters because it determines where you invest your engineering effort: in defining the topology (workflow) or in designing guardrails for emergent behavior (agent).

Prompt Chaining #

Prompt chaining is the simplest workflow pattern: break a task into a fixed sequence of steps, run each step as a separate model call, and pass the output of one step as the input to the next. Between steps, you can add programmatic checks — gates — that validate the intermediate output before the chain continues. If a gate fails, the chain stops early with a clear error instead of propagating a bad result through the remaining steps.

  Input
    │
    ▼
┌────────┐      ┌──────┐     ┌────────┐      ┌──────┐     ┌────────┐
│ Step 1 │────▶ │ Gate │────▶│ Step 2 │────▶ │ Gate │────▶│ Step 3 │
└────────┘      └──────┘     └────────┘      └──────┘     └────────┘
                                                            │
                                                            ▼
                                                         Output

The idea is dead simple. Instead of asking a model to do everything at once — "analyze this report, find the key risks, and draft a summary email in Spanish" — you split it into focused steps: analyze the report, then extract risks, then draft the email, then translate. Each model call has one job, which means a simpler prompt, less room for the model to drift, and easier debugging when something goes wrong.

Here is a concrete implementation. Suppose you need to generate a product description, verify it meets brand guidelines, and then translate it:

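import json

# call_model is assumed to be a helper that sends a prompt and returns the reply text.


class ChainError(Exception):
    """Raised when a gate rejects a step's output."""

    def __init__(self, message: str, step: str, output: str):
        super().__init__(f"{message} (step: {step})")
        self.step = step
        self.output = output


def chain_generate_description(product: dict) -> str: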
    # Step 1: Draft the description
    draft = call_model(
        prompt=f"Write a product description for:\n{json.dumps(product)}",
        max_tokens=500,
    )

    # Gate: check length and required sections
    if len(draft.split()) < 50:
        raise ChainError("Draft too short", step="draft", output=draft)
    if "features" not in draft.lower():
        raise ChainError("Missing features section", step="draft", output=draft)

    # Step 2: Brand compliance check
    review = call_model(
        prompt=f"""Review this product description for brand compliance.
Flag any issues. If compliant, respond with exactly "APPROVED".

Description:
{draft}""",
        max_tokens=300,
    )

    if review.strip() != "APPROVED":
        raise ChainError("Brand review failed", step="review", output=review)

    # Step 3: Translate
    translated = call_model(
        prompt=f"Translate this product description to Spanish:\n\n{draft}",
        max_tokens=600,
    )

    return translated

A few things to notice. Each step has its own prompt, tuned for exactly one job. The gates between steps are plain code — if statements, regex checks, word counts — not model calls. That makes them fast, deterministic, and free. When a gate fails, the error message tells you which step broke and what the intermediate output was, so debugging is straightforward.
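
Gates like these can be factored into small reusable checks. The following is a minimal sketch rather than a definitive implementation: the names `min_words`, `must_match`, and `run_gates` are hypothetical, and a gate is assumed to be a function that returns an error message or `None`.

```python
import re
from typing import Callable, Optional

# A gate inspects a step's output and returns an error message, or None if it passes.
Gate = Callable[[str], Optional[str]]


def min_words(n: int) -> Gate:
    """Gate that requires at least n whitespace-separated words."""
    def gate(text: str) -> Optional[str]:
        count = len(text.split())
        return None if count >= n else f"expected at least {n} words, got {count}"
    return gate


def must_match(pattern: str, label: str) -> Gate:
    """Gate that requires a case-insensitive regex match somewhere in the text."""
    regex = re.compile(pattern, re.IGNORECASE)

    def gate(text: str) -> Optional[str]:
        return None if regex.search(text) else f"missing {label}"
    return gate


def run_gates(text: str, gates: list[Gate]) -> list[str]:
    """Run every gate and collect all failure messages (empty list means pass)."""
    return [msg for g in gates if (msg := g(text)) is not None]


# Hypothetical gate set for a draft step.
draft_gates = [min_words(5), must_match(r"\bfeatures\b", "features section")]
print(run_gates("Key features: light, fast, and durable build.", draft_gates))
print(run_gates("Too short.", draft_gates))
```

Collecting every failure instead of stopping at the first one is a design choice: it costs nothing extra, since gates are plain code, and it gives a fuller picture when debugging a failed step.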

When to Use It #

Prompt chaining works when the task naturally decomposes into a fixed sequence of sub-tasks and each sub-task has a clear input-output contract. Good fits include:

Generate-then-review: draft content, then check it against criteria
Extract-then-transform: pull structured data from text, then reformat it
Multi-format pipelines: write in one language, translate to another, summarize

The pattern does not work well when the number of steps depends on the input, when steps need to run in parallel, or when the output of step 3 might require going back to step 1. Those cases call for the orchestrator-workers, parallelization, or evaluator-optimizer patterns, respectively. If you find yourself adding conditional branches or retry loops inside a chain, you have outgrown the pattern.

Trade-offs #

Latency scales linearly with the number of steps. A three-step chain takes at least three model round trips. For latency-sensitive applications, this is a real cost — and a reason to keep chains short.

Cost also scales linearly, though you can optimize individual steps by using smaller, cheaper models for simpler sub-tasks. A common pattern is to use a capable model for the creative step and a smaller model for the review or translation step, keeping total cost down while maintaining quality where it matters. We will look at per-step model selection in more detail when we discuss budgets and cost control.

Rigidity is both the strength and the weakness. The fixed sequence is easy to reason about, but if you discover a new use case that needs a different ordering, you build a new chain. Chains are not composable the way function calls are — they are pipelines, and pipelines are only as flexible as their topology.

Routing #

Routing classifies an input and directs it to one of several specialized downstream handlers. Instead of building one massive prompt that tries to handle every kind of request, you build a classifier that sorts inputs into categories and a set of focused handlers that each deal with one category well. This separation of concerns means you can tune each handler independently without worrying about regressions on other categories.

                    Input
                      │
                      ▼
               ┌─────────────┐
               │ Classifier  │
               └──────┬──────┘
                      │
          ┌───────────┼───────────┐
          ▼           ▼           ▼
    ┌──────────┐ ┌──────────┐ ┌──────────┐
    │ Handler  │ │ Handler  │ │ Handler  │
    │    A     │ │    B     │ │    C     │
    └──────────┘ └──────────┘ └──────────┘

The classifier can be a model call, but it does not have to be. A lightweight classifier — keyword matching, a small fine-tuned model, or an embedding-based similarity check — is often faster and cheaper. The key design decision is where to draw the category boundaries and whether the classifier can say "I don't know" (in which case you need a fallback handler).

CATEGORIES = {
    "billing": "You handle billing inquiries: refunds, charges, payment methods.",
    "technical": "You troubleshoot technical issues: bugs, errors, configuration.",
    "general": "You answer general questions about the product.",
}

def route(query: str) -> str:
    # Step 1: Classify
    category = call_model(
        prompt=f"""Classify this customer query into one category.
Categories: {list(CATEGORIES.keys())}
Respond with the category name only.

Query: {query}""",
        max_tokens=10,
    )
    category = category.strip().lower()

    if category not in CATEGORIES:
        category = "general"  # fallback

    # Step 2: Handle with specialized prompt
    handler_prompt = CATEGORIES[category]
    response = call_model(
        prompt=f"""{handler_prompt}

Customer query: {query}""",
        max_tokens=500,
    )

    return response
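
As noted above, the classifier does not have to be a model call. Here is a minimal keyword-matching sketch that can also say "I don't know". The keyword lists and the function names `classify_by_keywords` and `route_category` are made up for illustration; a real system would derive keywords from observed queries.

```python
from typing import Optional

# Hypothetical keyword lists, chosen only for this example.
KEYWORDS = {
    "billing": {"refund", "charge", "charged", "invoice", "payment"},
    "technical": {"error", "bug", "crash", "configure", "install"},
}


def classify_by_keywords(query: str) -> Optional[str]:
    """Return the category with the most keyword hits, or None if unsure."""
    words = {w.strip(".,!?;:").lower() for w in query.split()}
    scores = {cat: len(words & kws) for cat, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    ranked = sorted(scores.values(), reverse=True)
    # No hits, or a tie for first place: decline to classify.
    if ranked[0] == 0 or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return None
    return best


def route_category(query: str) -> str:
    # Fall back to the general handler when the classifier is unsure.
    return classify_by_keywords(query) or "general"
```

Returning `None` on ties keeps ambiguous queries out of a specialized handler that might answer them confidently but wrongly.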

Model Routing #

Routing is not limited to prompts. You can also route to different models. Easy questions go to a small, cheap model. Hard or ambiguous questions go to a larger, more capable one. This is cost optimization through classification — you pay the premium price only for inputs that need it.

def model_route(query: str) -> str:
    complexity = call_model(
        prompt=f"""Rate the complexity of this query: LOW, MEDIUM, or HIGH.
LOW = simple factual question or greeting.
MEDIUM = requires some reasoning or multi-step answer.
HIGH = ambiguous, multi-part, or requires nuanced judgment.

Query: {query}""",
        model="small-model",
        max_tokens=10,
    )

    complexity = complexity.strip().upper()  # tolerate lowercase or padded replies
    if "HIGH" in complexity:
        model = "large-model"
    elif "MEDIUM" in complexity:
        model = "medium-model"
    else:
        model = "small-model"

    return call_model(prompt=query, model=model, max_tokens=1000)

The classifier itself should be cheap. If the routing step costs as much as just running the large model, you have gained nothing. In practice, a lightweight classifier — a small model with a short prompt, or even a traditional ML classifier trained on labeled examples — keeps the routing overhead low enough that the savings on the handler side pay for it many times over.

When to Use It #

Routing shines when you have distinct categories of input that benefit from different handling. Customer support systems are the canonical example: billing, technical, and general queries need different tool access, different tone, and different domain knowledge. A single prompt that handles all three well is much harder to write and maintain than three focused prompts behind a router.

Routing also works for model selection in cost-sensitive systems. If 70% of your queries are simple and can be handled by a cheap model, routing saves real money compared to running every query through the expensive model. The key is that the classification overhead must be substantially cheaper than the cost difference between the cheap and expensive paths.
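
The break-even condition can be checked with simple arithmetic. The sketch below uses made-up unit prices and assumes every query costs the same on a given model; the function names are hypothetical.

```python
def cost_per_query(simple_share: float, cheap: float, expensive: float,
                   classifier: float) -> tuple[float, float]:
    """Return (average cost with routing, cost without routing) per query."""
    routed = classifier + simple_share * cheap + (1 - simple_share) * expensive
    return routed, expensive


def routing_pays_off(simple_share: float, cheap: float, expensive: float,
                     classifier: float) -> bool:
    """Routing saves money only if the classifier costs less than it saves."""
    return classifier < simple_share * (expensive - cheap)


# Made-up unit prices, purely for illustration.
routed, baseline = cost_per_query(simple_share=0.7, cheap=1.0,
                                  expensive=10.0, classifier=0.5)
savings = 1 - routed / baseline  # about 0.58 with these numbers
```

With these numbers the router pays for itself comfortably. If the classifier cost rose above 6.3 units (70% of the 9-unit price gap), routing would cost more than sending everything to the expensive model.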

Trade-offs #

Classification accuracy is the bottleneck. A misrouted query gets the wrong handler, and the user sees a confused or irrelevant response. You need to monitor misclassification rates and tune the classifier over time. Edge cases — queries that straddle two categories — are the most common source of errors.

Maintenance cost grows with the number of categories. Each handler has its own prompt and possibly its own tool set, which means more code to maintain. If you find yourself with fifteen routing categories, you may have over-specialized — a smaller number of broader categories with more capable handlers often performs better and is easier to manage.

Parallelization #

Some tasks break into independent subtasks that can run at the same time. Parallelization fans out model calls across these subtasks and merges the results. This gives you either speed (multiple steps running concurrently) or quality (multiple perspectives on the same input).

There are two variants: sectioning, which divides a task into independent pieces that run concurrently, and voting, which runs the same task multiple times for higher confidence.

Sectioning #

Sectioning splits a task into independent pieces, runs each piece as a separate model call in parallel, and aggregates the results. The subtasks do not depend on each other, so they can execute simultaneously. This is the workflow equivalent of map-reduce: fan out, do the work, collect the results.

             Input
               │
      ┌────────┼────────┐
      ▼        ▼        ▼
  ┌───────┐┌───────┐┌───────┐
  │ Sub-  ││ Sub-  ││ Sub-  │
  │ task 1││ task 2││ task 3│
  └───┬───┘└───┬───┘└───┬───┘
      │        │        │
      └────────┼────────┘
               ▼
          ┌─────────┐
          │ Combine │
          └─────────┘

import asyncio

async def parallel_review(document: str) -> dict:
    tasks = [
        call_model_async(
            prompt=f"Check this document for factual accuracy:\n\n{document}"
        ),
        call_model_async(
            prompt=f"Check this document for tone and readability:\n\n{document}"
        ),
        call_model_async(
            prompt=f"Check this document for legal compliance:\n\n{document}"
        ),
    ]

    results = await asyncio.gather(*tasks)

    return {
        "accuracy": results[0],
        "readability": results[1],
        "compliance": results[2],
    }

Each reviewer focuses on one dimension. The prompts are simpler and shorter than a single "review everything" prompt, and each model call can give its full attention to one concern. The results are combined programmatically — in this case, into a dictionary, but it could also be a model call that synthesizes the individual reviews into a coherent summary.

Voting #

Voting runs the same task multiple times — often with the same prompt but different temperature settings or slightly different phrasing — and aggregates the results. The goal is confidence through diversity: if three out of four runs agree, you can be more confident in that answer than in any single run. This is especially useful for classification tasks, where the model might be uncertain and a majority vote smooths out the noise.

import asyncio
from collections import Counter


async def vote_on_classification(text: str, n: int = 5) -> str:
    # Nonzero temperature gives the runs a chance to disagree.
    tasks = [
        call_model_async(
            prompt=f"Classify this text as POSITIVE, NEGATIVE, or NEUTRAL:\n\n{text}",
            temperature=0.7,
        )
        for _ in range(n)
    ]

    results = await asyncio.gather(*tasks)
    votes = [r.strip().upper() for r in results]

    # Majority vote: the winner needs more than half of all votes.
    winner, count = Counter(votes).most_common(1)[0]

    return winner if count > n // 2 else "UNCERTAIN"

Voting is expensive — you are paying for n model calls instead of one. It is worth it when the cost of a wrong answer is high and the task is one where the model sometimes gets it right and sometimes wrong (as opposed to systematically wrong, where more votes will not help). Before adopting voting, run a quick experiment to check that the model's errors are actually random rather than consistent — if it gets the same wrong answer every time, five copies of the same mistake will not help.
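
One way to run that experiment is to repeat the task several times on a handful of labeled examples and see how the errors are distributed. This is a rough sketch; the function names and the three result labels are assumptions made for illustration.

```python
from collections import Counter


def diagnose(runs: list[str], gold: str) -> str:
    """Classify the error pattern for one labeled example across repeated runs."""
    if not runs:
        raise ValueError("need at least one run")
    correct = sum(1 for r in runs if r == gold)
    if correct == len(runs):
        return "always_right"
    if correct == 0:
        return "always_wrong"  # systematic error: more votes will not help
    return "mixed"             # noisy error: a majority vote may help


def summarize(all_runs: list[list[str]], gold_labels: list[str]) -> Counter:
    """Count how many examples fall into each error pattern."""
    return Counter(diagnose(runs, gold) for runs, gold in zip(all_runs, gold_labels))
```

If most failing examples land in `always_wrong`, the fix is a better prompt or a better model rather than more votes.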

When to Use It #

Parallelization is effective when the subtasks are genuinely independent. If subtask 2 depends on the result of subtask 1, you cannot parallelize them — you need chaining instead. Good candidates include:

Multi-aspect evaluation: checking a document against several independent criteria
Content guardrails: running a safety check in parallel with the main generation so the guard does not add latency to the happy path
Fan-out search: querying multiple data sources simultaneously and combining results

Trade-offs #

Cost multiplies linearly with the fan-out factor. Three parallel branches cost three times as much as a single call. For voting, this is pure redundancy cost — you are buying confidence.

Aggregation complexity depends on the task. Merging three text reviews is straightforward. Merging three partially overlapping code edits is not. If the subtasks are truly independent, aggregation is easy. If they overlap, you need conflict resolution logic — which can require its own model call.

Latency is determined by the slowest branch — you wait for all parallel calls to finish before aggregating. This is still faster than running them sequentially, but one slow branch holds everything up. Adding a per-branch timeout ensures that a single straggler does not bottleneck the entire workflow.
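
A per-branch timeout can be sketched with `asyncio.wait_for`. In this example `fake_branch` is a stand-in for a real model call, and the choice to return `None` for a timed-out branch is an assumption; a real workflow might substitute a default or retry instead.

```python
import asyncio
from typing import Awaitable, Optional


async def with_timeout(branch: Awaitable[str], seconds: float) -> Optional[str]:
    """Return the branch result, or None if it does not finish in time."""
    try:
        return await asyncio.wait_for(branch, timeout=seconds)
    except asyncio.TimeoutError:
        return None


async def fake_branch(name: str, delay: float) -> str:
    # Stand-in for a model call; the delay simulates latency.
    await asyncio.sleep(delay)
    return f"{name} done"


async def review_with_timeouts() -> list[Optional[str]]:
    branches = [
        with_timeout(fake_branch("accuracy", 0.01), seconds=0.2),
        with_timeout(fake_branch("tone", 0.01), seconds=0.2),
        with_timeout(fake_branch("compliance", 1.0), seconds=0.2),  # straggler
    ]
    # gather preserves order, so results line up with the branches above.
    return await asyncio.gather(*branches)
```

Because `wait_for` cancels the branch it gives up on, the straggler stops consuming resources once the timeout fires. The aggregation step then has to decide how to handle the missing result.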

Orchestrator-Workers #

The orchestrator-workers pattern adds a layer of dynamic planning on top of parallelization. Instead of hardcoding the subtasks, a central model — the orchestrator — analyzes the input, decides what subtasks are needed, delegates them to worker model calls, and synthesizes the results. Think of it as the plan-and-execute pattern from the planning article, but with the execution phase fanned out in parallel rather than run sequentially.

               Input
                 │
                 ▼
        ┌────────────────┐
        │  Orchestrator  │
        │  (plan tasks)  │
        └────────┬───────┘
                 │
      ┌──────────┼─────────┐
      ▼          ▼         ▼
  ┌────────┐┌────────┐┌────────┐
  │Worker 1││Worker 2││Worker 3│
  └────┬───┘└────┬───┘└────┬───┘
       │         │         │
       └─────────┼─────────┘
                 ▼
          ┌──────────────┐
          │ Orchestrator │
          │ (synthesize) │
          └──────────────┘

This looks like parallelization, and the execution phase often is parallel. The difference is that the set of subtasks is not fixed in code — the orchestrator generates them dynamically based on the input. This makes the pattern suitable for tasks where the work breakdown depends on the specific request.

import asyncio
import json


def orchestrator_workers(task: str, tools: list[dict]) -> str:
    # Phase 1: Plan
    plan = call_model(
        prompt=f"""Break this task into independent subtasks.
Return a JSON array of subtask descriptions.
Each subtask should be self-contained and executable independently.

Task: {task}
Available tools: {[t['name'] for t in tools]}""",
        max_tokens=500,
    )
    # Assumes the model returns bare JSON; json.loads raises ValueError otherwise.
    subtasks = json.loads(plan)

    # Phase 2: Execute (parallel)
    async def run_worker(subtask: str) -> str:
        return await call_model_async(
            prompt=f"Complete this subtask:\n{subtask}",
            tools=tools,
            max_tokens=1000,
        )

    async def run_all_workers() -> list[str]:
        return await asyncio.gather(*[run_worker(st) for st in subtasks])

    # asyncio.run starts a new event loop, so call this function from synchronous code only.
    results = asyncio.run(run_all_workers())

    # Phase 3: Synthesize
    synthesis = call_model(
        prompt=f"""Original task: {task}
Subtask results:
{json.dumps(dict(zip(subtasks, results)), indent=2)}

Synthesize these results into a coherent response.""",
        max_tokens=1000,
    )

    return synthesis

The orchestrator makes two model calls — one to plan, one to synthesize — and the workers run in between. The total cost is 2 + N model calls for N subtasks, plus whatever tool calls the workers make. For complex tasks where the fan-out is large, this cost can climb quickly, so you may want to cap the maximum number of workers.

When to Use It #

Orchestrator-workers is the right pattern when the subtask breakdown is not predictable from the task type alone. Coding agents are the canonical example: given "refactor the authentication module to use JWT tokens," the orchestrator figures out which files need to change and how, then delegates each file change to a worker. The number of workers and their specific tasks depend entirely on the codebase and the request.

Search tasks are another good fit: the orchestrator identifies what information is needed, workers query different sources in parallel, and the orchestrator merges the findings. In both cases, the key property is that you cannot enumerate the subtasks at design time — the input determines the work.

Trade-offs #

Quality of the plan matters enormously. If the orchestrator misses a subtask or creates overlapping subtasks, the workers will produce incomplete or conflicting results. You can mitigate this by validating the plan — checking for coverage, checking that subtasks are independent — but the plan is only as good as the orchestrator's understanding of the task.
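
Some of that validation can be done in plain code before any worker runs. The sketch below covers only structural checks (valid JSON, non-empty string subtasks, exact duplicates, a worker cap); judging coverage or semantic overlap would need more than this. The name `validate_plan` and the default cap of five are assumptions.

```python
import json


def validate_plan(raw_plan: str, max_workers: int = 5) -> list[str]:
    """Parse the orchestrator's plan and apply basic structural checks."""
    try:
        plan = json.loads(raw_plan)
    except json.JSONDecodeError as exc:
        raise ValueError(f"plan is not valid JSON: {exc}") from exc

    if not isinstance(plan, list) or not plan:
        raise ValueError("plan must be a non-empty JSON array")

    subtasks: list[str] = []
    seen: set[str] = set()
    for item in plan:
        if not isinstance(item, str) or not item.strip():
            raise ValueError(f"subtask must be a non-empty string: {item!r}")
        key = " ".join(item.lower().split())  # normalize case and whitespace
        if key not in seen:  # drop exact duplicates, a crude overlap check
            seen.add(key)
            subtasks.append(item.strip())

    if len(subtasks) > max_workers:
        raise ValueError(f"plan has {len(subtasks)} subtasks, cap is {max_workers}")
    return subtasks
```

Rejecting an oversized plan, rather than silently truncating it, avoids running a partial plan whose synthesis step would then be missing pieces.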

Cost is higher. Two orchestrator calls plus N worker calls adds up. For simple tasks, the planning overhead is not worth it — prompt chaining with hardcoded steps is cheaper and more predictable.

Synthesis is non-trivial. Merging results from multiple workers is easy when the outputs are independent (combine three separate analyses into one report). It is hard when workers produce overlapping or contradictory information. The synthesis step needs enough context to resolve conflicts, which means feeding it all the worker outputs — and that can push against context window limits for large fan-outs.

Evaluator-Optimizer #

The evaluator-optimizer pattern creates a refinement loop: one model call generates a response, another evaluates it, and the feedback drives another generation round. This repeats until the evaluator is satisfied or a maximum iteration limit is reached. It mimics how a human writer works with an editor — draft, get feedback, revise, repeat.

  Input
    │
    ▼
┌───────────┐
│ Generator │◀─────────────────┐
└─────┬─────┘                  │
      │                        │
      ▼                 ┌──────┴──────┐
┌───────────┐    No     │  Feedback   │
│ Evaluator │──────────▶│  (specific  │
└─────┬─────┘           │  critique)  │
      │                 └─────────────┘
      │ Yes (approved)
      ▼
   Output

def evaluate_and_optimize(
    task: str,
    max_rounds: int = 3,
) -> str:
    draft = call_model(
        prompt=f"Complete this task:\n{task}",
        max_tokens=1000,
    )

    for round_num in range(max_rounds):
        evaluation = call_model(
            prompt=f"""Evaluate this output for the given task.
If the output is satisfactory, respond with exactly "APPROVED".
If not, provide specific, actionable feedback for improvement.

Task: {task}
Output: {draft}""",
            max_tokens=500,
        )

        if evaluation.strip() == "APPROVED":
            return draft

        # Refine based on feedback
        draft = call_model(
            prompt=f"""Improve this output based on the feedback.

Task: {task}
Current output: {draft}
Feedback: {evaluation}""",
            max_tokens=1000,
        )

    return draft  # return best effort after max rounds

The critical element is the quality of the feedback. "This is not good enough" is useless. "The second paragraph contradicts the data in the first table — reconcile the revenue figures" gives the generator something to act on. The evaluator prompt should be designed to produce specific, actionable critique.

When to Use It #

This pattern fits tasks where iterative refinement is natural and where you have clear quality criteria. Good candidates include:

Writing tasks: draft, review for tone and accuracy, revise
Code generation: generate code, run tests, fix failures based on error messages
Translation: translate, back-translate to check fidelity, revise
Complex search: gather initial results, evaluate completeness, search for gaps

The pattern works best when the evaluator can meaningfully assess quality and the generator can meaningfully improve based on feedback. If the generator consistently produces the same output regardless of feedback, additional rounds are wasted cost. A quick diagnostic: run three rounds and compare the outputs — if rounds two and three are nearly identical, the loop is not buying you anything.
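
That diagnostic can be automated with a text-similarity check from the standard library. This is a sketch only: the function name and the 0.95 threshold are arbitrary assumptions that would need tuning on real outputs.

```python
from difflib import SequenceMatcher


def loop_has_stalled(previous: str, current: str, threshold: float = 0.95) -> bool:
    """True when two consecutive drafts are nearly identical."""
    return SequenceMatcher(None, previous, current).ratio() >= threshold
```

Called on the drafts from consecutive rounds, a `True` result suggests the feedback is no longer changing the output and the loop can stop early.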

Trade-offs #

Cost scales with iterations. Each round is at least two model calls (evaluate + revise), on top of the initial draft. Three full rounds means seven calls for what might have been one. Set a reasonable maximum and track how many rounds typical tasks need. If most tasks are approved on the first pass, the evaluator is adding cost without adding value.

Diminishing returns. The biggest quality jump is usually from round one to round two. By round three or four, improvements are marginal. Unbounded loops are a cost trap.

Evaluator-generator mismatch. If the evaluator uses the same model as the generator, it may have the same blind spots. Using a different model or a different prompting strategy for the evaluator can help. In some cases, the evaluator could be a code-based check (does the generated code pass tests? does the output parse as valid JSON?). These deterministic evaluators are faster, cheaper, and more reliable than model-based ones.
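
A code-based evaluator can return the same "APPROVED" signal or a specific critique, so it drops into the loop in place of the evaluator model call. The sketch below checks JSON output against a hypothetical set of required keys; `REQUIRED_KEYS` and `evaluate_json_output` are illustrative names.

```python
import json

REQUIRED_KEYS = {"title", "summary"}  # hypothetical schema for illustration


def evaluate_json_output(output: str) -> str:
    """Return "APPROVED" or a specific critique the generator can act on."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return (f"Output is not valid JSON: {exc.msg} "
                f"at line {exc.lineno}, column {exc.colno}.")
    if not isinstance(data, dict):
        return "Output must be a JSON object, not " + type(data).__name__ + "."
    missing = sorted(REQUIRED_KEYS - set(data))
    if missing:
        return "Missing required keys: " + ", ".join(missing) + "."
    return "APPROVED"
```

Each failure message names the exact problem, which gives the generator the kind of actionable feedback the loop depends on.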

Practical Considerations #

Patterns are only useful if you can choose between them, control their costs, and combine them without creating a mess. This section covers all three.

Choosing a Pattern #

There is no universal "best" pattern. The right choice depends on the task structure, the predictability of the steps, and the acceptable cost.

Pattern	Control flow	Steps known in advance?	Typical cost	Best for
Prompt chaining	Sequential	Yes	N calls	Fixed multi-step pipelines
Routing	Branching	Yes (categories known)	1 classifier + 1 handler	Input-dependent specialization
Parallelization	Fan-out/merge	Yes	N parallel calls	Independent multi-aspect tasks
Orchestrator-workers	Dynamic fan-out	No (planned at runtime)	2 + N calls	Unpredictable work breakdown
Evaluator-optimizer	Loop	No (iterations vary)	1 + 2 × rounds	Quality through refinement

A few guidelines for choosing:

Start with prompt chaining. If the steps are known and sequential, chaining is the simplest and cheapest option. Only add routing, parallelization, or dynamic orchestration when the task genuinely requires it.

Use routing when specialization pays off. If a single prompt handles all your cases adequately, routing adds complexity for no benefit. If different categories need different tools, different models, or significantly different instructions, routing earns its place.

Use parallelization when subtasks are independent. The test is simple: can subtask B run without the result of subtask A? If yes, parallelize. If no, chain.

Use orchestrator-workers when the work breakdown is dynamic. If you cannot enumerate the subtasks in advance because they depend on the specific input, the orchestrator pattern is the right fit.

Use evaluator-optimizer when quality is more important than speed. If the task has clear quality criteria and iterative improvement is feasible, the refinement loop can substantially improve output quality — at the cost of latency and token spend.

Budgets and Cost Control #

Every workflow pattern multiplies the number of model calls compared to a single-shot approach. A three-step chain makes three calls. A parallel fan-out with an orchestrator makes N + 2. An evaluator-optimizer loop adds two calls per round. These costs accumulate fast, and without explicit budgeting, a runaway workflow can burn through your API budget on a single request.

Token Budgets #

The simplest form of cost control is a per-request token budget. You allocate a maximum number of total tokens (input + output) for the entire workflow and track usage at each step. When the budget is exhausted, the workflow returns the best result it has so far rather than continuing.

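class BudgetExhausted(Exception):
    """Raised when a workflow has used more tokens than its budget allows."""

    def __init__(self, used: int, limit: int):
        super().__init__(f"token budget exhausted: used {used} of {limit}")
        self.used = used
        self.limit = limit


class TokenBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def consume(self, tokens: int) -> None:
        self.used += tokens
        if self.used > self.max_tokens:
            raise BudgetExhausted(used=self.used, limit=self.max_tokens)

    @property
    def remaining(self) -> int:
        return max(0, self.max_tokens - self.used)


def chained_workflow(task: str, budget: TokenBudget) -> str:
    # Here call_model is assumed to return a response object with .text and
    # .usage.total_tokens, rather than a plain string as in the earlier examples.
    draft = call_model(prompt=f"Draft: {task}", max_tokens=min(500, budget.remaining))
    best = draft.text
    try:
        budget.consume(draft.usage.total_tokens)

        review = call_model(prompt=f"Review: {draft.text}", max_tokens=min(300, budget.remaining))
        budget.consume(review.usage.total_tokens)

        if budget.remaining < 100:
            return best  # not enough budget for refinement

        final = call_model(prompt=f"Refine: {draft.text}\nFeedback: {review.text}",
                           max_tokens=min(500, budget.remaining))
        best = final.text
        budget.consume(final.usage.total_tokens)
    except BudgetExhausted:
        pass  # budget overrun: stop here and return the best result so far

    return best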

Step Limits #

Token budgets control cost. Step limits control runaway execution. Every loop-based pattern — evaluator-optimizer, ReAct loops inside workers — should have an explicit maximum iteration count. Without it, a poorly performing model can loop indefinitely, burning tokens on diminishing or non-existent improvements.

MAX_EVAL_ROUNDS = 3
MAX_WORKER_STEPS = 8
MAX_TOTAL_MODEL_CALLS = 15  # hard cap for the entire workflow

These limits should be tuned based on observation. Guessing does not work here. Run the workflow on representative inputs, track how many steps typical tasks need, and set the limit slightly above the observed maximum. Logs and metrics matter — you cannot tune what you do not measure.
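
A hard cap on total model calls can be enforced with a small counter that every step checks before calling the model. This is a minimal sketch; `CallCounter`, `CallLimitExceeded`, and the toy loop `refine_with_cap` are hypothetical names, and the loop only marks where the model calls would go.

```python
class CallLimitExceeded(RuntimeError):
    """Raised when a workflow tries to exceed its model-call cap."""


class CallCounter:
    """Counts model calls across a workflow and enforces a hard cap."""

    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.calls = 0

    def check(self) -> None:
        """Call this immediately before each model call."""
        if self.calls >= self.max_calls:
            raise CallLimitExceeded(f"hit the cap of {self.max_calls} model calls")
        self.calls += 1


def refine_with_cap(counter: CallCounter, max_rounds: int) -> int:
    """Toy refinement loop: returns the number of full rounds completed."""
    rounds = 0
    try:
        for _ in range(max_rounds):
            counter.check()  # the evaluate call would go here
            counter.check()  # the revise call would go here
            rounds += 1
    except CallLimitExceeded:
        pass  # stop early and keep whatever result exists so far
    return rounds
```

Sharing one counter across nested patterns means the cap holds for the whole request, even when an inner loop has its own round limit.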

Model Selection per Step #

Not every step needs the most capable (and most expensive) model. A prompt chain might use a large model for the creative generation step and a small model for the classification, validation, or translation steps. This is cost optimization at the workflow level — you match model capability to step difficulty.

STEP_MODELS = {
    "classify": {"model": "small-model", "cost_per_1k": 0.01},
    "generate": {"model": "large-model", "cost_per_1k": 0.15},
    "review":   {"model": "small-model", "cost_per_1k": 0.01},
    "translate": {"model": "medium-model", "cost_per_1k": 0.05},
}

The savings add up. If 60% of your workflow's model calls are classification or validation steps, using a small model for those steps can cut total cost by 40% or more with negligible quality impact. The table earlier maps patterns to their cost profiles — use it as a starting point for estimating what a composite workflow will cost.
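
To estimate the effect for a specific workflow, multiply each step's token usage by its model's price. The sketch below reuses the prices from the `STEP_MODELS` example; the token counts are hypothetical, and the comparison assumes a step uses the same number of tokens on any model.

```python
# Prices copied from the STEP_MODELS example above (cost per 1,000 tokens).
STEP_MODELS = {
    "classify": {"model": "small-model", "cost_per_1k": 0.01},
    "generate": {"model": "large-model", "cost_per_1k": 0.15},
    "review": {"model": "small-model", "cost_per_1k": 0.01},
    "translate": {"model": "medium-model", "cost_per_1k": 0.05},
}


def workflow_cost(tokens_per_step: dict[str, int]) -> float:
    """Total cost when each step runs on its assigned model."""
    return sum(
        tokens / 1000 * STEP_MODELS[step]["cost_per_1k"]
        for step, tokens in tokens_per_step.items()
    )


def single_model_cost(tokens_per_step: dict[str, int], cost_per_1k: float) -> float:
    """Total cost when every step runs on one model at the given price."""
    return sum(tokens_per_step.values()) / 1000 * cost_per_1k


# Hypothetical token counts for one request.
usage = {"classify": 200, "generate": 1500, "review": 800, "translate": 1200}
mixed = workflow_cost(usage)
all_large = single_model_cost(usage, cost_per_1k=0.15)
print(f"mixed: {mixed:.3f}, all large: {all_large:.3f}, saved: {1 - mixed / all_large:.0%}")
```

The saving depends on how the tokens are distributed, not just the share of calls: a workflow whose generation step dominates token usage will save less than the call counts suggest.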

Combining Patterns #

Real systems rarely use a single pattern in isolation. A customer support system might use routing to classify the query, prompt chaining within each handler for multi-step resolution, and parallelization for running a safety check alongside the main response. A coding agent might use an orchestrator-worker pattern to plan file changes, with an evaluator-optimizer loop inside each worker to iterate until the tests pass.

         Customer Query
              │
              ▼
         ┌──────────┐
         │  Router  │
         └────┬─────┘
              │
   ┌──────────┴──┬────────────┐
   ▼             ▼            ▼
Billing      Technical     General
   │             │            │
   ▼             ▼            ▼
 Chain:       Chain:        Chain:
 lookup      diagnose       search
 → check      → fix         → answer
 → refund     → verify

The key to combining patterns without creating a mess is to keep each pattern at a single level of abstraction. The router does not know about the chain inside each handler. The chain does not know it is being invoked by a router. Each component has a clean input-output interface, and the orchestration logic at each level is simple enough to fit in one screen of code.
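
One way to keep those interfaces clean is to give every component the same shape: a function from text to text. The sketch below composes a router over chains under that assumption. `make_chain`, `make_router`, and the stub steps are hypothetical, and the stubs stand in for model calls.

```python
from typing import Callable

Handler = Callable[[str], str]


def make_chain(*steps: Handler) -> Handler:
    """Compose steps into one handler; each step receives the previous output."""
    def chain(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return chain


def make_router(classify: Callable[[str], str], handlers: dict[str, Handler],
                fallback: str) -> Handler:
    """Build a handler that dispatches on the classifier's label."""
    def router(text: str) -> str:
        return handlers.get(classify(text), handlers[fallback])(text)
    return router


# Stub steps standing in for model calls.
def lookup(q: str) -> str:
    return q + " | lookup"


def check(q: str) -> str:
    return q + " | check"


def search(q: str) -> str:
    return q + " | search"


support = make_router(
    classify=lambda q: "billing" if "refund" in q.lower() else "other",
    handlers={"billing": make_chain(lookup, check), "general": make_chain(search)},
    fallback="general",
)
```

The router only sees handlers, and each chain only sees steps, so either level can be tested or replaced without touching the other.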

When composition gets deep — a router that invokes an orchestrator that spawns workers that each run evaluator-optimizer loops — the system becomes hard to debug and expensive to run. A good rule of thumb: if you cannot draw the entire workflow on a whiteboard, it is too complex. Flatten it, merge steps, or accept that some quality loss from simplification is worth the operational clarity.

Conclusion #

Workflow orchestration is about assembling model calls into structures that are more capable than any single call. The patterns are few — chaining, routing, parallelization, orchestration, and refinement loops — but choosing and combining them well is where the engineering challenge lies. The goal is always the same: get better results than a single prompt, without spending more on complexity than you gain in quality.

Key takeaways:

Workflows give the developer control of the execution path; autonomous agents give it to the model — use workflows when the task structure is predictable
Prompt chaining is the default starting point: a fixed sequence of focused model calls with validation gates between them
Routing trades one general-purpose prompt for a classifier plus specialized handlers, which works when distinct input categories benefit from different treatment
Parallelization runs independent subtasks concurrently for speed or runs the same task multiple times for confidence — but cost scales linearly with fan-out
Orchestrator-workers add dynamic planning on top of parallelization, at the cost of plan quality becoming a new failure mode
Evaluator-optimizer loops improve quality through iterative refinement, but with diminishing returns per round — always set a maximum iteration limit
Budget every workflow with token caps, step limits, and per-step model selection to prevent runaway cost
Combine patterns at clean abstraction boundaries — if you cannot draw the full workflow on a whiteboard, simplify
