Loop, Review-and-Critique, and Iterative Refinement
The default way to use a model is single-pass generation: the agent reasons, produces output, and the output goes somewhere. Done. That assumption breaks down the moment quality matters. Code that compiles is not the same as code that is correct. A summary that is coherent is not the same as one that is accurate. A plan that sounds reasonable is not the same as one that is safe. Single-pass generation produces plausible output; it does not reliably produce good output.
The review-and-critique pattern introduces a feedback loop: generate output, evaluate it against explicit quality criteria, revise based on that evaluation, and repeat. This is how humans produce quality work too — we draft, review, revise. The same rhythm, applied to agent output.
The Generate-Critique-Revise Loop
The basic structure has three roles: a generator, a critic, and a loop controller.
The generator produces the initial output and each subsequent revision. The critic evaluates the current output and returns structured feedback. The loop controller decides whether to iterate again or accept the current output. These roles can be played by one agent, two agents, or three — the architecture varies with the use case.
┌─────────────────────┐
│      User Task      │
└──────────┬──────────┘
           │
           ▼
   ┌───────────────┐
   │   Generator   │◄─────────────┐
   │               │              │
   │   Produces    │              │
   │   output v1   │              │
   └───────┬───────┘              │
           │                      │
           ▼                      │ Revised
   ┌───────────────┐       task + │ output
   │    Critic     │     critique │
   │               │              │
   │   Evaluates   │              │
   │    against    │              │
   │   criteria    │              │
   └───────┬───────┘              │
           │                      │
           ▼                      │
   ┌───────────────┐              │
   │Loop Controller│─── iterate? ─┘
   │               │
   │   Accept or   │──── accept ────► Final Output
   │     retry     │
   └───────────────┘
Each pass through the loop is one iteration. The generator sees the original task plus the critic's feedback from the previous iteration. It revises, carrying forward everything that was already acceptable and fixing what was not. The critic evaluates the revised output fresh, without memory of the previous round. This keeps the critique independent: if the revision introduced a new problem, the critic will catch it regardless of whether the previous version had that problem.
Self-Critique
The simplest version of this loop uses a single model in two different roles. In one call, the model acts as the generator. In the next, it is shown its own output and asked to evaluate it. In the third, it revises based on the critique. The "two agents" are actually one model with two different system prompts.
GENERATOR_PROMPT = """You are a technical writer. Given a task, produce a
clear, accurate, and concise response. Write for an engineer audience.
"""
CRITIC_PROMPT = """You are a critical reviewer. You will be given:
1. An original task description
2. A draft response
Evaluate the draft on these dimensions:
- Accuracy: are all technical claims correct?
- Completeness: does it cover all required aspects of the task?
- Clarity: is it clear and unambiguous?
- Conciseness: is there unnecessary length or repetition?
For each dimension, rate it as: PASS, WARN, or FAIL.
Then provide specific, actionable suggestions for any WARN or FAIL dimensions.
Return a JSON object with this structure:
{
  "scores": {
    "accuracy": "PASS|WARN|FAIL",
    "completeness": "PASS|WARN|FAIL",
    "clarity": "PASS|WARN|FAIL",
    "conciseness": "PASS|WARN|FAIL"
  },
  "suggestions": ["...", "..."],
  "approved": true|false
}
Set approved to true only if all dimensions are PASS or WARN with minor issues.
"""
# call_model and parse_json are placeholders for a model client and a JSON parser.
def self_critique_loop(task, model, max_iterations=3):
    # Generator role: produce the initial draft.
    draft = call_model(
        model=model,
        system=GENERATOR_PROMPT,
        messages=[{"role": "user", "content": task}],
    ).text

    for iteration in range(max_iterations):
        # Critic role: same model, different system prompt.
        critique = call_model(
            model=model,
            system=CRITIC_PROMPT,
            messages=[{
                "role": "user",
                "content": f"Task:\n{task}\n\nDraft:\n{draft}",
            }],
        ).text
        result = parse_json(critique)

        # Accept on approval, or when there is nothing concrete to act on.
        if result["approved"] or not result["suggestions"]:
            return draft

        revision_prompt = (
            f"Original task:\n{task}\n\n"
            f"Your previous draft:\n{draft}\n\n"
            f"Reviewer feedback:\n"
            + "\n".join(f"- {s}" for s in result["suggestions"])
            + "\n\nRevise the draft to address this feedback. "
            "Keep everything that was correct. Only change what needs fixing."
        )
        # Generator role again: revise against the feedback.
        draft = call_model(
            model=model,
            system=GENERATOR_PROMPT,
            messages=[{"role": "user", "content": revision_prompt}],
        ).text

    # Budget exhausted: return the last revision, which has not been re-critiqued.
    return draft
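The loop above leaves `parse_json` undefined. As a minimal sketch, assuming the critic replies with the JSON shape requested in `CRITIC_PROMPT` but sometimes wraps it in extra prose, a parser along these lines would extract the object and recompute approval instead of trusting the flag alone. The constant `REQUIRED_DIMENSIONS` and the decision to raise `ValueError` on malformed replies are assumptions for illustration, not part of the original design.

```python
import json
import re

# Dimensions named in CRITIC_PROMPT; assumed here to all be mandatory.
REQUIRED_DIMENSIONS = ("accuracy", "completeness", "clarity", "conciseness")


def parse_json(raw: str) -> dict:
    """Parse a critic reply into a dict; raise ValueError if it is malformed."""
    # Models often surround JSON with prose, so take the outermost braces.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in critic reply")

    # json.JSONDecodeError is a subclass of ValueError, so callers can
    # handle both failure cases with a single except clause.
    data = json.loads(match.group(0))

    scores = data.get("scores")
    suggestions = data.get("suggestions", [])
    if not isinstance(scores, dict) or not isinstance(suggestions, list):
        raise ValueError("critic reply does not match the expected schema")

    # Recompute approval: a FAIL or a missing dimension always blocks,
    # even if the model set "approved" to true.
    no_fail = all(scores.get(d) in ("PASS", "WARN") for d in REQUIRED_DIMENSIONS)
    return {
        "scores": scores,
        "suggestions": [str(s) for s in suggestions],
        "approved": data.get("approved") is True and no_fail,
    }
```

Raising on malformed output leaves the policy to the caller: retry the critic call, or count the failure against the iteration budget. Returning an empty result instead would make `self_critique_loop` accept the draft, because an empty suggestion list is one of its exit conditions.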
Self-critique is fast and cheap. There is no coordination overhead, no additional model to manage, and the loop runs synchronously. Its weakness is that it shares the generator's blind spots. If the model does not know that a technical claim is wrong, it will not catch it in the critic role either. Self-critique improves form — clarity, completeness, tone — more reliably than it improves substance.
For use cases where the main risks are structural (missing information, poor organization, unclear language), self-critique is often sufficient. For use cases where the main risk is factual accuracy or logical correctness, a separate critic with different context or a different model adds real value.
The Dedicated Critic
A dedicated critic is a separate agent with its own system prompt and, optionally, its own tools. The generator and critic are independent: they have different roles, different access to information, and possibly different models. The generator optimizes for producing useful output; the critic optimizes for finding problems.
The key advantage is independence. The critic does not share the generator's frame of reference, assumptions, or stylistic preferences. It can catch errors the generator was blind to precisely because it approaches the output fresh. You can also give the critic tools the generator does not have — a static analyzer, a test runner, a fact-check API, a schema validator — so it performs active verification rather than just reasoning about correctness.
class ReviewLoop:
    def __init__(self, generator, critic, max_iterations=3):
        self.generator = generator
        self.critic = critic
        self.max_iterations = max_iterations

    def run(self, task):
        draft = self.generator.run(task)
        history = [{"draft": draft, "critique": None}]

        for i in range(self.max_iterations):
            critique = self.critic.evaluate(task=task, draft=draft)
            history[-1]["critique"] = critique

            if critique.approved:
                return draft, history

            revision_task = build_revision_prompt(
                original_task=task,
                draft=draft,
                suggestions=critique.suggestions,
            )
            draft = self.generator.run(revision_task)
            history.append({"draft": draft, "critique": None})

        return draft, history


class CodeReviewer:
    """A critic agent specialized for code review."""

    def __init__(self, model):
        self.model = model
        self.tools = [run_linter, run_type_checker, run_tests, check_security]

    def evaluate(self, task, draft):
        # Run automated checks first
        lint_issues = run_linter(draft)
        type_errors = run_type_checker(draft)
        test_results = run_tests(draft)
        security_issues = check_security(draft)

        # Ask the model to reason about what the automated checks found
        # and add its own code review judgment
        response = call_model(
            model=self.model,
            system=CODE_REVIEWER_PROMPT,
            messages=[{
                "role": "user",
                "content": (
                    f"Task: {task}\n\n"
                    f"Code:\n{draft}\n\n"
                    f"Lint issues: {lint_issues}\n"
                    f"Type errors: {type_errors}\n"
                    f"Test results: {test_results}\n"
                    f"Security issues: {security_issues}"
                ),
            }],
        )
        return parse_critique(response.text)
The code review loop is the clearest example of where a dedicated critic pays off. A linter catches style issues the LLM might overlook; a type checker catches errors the model cannot reliably detect from static reasoning; a test runner provides ground truth about whether the code actually works. The critic combines machine-checked results with model-level judgment — something the generator alone cannot do.
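To make the idea of machine-checked critic input concrete, here is a small sketch of two deterministic checks built on Python's standard `ast` module. They are hypothetical stand-ins for helpers such as `run_linter`, which the code above leaves undefined; a real setup would more likely shell out to an actual linter, type checker, and test runner. The function names `run_syntax_check` and `find_bare_excepts` are invented for this example.

```python
import ast


def run_syntax_check(source: str) -> list[str]:
    """Return syntax problems in Python source; an empty list means it parses."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return [f"line {exc.lineno}: {exc.msg}"]
    return []


def find_bare_excepts(source: str) -> list[str]:
    """Flag `except:` clauses with no exception type.

    Assumes the source already parses; call run_syntax_check first.
    """
    tree = ast.parse(source)
    return [
        f"line {node.lineno}: bare except"
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and node.type is None
    ]
```

Findings like these can be passed to the critic model as plain strings, the same way `lint_issues` is interpolated into the prompt above. Because they come from a parser rather than from model reasoning, they do not depend on the model noticing the problem.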
Structured Critique
The quality of the revision depends entirely on the quality of the critique. Vague feedback produces vague revisions. "The code could be cleaner" tells the generator nothing actionable. "The process_users function has a nested loop O(n²) — use a set lookup instead" gives the generator exactly what it needs to improve.
Structured critique formats make feedback more actionable and the loop more reliable:
from dataclasses import dataclass, field
from typing import Literal


@dataclass
class CritiqueFinding:
    dimension: str                               # "correctness", "style", etc.
    severity: Literal["blocking", "suggestion"]  # must-fix vs. nice-to-have
    location: str                                # "line 42", "function foo", etc.
    description: str                             # what is wrong
    recommendation: str                          # how to fix it


@dataclass
class Critique:
    findings: list[CritiqueFinding] = field(default_factory=list)
    approved: bool = False

    @property
    def blocking_findings(self):
        return [f for f in self.findings if f.severity == "blocking"]

    @property
    def suggestions(self):
        return [f for f in self.findings if f.severity == "suggestion"]
Separating blocking findings from suggestions is important. If the loop only iterates when there are blocking findings, it will not spin endlessly trying to perfect prose. Blocking issues are correctness problems, security vulnerabilities, missing requirements. Suggestions are improvements — the output can ship without them.
This distinction also changes the stopping condition. "No blocking findings" is a cleaner acceptance criterion than "no findings at all." Chasing zero suggestions risks infinite loops and gold-plating.
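The loops in this article call a `build_revision_prompt` helper without showing it. One possible shape, sketched here under the assumption that it receives structured findings like the `CritiqueFinding` dataclass above (redeclared so the block runs on its own), is to list blocking findings first and include the location and recommendation for each. The exact wording of the prompt and the positional signature are illustrative choices, not the author's implementation.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class CritiqueFinding:
    dimension: str
    severity: Literal["blocking", "suggestion"]
    location: str
    description: str
    recommendation: str


def build_revision_prompt(task: str, draft: str, findings: list[CritiqueFinding]) -> str:
    """Turn structured findings into a revision request for the generator."""
    # sorted() is stable, so findings keep their original order within
    # each severity group; blocking items sort first (False < True).
    ordered = sorted(findings, key=lambda f: f.severity != "blocking")
    lines = [
        f"- [{f.severity}] {f.dimension} at {f.location}: "
        f"{f.description} Fix: {f.recommendation}"
        for f in ordered
    ]
    return (
        f"Original task:\n{task}\n\n"
        f"Your previous draft:\n{draft}\n\n"
        "Reviewer findings (blocking items must be fixed):\n"
        + "\n".join(lines)
        + "\n\nRevise the draft. Keep everything that was correct."
    )
```

Putting the location and the recommendation in every line is what turns "the code could be cleaner" into something the generator can act on.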
Stopping Criteria
The hardest design decision in this pattern is when to stop. Three stopping conditions are worth combining:
Approval signal. The critic explicitly marks the output as approved. This is the happy path — the system exits cleanly when quality is met, potentially before the iteration budget is exhausted.
Iteration budget. A hard cap on the number of iterations. This is the safety valve. When the cap is reached, the system returns the most recent draft (or the best one seen so far, if it tracks quality across rounds), regardless of whether the critic approved it. Without this, a pathological loop can iterate indefinitely, especially when the critic and generator are stuck in a disagreement cycle where each revision introduces a new problem.
Diminishing returns detection. If successive iterations produce no meaningful improvement — the same findings keep appearing, the quality score stops rising — the system stops iterating early. This saves cost without waiting for the full budget.
def run_with_early_stopping(task, generator, critic, max_iterations=5):
    draft = generator.run(task)
    previous_findings = frozenset()

    for i in range(max_iterations):
        critique = critic.evaluate(task=task, draft=draft)

        if critique.approved:
            return draft, "approved"

        current_findings = frozenset(
            (f.dimension, f.description) for f in critique.blocking_findings
        )

        # Diminishing returns: same findings as last round, stop iterating
        if i > 0 and current_findings == previous_findings:
            return draft, "stalled"
        previous_findings = current_findings

        if not critique.blocking_findings:
            # No blockers: ship with suggestions outstanding
            return draft, "acceptable"

        draft = generator.run(
            build_revision_prompt(task, draft, critique.blocking_findings)
        )

    # Cap reached: return the latest revision, which has not been re-evaluated
    return draft, "budget_exhausted"
The "stalled" exit is important. If the critic keeps finding the same issues across iterations, the generator is not making progress — either the issue is beyond the generator's capability, the critique is poorly specified, or the generator and critic have conflicting constraints. Continuing to iterate wastes tokens and produces the same output. Better to surface the issue to a human or fall back to the last acceptable state.
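Falling back to the last acceptable state requires remembering earlier drafts. The sketch below is one way to do that, under simplifying assumptions: `generate` and `evaluate` are hypothetical callables (a prompt in, a draft out; a task and draft in, a list of blocking-finding strings out), and "best" simply means fewest blocking findings. Real quality scoring would likely need to be more careful than a count.

```python
def run_tracking_best(task, generate, evaluate, max_iterations=5):
    """Iterate, but return the evaluated draft with the fewest blockers."""
    draft = generate(task)
    best_draft, best_count = draft, None

    for _ in range(max_iterations):
        blockers = evaluate(task, draft)

        # Remember the strongest draft the critic has actually evaluated.
        if best_count is None or len(blockers) < best_count:
            best_draft, best_count = draft, len(blockers)

        if not blockers:
            return draft, "acceptable"

        feedback = "\n".join(f"- {b}" for b in blockers)
        draft = generate(
            f"{task}\n\nPrevious draft:\n{draft}\n\nFix these issues:\n{feedback}"
        )

    # The final revision was never evaluated, so it is not eligible here.
    return best_draft, "budget_exhausted"
```

Returning only evaluated drafts is a deliberate choice in this sketch: a revision that no critic has looked at may be worse than its predecessor, which is exactly the regression the loop is supposed to guard against.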
When to Use Iteration
Not every task benefits from a review loop. The question is whether the quality gain per iteration exceeds the cost per iteration. That ratio varies by task type.
Quality-critical outputs benefit most: legal documents, security-sensitive code, medical information, financial analysis. A single pass is insufficient because the cost of an undetected error is high. The review loop shifts the risk curve.
Open-ended creative tasks benefit when there is a defined standard to check against. "Write a good essay" is hard to critique objectively. "Write an essay that covers these five claims, uses no jargon, and stays under 500 words" can be critiqued mechanically. The more concrete the acceptance criteria, the more useful the loop.
Short, factual, constrained tasks rarely benefit. If the task is "extract the date from this invoice," one pass is enough. Adding a review loop adds latency and cost without improving a result that is either right or wrong on the first try.
The Multi-Agent Review Board
The two-agent loop (generator + single critic) can be extended to a panel. Multiple critics evaluate the output in parallel, each from a different perspective, and the generator revises against all of their feedback simultaneously.
                 ┌──────────────┐
                 │  Generator   │
                 └──────┬───────┘
                        │ draft
       ┌────────────────┼────────────────┐
       ▼                ▼                ▼
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│  Technical  │  │  Security   │  │    Style    │
│  Reviewer   │  │  Reviewer   │  │  Reviewer   │
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       │                │                │
       └────────────────┼────────────────┘
                        │ merged critique
                 ┌──────▼───────┐
                 │  Generator   │ (revision)
                 └──────────────┘
Each reviewer checks a different dimension (correctness, security, style) in parallel. The generator receives a merged critique covering all dimensions and revises once, rather than sequentially addressing one dimension at a time. This takes fewer rounds than sequential single-critic passes, and it is typically more thorough than a single critic asked to cover every dimension at once.
The tradeoff is conflict. When reviewers give contradictory feedback — the technical reviewer wants more detail, the style reviewer wants less length — the generator has to resolve the conflict. You can prioritize reviewers explicitly (technical correctness > style) or add a fourth agent whose job is to merge and prioritize the critiques before they reach the generator.
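The parallel fan-out and the explicit prioritization described above can be sketched with the standard library's thread pool. Everything here is an assumption for illustration: reviewers are modeled as plain callables that take a draft and return a list of finding strings, and `merge_critiques` only orders and deduplicates findings. It does not detect contradictions; it gives the generator a list in which earlier items should win when two conflict.

```python
from concurrent.futures import ThreadPoolExecutor


def review_in_parallel(draft, reviewers):
    """Run every reviewer on the same draft concurrently.

    reviewers: dict mapping a reviewer name to a callable(draft) -> list[str].
    """
    with ThreadPoolExecutor(max_workers=max(1, len(reviewers))) as pool:
        futures = {name: pool.submit(fn, draft) for name, fn in reviewers.items()}
        return {name: future.result() for name, future in futures.items()}


def merge_critiques(results, priority):
    """Flatten findings, highest-priority reviewer first, dropping exact duplicates."""
    # Reviewers missing from the priority list go last, in their original order.
    order = list(priority) + [name for name in results if name not in priority]
    merged, seen = [], set()
    for name in order:
        for finding in results.get(name, []):
            if finding not in seen:
                seen.add(finding)
                merged.append((name, finding))
    return merged
```

Threads fit here because reviewer calls are typically network-bound model requests. A dedicated merging agent, as mentioned above, would replace `merge_critiques` with another model call when ordering alone is not enough.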
Failure Modes
Oscillation. The generator fixes issue A in iteration 2, and the critic flags a new issue B introduced by that fix. In iteration 3, fixing B reintroduces A. The system oscillates between two imperfect states and never converges. The diminishing returns detection described above is the primary defense, but the root cause is usually a critic that applies conflicting constraints. Explicit priority ordering in the critic's prompt helps: fix correctness before style; never sacrifice correctness for brevity.
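Comparing findings only against the previous round misses an A, B, A, B pattern, because no two consecutive rounds match. A minimal sketch of a detector that remembers every round follows. The class name `OscillationDetector` is hypothetical, and it assumes findings arrive as hashable tuples such as `(dimension, description)` and that the critic words a recurring issue identically each time, which a model-based critic will not always do.

```python
class OscillationDetector:
    """Remember each round's finding set and report when one recurs."""

    def __init__(self):
        self._first_seen = {}  # frozenset of findings -> iteration index

    def check(self, findings, iteration):
        """Return the earlier iteration with identical findings, or None."""
        key = frozenset(findings)
        if not key:
            # No findings at all is success, not a cycle.
            return None
        if key in self._first_seen:
            return self._first_seen[key]
        self._first_seen[key] = iteration
        return None
```

A non-`None` result means the loop has returned to a state it already visited, so further iterations are unlikely to converge without changing the critic's constraints or escalating to a human.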
Critic overfit. The generator learns to satisfy the critic rather than satisfy the original task. If the critic checks for specific surface features (uses bullet points, includes a summary section, references the requirements), the generator can produce output that scores well on those features while being subtly wrong in ways the critic does not check. Auditing the critic's criteria against the actual quality goals is a maintenance task that is easy to skip and costly to overlook.
Cost explosion. Each iteration adds two model calls (generate + critique) and potentially tool execution costs. With one initial generation plus a critique and a revision per round, a five-iteration loop makes up to 11 model calls, so for a complex task it can cost 10x or more what a single-pass approach costs. Budget limits are not optional; they are required infrastructure. Monitor cost-per-task in production to catch loops that are routinely hitting their maximums; that signals either the task is too hard for the current setup or the acceptance criteria are too strict.
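The arithmetic behind that estimate, and a programmatic limit to go with it, can be sketched as follows. The names `max_model_calls`, `CallBudget`, and `BudgetExceeded` are hypothetical, and the count assumes a loop with one initial generation followed by a critique and a revision in every iteration. It counts calls only; token growth from longer revision prompts and tool costs would come on top.

```python
def max_model_calls(max_iterations: int) -> int:
    """Worst-case model calls: 1 initial generation + (critique + revision) per round."""
    return 1 + 2 * max_iterations


class BudgetExceeded(RuntimeError):
    """Raised when a loop tries to spend past its call budget."""


class CallBudget:
    """Hard cap on model calls for a single task."""

    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.used = 0

    def spend(self, calls: int = 1) -> None:
        # Refuse before the call is made, so the cap is never overshot.
        if self.used + calls > self.max_calls:
            raise BudgetExceeded(
                f"{self.used + calls} calls would exceed the cap of {self.max_calls}"
            )
        self.used += calls


budget = CallBudget(max_calls=max_model_calls(5))  # 11 calls for five rounds
```

Calling `budget.spend()` before every model request, and logging `budget.used` per task, gives the cost-per-task signal mentioned above.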
Hallucinated approval. The critic approves output that does not meet the criteria — either because the criteria were underspecified or because the critic model drifted. Adding an objective acceptance test alongside the model-based critic — running the output through a validator, a test suite, a schema check — provides a ground truth signal that cannot hallucinate.
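One way to wire in such an objective signal is to require both the critic's approval and a set of deterministic checks. The sketch below assumes, purely for illustration, that the output is supposed to be a JSON object with certain keys; `is_valid_json_object` and `gated_approval` are hypothetical names, and the checks could equally be a test suite or a schema validator.

```python
import json


def is_valid_json_object(draft: str, required_keys) -> bool:
    """Objective check: the draft parses as a JSON object with the required keys."""
    try:
        data = json.loads(draft)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required_keys)


def gated_approval(model_approved: bool, draft: str, checks) -> bool:
    """Approve only if the critic approved and every objective check passes."""
    return model_approved and all(check(draft) for check in checks)
```

With this gate, a critic that approves a malformed output no longer ends the loop, because the deterministic check can veto it.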
The Loop as a Quality Gate
The review-and-critique loop is a quality gate — a checkpoint that output must pass before it leaves the system. Thinking of it that way clarifies the design requirements: what exactly must the output satisfy to pass? Who (or what) decides if it passed? What happens when it does not pass within budget?
Answering those questions forces precision about acceptance criteria that is valuable even if you never implement the loop. If you cannot specify what "good output" means clearly enough for a critic to evaluate it, you probably cannot evaluate it manually either. The discipline of writing a critic prompt is a useful design exercise on its own.
Conclusion
The review-and-critique loop improves output quality by replacing a single generation pass with a feedback cycle: generate, evaluate, revise, repeat. Key takeaways:
- Self-critique works well for structural improvements — clarity, completeness, tone — but shares the generator's blind spots on substance. Use it when the main risks are form-related.
- A dedicated critic with its own tools and context provides independent evaluation that catches what the generator cannot see. Essential for correctness-critical tasks.
- Structured critique with severity levels (blocking vs. suggestion) makes feedback actionable and keeps the loop focused on what matters.
- Stopping criteria must be explicit: an approval signal, an iteration budget, and diminishing-returns detection. Without all three, the loop is either too eager to stop or too expensive to run.
- Oscillation and cost explosion are the primary failure modes. Oscillation is controlled through explicit priority ordering in the critic prompt, backed by stall detection; cost explosion is controlled through programmatic budget limits.
- The pattern is worth applying when the cost of an undetected error exceeds the cost of iteration. For constrained, factual tasks, single-pass is usually enough.