Reasoning Beyond ReAct
ReAct gives an agent the ability to think one step at a time, take an action, observe the result, and decide what to do next. That loop handles a remarkable range of tasks. But it has a core limitation: it produces a single chain of reasoning. One thought leads to one action leads to one observation. If the model takes a wrong turn at step two, it has no mechanism to course-correct except to notice the mistake later and try to recover — still inside the same linear trace.
For harder problems — multi-step math, ambiguous questions, complex code generation, strategic decisions with many possible paths — a single chain is not enough. The reasoning itself needs structure. Branching. Backtracking. Self-evaluation. Multiple attempts that get compared and filtered.
This is the territory beyond ReAct: techniques that invest more compute into the reasoning phase to produce more reliable outputs. They do not replace ReAct — they augment or wrap it, making the agent's thinking deeper before it commits to an action.
Chain-of-Thought as a Foundation
Before we go further, it helps to be precise about the baseline. Chain-of-thought (CoT) prompting asks the model to reason step by step before giving a final answer. You can trigger it with a simple instruction like "think step by step" or by providing a few-shot example that demonstrates intermediate reasoning.
def chain_of_thought(question: str) -> str:
    # call_model is the model-client wrapper assumed throughout this article.
    prompt = f"""Answer the following question.
Think step by step before giving your final answer.

Question: {question}
"""
    response = call_model(prompt)
    return response.text
CoT is powerful — it can turn a model that fails at arithmetic into one that gets the right answer by showing its work. But it has three weaknesses that matter for agents:
Single path. The model produces one reasoning chain. If that chain has a flaw — a wrong assumption, an arithmetic error, a logical gap — the final answer inherits the flaw. There is no redundancy.
No self-evaluation. The model does not look back at its own reasoning to check for errors. It generates the chain left-to-right and commits.
No exploration. The model cannot try multiple approaches and pick the best one. It is locked into whatever direction the first few tokens push it toward.
The techniques that follow each address one or more of these weaknesses.
Self-Consistency: Voting Across Paths
The simplest extension of chain-of-thought is to run it multiple times and take a vote. This is self-consistency — the insight being that if you sample several independent reasoning chains for the same question, the correct answer tends to appear more often than any particular wrong answer.
              Question
             /    |    \
            /     |     \
      Chain A  Chain B  Chain C
      ans: 42  ans: 42  ans: 37
            \     |     /
             \    |    /
           Majority Vote
               → 42
The implementation is straightforward:
import collections

def self_consistency(question: str, n_samples: int = 5, temperature: float = 0.7) -> str:
    answers = []
    for _ in range(n_samples):
        # Each call samples one independent reasoning chain.
        response = call_model(
            build_cot_prompt(question),
            temperature=temperature,
        )
        answer = extract_final_answer(response.text)
        answers.append(answer)

    # Majority vote over the extracted final answers.
    counter = collections.Counter(answers)
    best_answer, count = counter.most_common(1)[0]
    # Agreement rate; log it or return it with the answer if callers need it.
    confidence = count / n_samples
    return best_answer
You need a non-zero temperature; otherwise the samples come out (nearly) identical and voting is pointless. A range of about 0.5–0.8 is a common starting point: enough diversity to explore different paths, not so much that the chains become incoherent.
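The voting code relies on two helpers, `build_cot_prompt` and `extract_final_answer`, that this section does not define. One way they might look is sketched below. The `Final answer:` marker and the normalization rules (lowercasing, stripping a trailing period) are assumptions made for illustration, not a fixed convention.

```python
import re

def build_cot_prompt(question: str) -> str:
    """Hypothetical helper: wrap a question in a step-by-step instruction."""
    return (
        "Answer the following question.\n"
        "Think step by step, then finish with a line of the form\n"
        "'Final answer: <answer>'.\n\n"
        f"Question: {question}\n"
    )

# Assumed output convention: the chain ends with "Final answer: ...".
_FINAL_ANSWER = re.compile(r"final answer:\s*(.+)", re.IGNORECASE)

def extract_final_answer(text: str) -> str:
    """Return the last 'Final answer: ...' value, normalized for voting."""
    matches = _FINAL_ANSWER.findall(text)
    if not matches:
        # Marker missing: fall back to the last non-empty line.
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        return lines[-1].lower() if lines else ""
    return matches[-1].strip().rstrip(".").lower()
```

Normalization matters more than it looks: if "42" and "42." count as different answers, the vote splits and the majority can be lost.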
When to use it: Problems with a clear final answer (math, factual questions, classification) where you want higher reliability without changing the prompt. It works poorly for open-ended generation where there is no single "correct" answer to converge on.
Trade-offs: You pay N times the compute for N samples. Latency increases unless you run samples in parallel. But the accuracy gain is often worth it: on reasoning-heavy tasks, self-consistency can turn, say, a 70% success rate into something closer to 90%, though the size of the gain depends on the task and the model.
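To keep latency close to that of a single call, the samples can be drawn concurrently. The sketch below uses a thread pool from the standard library. `sample_answer` is a hypothetical zero-argument callable that would wrap one model call plus answer extraction; it is passed in so the voting logic stays independent of any particular client.

```python
import collections
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def parallel_self_consistency(
    sample_answer: Callable[[], str],
    n_samples: int = 5,
) -> tuple[str, float]:
    """Draw n_samples answers concurrently; return (winner, agreement rate)."""
    with ThreadPoolExecutor(max_workers=n_samples) as pool:
        futures = [pool.submit(sample_answer) for _ in range(n_samples)]
        answers = [future.result() for future in futures]
    best, count = collections.Counter(answers).most_common(1)[0]
    return best, count / n_samples
```

Threads fit here because the work is I/O-bound (waiting on the model API). Provider rate limits still apply, so `max_workers` may need a cap.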
In an agent context: You can wrap self-consistency around specific decisions within a ReAct loop. Before the model commits to a tool call, sample the reasoning step multiple times and only proceed if the majority agrees on which tool to call and with what arguments. This is especially valuable for high-stakes actions.
import collections

def consensus_action(state: dict, tools: list, n_samples: int = 3) -> dict:
    proposals = []
    for _ in range(n_samples):
        response = call_model(
            build_react_prompt(state, tools),
            temperature=0.6,
        )
        action = parse_tool_call(response)
        proposals.append(action)

    # Group by tool name + arguments. freeze() must turn the argument dict
    # into something hashable; unfreeze() reverses it.
    groups = collections.Counter(
        (p["tool"], freeze(p["arguments"])) for p in proposals
    )
    best, count = groups.most_common(1)[0]

    # Require a strict majority of the samples.
    if count < n_samples // 2 + 1:
        # No consensus: fall back to a greedy (temperature=0) call
        response = call_model(build_react_prompt(state, tools), temperature=0.0)
        return parse_tool_call(response)
    return {"tool": best[0], "arguments": unfreeze(best[1])}
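`consensus_action` depends on `freeze` and `unfreeze`, which are not defined in this section. A minimal sketch, assuming tool arguments are JSON-serializable, is to use a canonical JSON string as the hashable key:

```python
import json

def freeze(arguments: dict) -> str:
    """Hypothetical helper: canonical, hashable form of a tool-argument dict."""
    # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} produce the same key.
    return json.dumps(arguments, sort_keys=True, separators=(",", ":"))

def unfreeze(frozen: str) -> dict:
    """Inverse of freeze(): recover the argument dict from its canonical form."""
    return json.loads(frozen)
```

Exact-match grouping is strict: two proposals that differ only in a free-text argument (say, a slightly reworded search query) land in different groups. For such tools it may be enough to vote on the tool name alone.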
Tree-of-Thought: Branching and Backtracking
Self-consistency runs multiple chains independently — they never interact. Tree-of-Thought (ToT) goes further. It structures reasoning as a tree: the model generates multiple possible next steps at each node, evaluates them, and only expands the most promising branches. Bad branches get pruned. The model can backtrack.
                 Problem
                /       \
          Step A1       Step A2
          /     \          |
     Step B1  Step B2   Step B3
        |        ✗         |
     Step C1            Step C2
        |                  |
      Answer             Answer
        ↑                  ↑
     Score: 8           Score: 6
        ✓
This mirrors how a human expert solves a hard problem: consider a few approaches, mentally evaluate each one, pursue the most promising, and back up if you hit a dead end.
def tree_of_thought(problem: str, breadth: int = 3, depth: int = 4) -> str:
    # Breadth-first beam search: pruned branches are dropped for good, so
    # "backtracking" here means abandoning weak branches at each level.
    root = {"state": problem, "steps": [], "score": None}
    frontier = [root]
    for level in range(depth):
        candidates = []
        for node in frontier:
            # Generate multiple possible next steps
            next_steps = generate_next_steps(node, n=breadth)
            for step in next_steps:
                child = {
                    "state": node["state"] + "\n" + step,
                    "steps": node["steps"] + [step],
                    "score": None,
                }
                # Evaluate how promising this path looks
                child["score"] = evaluate_state(child["state"])
                candidates.append(child)
        if not candidates:
            break  # nothing was proposed; keep the previous frontier
        # Keep only the top-k most promising branches
        candidates.sort(key=lambda c: c["score"], reverse=True)
        frontier = candidates[:breadth]
    # Return the best-scoring leaf
    best = max(frontier, key=lambda c: c["score"])
    return extract_answer(best["state"])

def generate_next_steps(node: dict, n: int) -> list[str]:
    prompt = f"""Given this problem and partial solution, propose {n} different
possible next steps. Return each step on a separate line.

Problem and progress so far:
{node['state']}
"""
    response = call_model(prompt, temperature=0.8)
    # Drop blank lines so they are not counted as steps.
    steps = [line.strip() for line in response.text.splitlines() if line.strip()]
    return steps[:n]

def evaluate_state(state: str) -> float:
    prompt = f"""Evaluate how promising this partial solution is.
Rate it from 0 to 10, where 10 means it is almost certainly on the right track.
Return only the number.

{state}
"""
    response = call_model(prompt, temperature=0.0)
    # Raises ValueError if the reply is anything other than a bare number.
    return float(response.text.strip())
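The evaluator calls `float()` directly on the model's reply, which raises an error whenever the model adds words around the number ("Score: 7/10"). A more forgiving parser, sketched here as a hypothetical `parse_score` helper, takes the first number it finds and clamps it to the rating scale:

```python
import re

def parse_score(text: str, low: float = 0.0, high: float = 10.0,
                default: float = 0.0) -> float:
    """Extract the first number in a model reply and clamp it to [low, high]."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if match is None:
        # No number at all: treat the branch as unpromising rather than crash.
        return default
    return max(low, min(high, float(match.group())))
```

Defaulting to the lowest score is a design choice: an unparseable evaluation prunes the branch. If that is too aggressive, re-asking the evaluator once is a reasonable alternative.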
When to use it: Problems where you can evaluate partial progress — puzzles, planning problems, multi-step reasoning where you can tell midway whether you are on a good track. ToT shines when the search space is large but you can prune aggressively.
Trade-offs: ToT is expensive. At each level, you generate breadth candidates for each node in the frontier, and evaluate every one. For breadth 3 and depth 4, the loop above makes 40 model calls for a single question: 4 at the first level (1 generation plus 3 evaluations) and 12 at each of the remaining three levels (3 generations plus 9 evaluations). Latency is significant even with parallelism. The evaluation function is also critical: if the model cannot accurately judge partial solutions, pruning becomes random and the tree degrades to expensive random search.
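The cost of a given configuration can be worked out before running anything. The helper below (a hypothetical name, `tot_call_count`) counts model calls for the beam-search variant shown earlier, assuming every generation call returns exactly `breadth` steps and the frontier is capped at `breadth` nodes.

```python
def tot_call_count(breadth: int, depth: int) -> int:
    """Count model calls for a beam search that keeps `breadth` nodes per level."""
    calls = 0
    frontier_size = 1  # the search starts from the root alone
    for _ in range(depth):
        generation_calls = frontier_size            # one proposal call per node
        evaluation_calls = frontier_size * breadth  # one scoring call per candidate
        calls += generation_calls + evaluation_calls
        frontier_size = min(breadth, frontier_size * breadth)
    return calls
```

Because the frontier is capped, cost grows linearly with depth but roughly quadratically with breadth, so widening the search is the more expensive knob.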
In an agent context: ToT is most useful for the planning phase. When an agent needs to decide on a strategy (which tools to use, in what order, for what sub-goals), you can explore multiple plans as branches, evaluate each one, and commit to the best. Once committed, execution can proceed with standard ReAct.
Reflection and Self-Critique
Instead of exploring multiple paths in parallel, reflection adds a sequential self-check: the model generates a response, then evaluates its own output, identifies errors or weaknesses, and tries again.
┌────────────┐
│  Generate  │
│  Response  │
└─────┬──────┘
      │
      ▼
┌────────────┐      ┌────────────┐
│  Critique  │────▶ │   Revise   │
│  (self or  │      │  Response  │
│   other)   │      └─────┬──────┘
└────────────┘            │
      ▲                   │
      └───────────────────┘
      (repeat until satisfied
       or iteration limit)
Models are often better at evaluating text than generating it. A model that writes mediocre code on the first try can frequently spot the bugs when asked to review its own output — and fix them on a second pass.
def reflexion(task: str, max_iterations: int = 3) -> str:
    response = generate_initial(task)
    for _ in range(max_iterations):
        critique = self_critique(task, response)
        if critique["passed"]:
            return response
        response = revise(task, response, critique["feedback"])
    # Iteration limit reached: return the latest revision, unverified.
    return response

def self_critique(task: str, response: str) -> dict:
    prompt = f"""You are reviewing a response to the following task.
Identify any errors, logical gaps, or areas for improvement.
If the response is correct and complete, say PASS.
Otherwise, provide specific feedback.

Task: {task}

Response to review:
{response}
"""
    result = call_model(prompt, temperature=0.0)
    if result.text.strip().upper().startswith("PASS"):
        return {"passed": True, "feedback": None}
    return {"passed": False, "feedback": result.text}

def revise(task: str, previous: str, feedback: str) -> str:
    prompt = f"""Revise the following response based on the feedback.
Fix the identified issues while preserving what was correct.

Task: {task}

Previous response:
{previous}

Feedback:
{feedback}

Revised response:
"""
    return call_model(prompt, temperature=0.0).text
This pattern appears under several names — Reflexion, self-refine, iterative critique — but the core loop is always the same: generate, evaluate, revise.
Variants:
- Same-model reflection: The same model generates and critiques. Simple to implement, but the model may have blind spots it cannot see in its own output.
- Cross-model reflection: A different model (or the same model with a different system prompt) acts as the critic. This breaks the blind-spot problem but doubles the infrastructure.
- Tool-grounded reflection: The critique step uses tools — running tests, checking facts against a database, validating against a schema. This gives the critique concrete evidence rather than relying on the model's judgment alone.
In an agent context: Reflection fits naturally into the coding-agent pattern. Generate code, run tests, read failures, fix the code. But it also works for any agent output that can be validated: SQL queries (run them and check for errors), API calls (dry-run and validate the response schema), research summaries (check claims against retrieved sources).
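As an illustration of tool-grounded reflection, the critique step for generated Python code can run the code against test cases and report concrete failures instead of asking a model for its opinion. The sketch below is hypothetical: `tool_grounded_critique` and its test-case format are assumptions, and it returns the same `{"passed", "feedback"}` shape as `self_critique` so it could stand in for it.

```python
def tool_grounded_critique(source: str, func_name: str,
                           cases: list[tuple[tuple, object]]) -> dict:
    """Run candidate code against (args, expected) cases; report failures."""
    namespace: dict = {}
    try:
        # WARNING: exec runs arbitrary code. Use a sandbox for untrusted output.
        exec(source, namespace)
    except Exception as exc:
        return {"passed": False, "feedback": f"Code failed to load: {exc!r}"}

    func = namespace.get(func_name)
    if not callable(func):
        return {"passed": False, "feedback": f"{func_name} is not defined"}

    failures = []
    for args, expected in cases:
        try:
            actual = func(*args)
        except Exception as exc:
            failures.append(f"{func_name}{args} raised {exc!r}")
            continue
        if actual != expected:
            failures.append(
                f"{func_name}{args} returned {actual!r}, expected {expected!r}"
            )

    if failures:
        return {"passed": False, "feedback": "\n".join(failures)}
    return {"passed": True, "feedback": None}
```

The feedback string names the exact inputs that failed, which gives the revise step something specific to fix.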
Extended Thinking and Hidden Scratchpads
Some model providers offer extended thinking — a mode where the model performs a long internal chain of reasoning before producing its visible response. The reasoning is hidden from the user (and sometimes from the application) but influences the final output.
From an architectural standpoint, this is chain-of-thought baked into inference rather than prompted. The model allocates extra compute budget to the reasoning phase, exploring and self-correcting internally before committing to an answer.
def extended_thinking_call(question: str, thinking_budget: int = 10000) -> str:
    # The shape of the `thinking` argument is illustrative; parameter names
    # and budget limits differ between providers.
    response = call_model(
        prompt=question,
        thinking={
            "type": "enabled",
            "budget_tokens": thinking_budget,
        },
    )
    # response.thinking contains the internal reasoning (if exposed)
    # response.text contains the final answer
    return response.text
The trade-off is economic: you pay for the thinking tokens even though the user never sees them. A 10,000-token thinking budget on a complex question might cost 5–10x what a direct answer would, depending on how much of the budget the model actually uses and how the provider prices those tokens. But for hard reasoning tasks (math proofs, complex code, multi-constraint planning), the accuracy gain can be large enough to justify it.
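A rough way to reason about the cost multiplier is to compare the billed tokens with and without thinking. The sketch below rests on two labeled assumptions: thinking tokens are billed at the output-token rate, and an output token costs `output_price_ratio` times an input token. Neither holds for every provider, so treat the numbers as a template rather than a price list.

```python
def thinking_cost_ratio(input_tokens: int, answer_tokens: int,
                        thinking_tokens: int,
                        output_price_ratio: float = 5.0) -> float:
    """Cost of a call with thinking divided by the cost of the call without it.

    Assumes thinking tokens are billed like output tokens, and that one output
    token costs `output_price_ratio` input tokens (both are assumptions).
    """
    baseline = input_tokens + output_price_ratio * answer_tokens
    if baseline <= 0:
        raise ValueError("baseline call must contain at least one token")
    with_thinking = baseline + output_price_ratio * thinking_tokens
    return with_thinking / baseline
```

Under these assumptions, 2,000 input tokens, a 1,500-token answer, and a fully used 10,000-token budget give a ratio a little above 6. The shorter the direct answer, the larger the multiplier becomes.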
When to use it in agents:
- As a drop-in upgrade for the reasoning step in ReAct. Replace your standard model call with an extended-thinking call for steps where reasoning quality matters most (the planning step, high-stakes decisions, final synthesis).
- For the evaluation step in tree-of-thought or reflection. A model that thinks longer will produce better critiques.
- When you want the benefits of chain-of-thought without bloating the visible context window. The thinking tokens are billed, but with many providers they are not carried forward in the conversation history that gets fed back on the next turn; check your provider's documentation for the exact behavior.
Putting It Together: A Reasoning Strategy
These techniques are not mutually exclusive. A well-designed agent can layer them:
┌──────────────────────────────────────────────────┐
│                    Agent Loop                    │
│                                                  │
│  ┌───────────────────────────────────────────┐   │
│  │  Planning Step                            │   │
│  │  • Tree-of-Thought to explore strategies  │   │
│  │  • Extended thinking for evaluation       │   │
│  └───────────────────────────────────────────┘   │
│                        │                         │
│                        ▼                         │
│  ┌───────────────────────────────────────────┐   │
│  │  Execution Steps (ReAct loop)             │   │
│  │  • Self-consistency for critical actions  │   │
│  │  • Standard CoT for routine steps         │   │
│  └───────────────────────────────────────────┘   │
│                        │                         │
│                        ▼                         │
│  ┌───────────────────────────────────────────┐   │
│  │  Output Validation                        │   │
│  │  • Reflection with tool-grounded checks   │   │
│  │  • Revise if critique fails               │   │
│  └───────────────────────────────────────────┘   │
└──────────────────────────────────────────────────┘
The guiding principle: spend reasoning compute where it matters most. Not every step in an agent loop needs tree search or majority voting. Most tool calls are straightforward — "fetch this URL," "read this file." The expensive reasoning techniques should be reserved for:
- Decisions that are hard to reverse (deleting data, sending messages, committing code)
- Planning steps where a bad strategy wastes many downstream actions
- Synthesis steps where the final output quality determines success or failure
- Ambiguous situations where the model's confidence is low
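The criteria in this list can be encoded as a small rule-based dispatcher. The sketch below is only one possible mapping: the tool names, the phase labels, and the strategy names are all hypothetical placeholders to adapt to your own agent.

```python
# Hypothetical tool names; substitute the irreversible tools in your own agent.
IRREVERSIBLE_TOOLS = frozenset({"delete_records", "send_message", "commit_code"})

def choose_strategy(tool: str, phase: str = "execute",
                    low_confidence: bool = False) -> str:
    """Map a pending step to a reasoning strategy; the cheapest option is last."""
    if phase == "plan":
        return "tree_of_thought"   # a bad plan wastes many downstream actions
    if phase == "synthesize":
        return "reflection"        # final output quality decides success
    if tool in IRREVERSIBLE_TOOLS or low_confidence:
        return "self_consistency"  # hard to reverse, or the model is unsure
    return "single_pass"           # routine step: one greedy call
```

Keeping the rules explicit makes the compute budget auditable: a glance at the dispatcher shows which steps are allowed to be expensive.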
A practical implementation might use a confidence router: if the model's logprobs on a decision are high (it is very confident), proceed with a single greedy sample. If confidence is low, escalate to self-consistency or extended thinking.
import math

# Assumed threshold: escalate unless every decision token has probability >= 0.9.
CONFIDENCE_THRESHOLD = math.log(0.9)

def adaptive_reasoning(state: dict, tools: list) -> dict:
    # First try: standard greedy call
    response = call_model(
        build_react_prompt(state, tools),
        temperature=0.0,
        return_logprobs=True,
    )
    action = parse_tool_call(response)
    # Lowest token log-probability among the tokens that encode the action
    confidence = min_logprob(response, action)
    if confidence > CONFIDENCE_THRESHOLD:
        return action
    # Low confidence: escalate to self-consistency
    return consensus_action(state, tools, n_samples=5)
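The router leaves `min_logprob` undefined. A simplified sketch of the idea follows, assuming the per-token log-probabilities for the action tokens have already been pulled out of the response into a plain list (how to do that depends on the client library). The function names and the 0.9 probability floor are illustrative assumptions.

```python
import math

def min_token_logprob(token_logprobs: list[float]) -> float:
    """Return the weakest token log-probability; -inf when nothing is given."""
    if not token_logprobs:
        return float("-inf")  # no evidence counts as no confidence
    return min(token_logprobs)

def is_confident(token_logprobs: list[float], min_probability: float = 0.9) -> bool:
    """True when every token was sampled with probability above the floor."""
    return min_token_logprob(token_logprobs) > math.log(min_probability)
```

Using the minimum rather than the mean is deliberate: a single uncertain token (for example, the tool name) is enough to make the whole action doubtful, and an average over many confident tokens would hide it.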
Conclusion
ReAct's single reasoning chain is the starting point, not the ceiling. When problems are hard enough that a single pass produces unreliable results, you have a toolkit of techniques to invest more compute in the reasoning phase.
Key takeaways:
- Self-consistency runs multiple reasoning chains and takes a majority vote — simple, parallelizable, and effective for problems with clear answers
- Tree-of-Thought structures reasoning as a search tree with branching, evaluation, and pruning — powerful for planning but expensive
- Reflection adds a generate-critique-revise loop that catches errors the first pass missed — especially effective when critiques can be grounded in tool output
- Extended thinking allocates hidden compute budget to internal reasoning — a drop-in upgrade that trades cost for accuracy
- These techniques compose: use tree search for planning, self-consistency for critical decisions, reflection for output validation, and standard CoT for routine steps
- The key design decision is where to spend extra reasoning compute — route it to high-stakes, low-confidence, or hard-to-reverse decisions rather than applying it uniformly