Layer 1: Surface
Given the goal "research competitors and draft a comparison report," an agent without planning will attempt to do everything in one loop: often calling tools in a confused order, losing track of sub-goals, and producing incomplete output.
Planning breaks the goal into a sequenced task graph before acting. The agent answers "what needs to happen?" before asking "how do I do the next thing?"
Three planning patterns, from simplest to most structured:
| Pattern | How it works | Use when |
|---|---|---|
| ReAct | Interleave reasoning with action: think, then act, then observe | Tasks where each step informs the next |
| Plan-then-execute | Generate a full plan upfront, then execute each step | Tasks with known structure and stable sub-goals |
| Dynamic replanning | Execute with a plan, revise the plan when observations contradict it | Long tasks where reality diverges from the plan |
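The "sequenced task graph" can be made concrete with a small step record. This is a sketch, not any library's API; the field names (`id`, `description`, `depends_on`) are illustrative, and `ready_steps` shows the scheduling question every pattern below has to answer:

```python
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    """One node in the task graph a planner produces."""
    id: str
    description: str
    depends_on: list[str] = field(default_factory=list)

def ready_steps(steps: list[PlanStep], done: set[str]) -> list[PlanStep]:
    """Steps whose dependencies have all completed and which haven't run yet."""
    return [s for s in steps
            if s.id not in done and all(d in done for d in s.depends_on)]
```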
Layer 2: Guided
ReAct: reason, act, observe
ReAct is not a library: it is a prompt pattern. Before each action, the model explicitly states its current reasoning:
```python
REACT_SYSTEM = """You are a research agent. For each step, follow this format exactly:

Thought: [your reasoning about what to do next and why]
Action: [the tool to call]
Observation: [you will see the result here]

Continue until you have enough information to answer, then respond directly."""

def run_react_agent(goal: str, tools: list[dict]) -> str:
    messages = [{"role": "user", "content": goal}]
    for _ in range(12):
        response = llm.chat(
            model="balanced",
            system=REACT_SYSTEM,
            messages=messages,
            tools=tools,
        )
        if response.stop_reason == "end_turn":
            return response.text
        messages.append({"role": "assistant", "content": response.content})
        results = execute_tools(response.tool_calls)
        messages.append({"role": "user", "content": results})
    raise RuntimeError("Agent hit the iteration cap without answering")
```
The "Thought:" prefix does two things: it forces the model to articulate its reasoning before acting (reducing impulsive tool calls), and it makes the agent's decision process visible in logs.
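Because the reasoning lives in plain text, surfacing it in logs is a one-regex job. This helper is a sketch and assumes the model actually followed the `Thought:` format:

```python
import re

def extract_thoughts(transcript: str) -> list[str]:
    """Pull each 'Thought:' line out of a ReAct transcript for logging."""
    return [m.group(1).strip()
            for m in re.finditer(r"^Thought:\s*(.+)$", transcript, re.MULTILINE)]
```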
Plan-then-execute
For tasks with a predictable structure, generate a plan first, then execute each step independently:
```python
def plan_then_execute(goal: str, tools: list[dict]) -> str:
    # Step 1: Generate a structured plan
    plan_response = llm.chat(
        model="balanced",
        messages=[{
            "role": "user",
            "content": f"""Break this goal into ordered steps. Be specific.
Output as a numbered list. Each step should be independently executable.

Goal: {goal}"""
        }]
    )
    steps = parse_plan(plan_response.text)

    # Step 2: Execute each step
    results = {}
    for i, step in enumerate(steps):
        context = format_prior_results(results)
        response = llm.chat(
            model="balanced",
            messages=[{
                "role": "user",
                "content": f"Prior results:\n{context}\n\nExecute this step: {step}"
            }],
            tools=tools,
        )
        results[i] = run_tool_loop_for_step(response, tools)

    # Step 3: Synthesise
    synthesis = llm.chat(
        model="balanced",
        messages=[{
            "role": "user",
            "content": f"Goal: {goal}\n\nStep results:\n{format_results(results)}\n\nWrite the final output."
        }]
    )
    return synthesis.text
```
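`parse_plan` is left undefined above. A minimal version, assuming the model returns one `N. step` item per line, might look like:

```python
import re

def parse_plan(text: str) -> list[str]:
    """Extract steps from a numbered list like '1. Search competitors'."""
    steps = []
    for line in text.splitlines():
        m = re.match(r"\s*\d+[.)]\s+(.*\S)", line)
        if m:
            steps.append(m.group(1))
    return steps
```

Matching only numbered lines also discards any preamble or commentary the model adds around the list.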
Plan-then-execute works well when sub-steps are independent and can run in parallel:
```python
import asyncio

async def parallel_plan_execute(goal: str, tools: list[dict]) -> str:
    steps = await generate_plan_async(goal)

    # Identify which steps are independent
    independent_steps = [s for s in steps if not s.get("depends_on")]
    dependent_steps = [s for s in steps if s.get("depends_on")]

    # Run independent steps in parallel
    parallel_results = await asyncio.gather(*[
        execute_step_async(step, tools) for step in independent_steps
    ])
    results = dict(zip([s["id"] for s in independent_steps], parallel_results))

    # Run dependent steps sequentially with prior results
    for step in dependent_steps:
        dep_results = {d: results[d] for d in step["depends_on"]}
        results[step["id"]] = await execute_step_async(step, tools, context=dep_results)

    return await synthesise_async(goal, results)
```
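The sketch above runs all dependent steps one at a time. Grouping the graph into dependency "waves" lets each wave run in parallel too; this helper assumes the same `id`/`depends_on` dict shape used above:

```python
def execution_waves(steps: list[dict]) -> list[list[dict]]:
    """Group steps into waves: each wave depends only on earlier waves."""
    done: set[str] = set()
    remaining = list(steps)
    waves = []
    while remaining:
        wave = [s for s in remaining
                if all(d in done for d in s.get("depends_on", []))]
        if not wave:
            raise ValueError("Cycle or missing dependency in plan")
        waves.append(wave)
        done.update(s["id"] for s in wave)
        remaining = [s for s in remaining if s["id"] not in done]
    return waves
```

Each wave can then be passed to `asyncio.gather`, rather than restricting parallelism to the first level only.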
Dynamic replanning
When observations invalidate the current plan, regenerate the relevant portion:
```python
def should_replan(step_result: str, current_plan: list[str], step_index: int) -> bool:
    """Ask the model whether this result requires changing the remaining plan."""
    remaining = current_plan[step_index + 1:]
    if not remaining:
        return False
    response = llm.chat(
        model="fast",
        messages=[{
            "role": "user",
            "content": f"""Step result: {step_result}

Remaining planned steps:
{chr(10).join(f'{i+1}. {s}' for i, s in enumerate(remaining))}

Does this result make any of the remaining steps unnecessary, impossible, or wrong?
Answer YES or NO, then briefly explain."""
        }]
    )
    return response.text.strip().upper().startswith("YES")
```
```python
def dynamic_plan_execute(goal: str, tools: list[dict]) -> str:
    plan = generate_plan(goal)
    results = {}
    i = 0
    # A while loop, not enumerate(plan): the plan list is reassigned
    # mid-run, and a for loop would keep iterating the old list.
    while i < len(plan):
        results[i] = execute_step(plan[i], tools, context=results)
        if i < len(plan) - 1 and should_replan(results[i], plan, i):
            # Regenerate remaining steps based on what we now know
            plan = plan[:i + 1] + regenerate_remaining(goal, plan[:i + 1], results)
        i += 1
    return synthesise(goal, results)
```
Replanning is expensive: one extra LLM call per step. Use it only when the task is long and the risk of executing an invalidated plan is high.
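`regenerate_remaining` is not shown above; the interesting part is its prompt, which must hand the model both the goal and what has already happened so it plans only the remaining work. A sketch of that prompt construction (the surrounding LLM call would mirror `generate_plan`):

```python
def build_replan_prompt(goal: str, completed: list[str], results: dict) -> str:
    """Prompt asking the model to re-plan only the remaining work."""
    done = "\n".join(
        f"{i + 1}. {step} -> {results[i]}" for i, step in enumerate(completed)
    )
    return (
        f"Goal: {goal}\n\n"
        f"Steps already completed, with their results:\n{done}\n\n"
        "Write the remaining steps needed to finish the goal, as a numbered list. "
        "Do not repeat completed steps."
    )
```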
When not to plan
```python
PLANNING_THRESHOLD = {
    "step_count": 3,           # Don't plan if the task takes fewer than 3 steps
    "task_duration": 30,       # Don't plan if the task takes under 30 seconds anyway
    "known_structure": False,  # Don't plan if you can write the steps in code
}

def needs_planning(task: dict) -> bool:
    if task["estimated_steps"] < PLANNING_THRESHOLD["step_count"]:
        return False
    if task["structure_known"]:  # A fixed chain is more reliable than a plan
        return False
    if task["goal_ambiguous"]:  # Clarify the goal before planning
        return False
    return True
```
Layer 3: Deep Dive
Task decomposition quality
The quality of a plan depends on the decomposition. Bad decompositions produce plans where steps:
- Are too coarse (a single step does too much)
- Are interdependent in ways the model doesn't track
- Include steps that don't contribute to the goal
- Miss prerequisite steps
A simple evaluation of a generated plan:
```python
def evaluate_plan_quality(goal: str, plan: list[str]) -> dict:
    response = llm.chat(
        model="frontier",
        messages=[{
            "role": "user",
            "content": f"""Evaluate this plan for the goal: "{goal}"

Plan:
{chr(10).join(f'{i+1}. {s}' for i, s in enumerate(plan))}

Rate each of these (1-5):
1. Coverage: does the plan cover all aspects of the goal?
2. Atomicity: are steps small enough to execute independently?
3. Ordering: are steps in the right sequence?
4. Redundancy: are any steps unnecessary?

Output as JSON: {{"coverage": N, "atomicity": N, "ordering": N, "redundancy": N, "issues": ["..."]}}"""
        }]
    )
    return parse_json(response.text)
```
Run this during development to identify systematic decomposition failures before deploying.
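Averaging the per-dimension scores across a development set of goals makes systematic weaknesses visible. A small aggregator (field names follow the JSON schema above; the floor of 3 is an arbitrary example threshold):

```python
def summarise_plan_evals(evals: list[dict], floor: float = 3.0) -> dict:
    """Average each 1-5 dimension across evals and flag those below `floor`."""
    dims = ["coverage", "atomicity", "ordering", "redundancy"]
    means = {d: sum(e[d] for e in evals) / len(evals) for d in dims}
    return {"means": means, "weak": [d for d in dims if means[d] < floor]}
```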
Hierarchical planning
For complex tasks, a two-level hierarchy reduces plan length and improves focus:
```
High-level plan: [Research, Draft, Review, Finalise]
                    │
                    └─ Low-level plan for "Research":
                       [Search competitors, Extract pricing, Find feature lists, Summarise findings]
```
The high-level plan stays stable; individual sub-plans can be regenerated if they fail without replanning the entire task.
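The two levels can be wired together with a single expansion hook. In this sketch, `expand` stands in for an LLM call that returns sub-steps for one high-level phase; regenerating a failed sub-plan means calling it again for just that phase:

```python
from typing import Callable

def expand_plan(high_level: list[str],
                expand: Callable[[str], list[str]]) -> dict[str, list[str]]:
    """Expand each high-level phase into its own sub-plan.

    The high-level plan stays untouched; any phase's sub-plan can be
    regenerated independently by calling `expand` for that phase again.
    """
    return {phase: expand(phase) for phase in high_level}
```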
Further reading
- ReAct: Synergizing Reasoning and Acting in Language Models; Yao et al., 2022. Original ReAct paper; the reason-act-observe loop and its evaluation on multi-hop QA and decision tasks.
- Plan-and-Solve Prompting; Wang et al., 2023. Systematic comparison of plan-first vs react-style execution; useful data on when planning improves vs degrades performance.
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models; Yao et al., 2023. Extension of planning to tree-structured exploration; relevant when a task has branching solution paths.