Long-Running Harness
Planning & Coordination -- GAN-Style Three Agents
Separate the generator from the evaluator; fresh context per iteration solves context anxiety
"Separate the generator from the evaluator." -- Adversarial role separation produces higher quality output.
Harness layer: Planning & Coordination -- GAN-inspired three-agent architecture for long-running tasks.
Problem
Agents in long-running tasks suffer from "context anxiety": as context grows, the model gradually loses focus or prematurely declares the work "done." And a single agent can't both generate code and objectively evaluate its quality, just as a writer can't objectively edit their own work.
Solution
          ┌──────────────┐
          │   Planner    │ ← Expand request into detailed spec
          └──────┬───────┘
                 │ spec
          ┌──────▼───────┐
    ┌────►│  Generator   │ ← Implement features incrementally
    │     └──────┬───────┘
    │            │ code
fix │     ┌──────▼───────┐
    │     │  Evaluator   │ ← Test and score using tools
    │     └──────┬───────┘
    │            │
    │    pass? ──┴── fail?
    │    ✅ next     ↺ retry
    └───────────────────┘
Key: Each agent uses a Fresh Context Window — completely independent context.
Core Concepts
GAN-Inspired Triangular Architecture
Inspired by GANs (Generative Adversarial Networks), three agents play adversarial roles:
| Agent | GAN Analogy | Responsibility | Context |
|---|---|---|---|
| Planner | Training Objective | Expand requirements into detailed specs + feature list | Independent |
| Generator | Generator | Incrementally implement code by feature | Independent + Artifact feedback |
| Evaluator | Discriminator | Test using tools, score quality | Independent (no generation bias) |
Key insight: Separating "generation" and "evaluation" into different agents with different contexts is why this architecture outperforms single-agent approaches.
Fresh Context Window
Every agent call starts from empty context. Zero conversation history. This solves "context anxiety": regardless of how long the project runs, each individual agent call is short and focused.
Context transfer happens through structured Artifact files, not conversation history.
Artifact Files (Handoff Documents)
Agents don't communicate directly. They pass context through files in .harness_artifacts/:
.harness_artifacts/
├── plan.md ← Planner output (feature specs)
├── code_feature1.md ← Generator's feature 1 code
├── feedback_feature1.md ← Evaluator's feedback and score
├── code_feature2.md ← Generator's feature 2 code
└── feedback_feature2.md ← Evaluator's feedback and score
Key Code
def long_running_loop(request, max_rounds=3):
    # Phase 1: Planner (fresh context)
    plan = call_agent(PLANNER_SYSTEM, request)
    save_artifact("plan", plan)

    for feature in parse_features(plan):
        for _ in range(max_rounds):
            # Phase 2: Generator (fresh context + artifact feedback)
            feedback = load_artifact(f"feedback_{feature}") or ""
            code = call_agent(
                GENERATOR_SYSTEM,
                f"Feature: {feature}\nFeedback: {feedback}",
                tools=CODING_TOOLS,
            )

            # Phase 3: Evaluator (fresh context, no generation bias)
            eval_result = call_agent(EVALUATOR_SYSTEM, code, tools=TESTING_TOOLS)
            score = parse_score(eval_result)
            if score >= 7:
                break  # Quality met, next feature
            save_artifact(f"feedback_{feature}", eval_result)

def call_agent(system, prompt, tools=None):
    """Fresh Context Window — every call is independent."""
    messages = [{"role": "user", "content": prompt}]  # No history!
    return client.messages.create(
        model=MODEL, system=system, messages=messages,
        tools=tools, max_tokens=8000,
    ).content[0].text
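The `parse_features` and `parse_score` helpers referenced above aren't shown. Here is one plausible implementation, assuming the Planner lists features as Markdown bullets and the Evaluator ends its reply with a line like `Score: 8/10`; both conventions are assumptions, not part of the original excerpt.

```python
import re

def parse_features(plan: str) -> list[str]:
    """Extract feature names from the Planner's Markdown bullet list."""
    return [m.group(1).strip()
            for m in re.finditer(r"^[-*]\s+(.+)$", plan, re.M)]

def parse_score(eval_result: str, default: int = 0) -> int:
    """Pull the first 'Score: N' (optionally 'N/10') from the Evaluator's reply."""
    m = re.search(r"Score:\s*(\d+)", eval_result, re.I)
    return int(m.group(1)) if m else default
```

Returning a default of 0 on a missing score is a deliberately skeptical choice: an Evaluator reply that can't be parsed is treated as a failure, not a pass.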
What's New (s04 → s16)
| Aspect | s04 (Subagents) | s16 (Long-Running) |
|---|---|---|
| Agent count | Main + sub-agents | Three specialized roles |
| Context | Sub-agent isolation | Fresh context every call |
| Quality control | None | Evaluator auto-scoring |
| Handoff | Function return values | Artifact files |
| Use case | One-off subtasks | Multi-hour large projects |
| Design inspiration | Delegation pattern | GAN adversarial training |
Deep Dive: Design Decisions
Q1: Why is Fresh Context Window better than inherited context?
In long-running tasks, inherited context has three problems:
- Context anxiety: When context approaches the limit, the model tends to rush completion
- Attention decay: Early information fades in very long conversations (Lost in the Middle)
- Bias accumulation: Failed attempts stay in context, model may repeat failed approaches
Fresh Context Window solution:
Inherited: [req][attempt1 ❌][fix1][attempt2 ❌][fix2]... → model confused
Fresh: [req][last round feedback] → model sees only the most relevant info
Tradeoff: The model loses "global vision." But Artifact files compensate — the model reads structured documents for necessary cross-step context.
Q2: How should the Evaluator's scoring criteria be calibrated?
Key principle: The Evaluator should be tuned as a "skeptic" — better to wrongly reject than to pass low-quality code. If too strict (Generator never passes), lower the threshold to 6. If too lenient, add specific checklist items: "must have error handling," "must have type hints."
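A calibration like the one described might look like this in practice. The prompt wording, the 1-10 scale, and the checklist items are illustrative assumptions, not a canonical Evaluator prompt:

```python
# Illustrative "skeptic" Evaluator prompt; wording is an assumption.
EVALUATOR_SYSTEM = """You are a skeptical code reviewer. You did NOT write this code.
Run the tests with your tools, then score the code 1-10.

Checklist (each miss costs points):
- All listed features work
- Errors are handled, not swallowed
- Public functions have type hints

When in doubt, score LOWER. End your reply with exactly: Score: N/10"""

PASS_THRESHOLD = 7  # lower to 6 if the Generator never passes

def passed(score: int, threshold: int = PASS_THRESHOLD) -> bool:
    """Gate a feature on the Evaluator's score."""
    return score >= threshold
```

Lowering the threshold and tightening the checklist are independent knobs: the first tunes how often the loop terminates, the second tunes what the Evaluator actually looks at.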
Q3: What if the Generator fails 3 consecutive rounds?
Three strategies by priority: (1) Degrade: Accept best attempt if score ≥ 5. (2) Re-plan: Return to Planner, decompose feature into smaller sub-features. (3) Human intervention: Pause loop, report specific issues to user.
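The three fallback strategies can be sketched as a wrapper around the inner loop. `generate` and `evaluate` are hypothetical stand-ins for the Generator and Evaluator calls, passed in as callables; the thresholds mirror the text (pass at 7, degrade at 5):

```python
def run_feature(feature, generate, evaluate,
                max_rounds=3, pass_at=7, degrade_at=5):
    """Try up to max_rounds; then degrade to the best attempt or escalate."""
    best_score, best_code = -1, None
    for _ in range(max_rounds):
        code = generate(feature)      # Generator (fresh context each call)
        score = evaluate(code)        # Evaluator (fresh context each call)
        if score > best_score:
            best_score, best_code = score, code
        if score >= pass_at:
            return "pass", code       # quality met
    if best_score >= degrade_at:
        return "degraded", best_code  # strategy 1: accept best attempt
    return "escalate", None          # strategy 2/3: re-plan or ask a human
```

Tracking the best attempt across rounds matters: without it, degrading would hand back whatever the last (possibly worst) round produced.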
Q4: How should Artifact file structure be designed?
Best practice: Machine-readable + Human-reviewable. Use Markdown with embedded code blocks. Artifacts serve as both agent input AND human audit points.
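One way to meet both goals is to generate artifacts from a fixed template, so the header stays machine-parseable while the body stays human-readable. The exact layout below is an assumption, chosen so the score line matches what a simple parser would look for:

```python
def render_feedback(feature: str, score: int, issues: list[str]) -> str:
    """Render Evaluator feedback as Markdown: parseable header, readable body."""
    lines = [
        f"# Feedback: {feature}",
        "",
        f"Score: {score}/10",
        "",
        "## Issues",
    ]
    lines += [f"- {issue}" for issue in issues] or ["- none"]
    return "\n".join(lines) + "\n"
```

The same file then serves both consumers: the next Generator round parses the score and issue list, while a human reviewer can skim the Markdown directly.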
Q5: How does this three-agent architecture relate to microservices?
| Microservice Concept | Three-Agent Architecture |
|---|---|
| API Gateway | Planner (routes and decomposes requests) |
| Service Worker | Generator (executes specific tasks) |
| Health Check | Evaluator (quality verification) |
| Service Contract | Artifact files (standardized interface) |
| Retry Policy | max_rounds loop |
| Circuit Breaker | Degradation strategy |
Core parallel: Separation of concerns. Each component does one thing well, communicating through standardized interfaces (Artifacts). Microservice best practices directly apply to multi-agent system design.
Try It
cd learn-claude-code
python agents/s16_long_running_harness.py
Recommended prompts:
"Build a simple calculator with add, subtract, multiply, divide"— watch Planner decompose into features"Create a user registration system with validation"— trigger multi-round Generator-Evaluator loop"Build a markdown parser that handles headings and bold text"— see Evaluator scoring and feedback
References
- Harness Design for Long-Running Applications — Anthropic, Mar 2026. Details the three-agent architecture (Planner/Generator/Evaluator), including "context anxiety" and the Fresh Context Window solution.
- Building Effective Agents — Anthropic, Dec 2025. The Evaluator-Optimizer pattern's original definition; s16 extends this for long-running tasks.
- Context Engineering — Anthropic, Sep 2025. Discusses Context Reset strategy — clearing context and transferring state through structured Artifacts.
- GAN: Generative Adversarial Networks — Goodfellow et al. Academic inspiration for s16's three-agent architecture — adversarial role separation produces higher quality output.