Long-Running Harness
Planning & Coordination -- GAN-Style Three Agents
Separate the generator from the evaluator; fresh context per iteration solves context anxiety
"Separate the generator from the evaluator." -- Adversarial role separation produces higher quality output.
Harness layer: Planning & Coordination -- GAN-inspired three-agent architecture for long-running tasks.
Problem
Agents in long-running tasks suffer from "context anxiety": as context grows, the model gradually loses focus or prematurely declares the work "done." And a single agent can't both generate code and objectively evaluate its quality, just as a writer can't objectively edit their own work.
Solution
          ┌──────────────┐
          │   Planner    │ ← Expand request into detailed spec
          └──────┬───────┘
                 │ spec
          ┌──────▼───────┐
    ┌────►│  Generator   │ ← Implement features incrementally
    │     └──────┬───────┘
    │            │ code
fix │     ┌──────▼───────┐
    │     │  Evaluator   │ ← Test and score using tools
    │     └──────┬───────┘
    │            │
    │    pass? ──┴── fail?
    │    ✅ next     ↺ retry
    └───────────────────┘
Key: Each agent uses a Fresh Context Window — completely independent context.
Core Concepts
GAN-Inspired Triangular Architecture
Inspired by GANs (Generative Adversarial Networks), three agents play adversarial roles:
| Agent | GAN Analogy | Responsibility | Context |
|---|---|---|---|
| Planner | Training Objective | Expand requirements into detailed specs + feature list | Independent |
| Generator | Generator | Incrementally implement code by feature | Independent + Artifact feedback |
| Evaluator | Discriminator | Test using tools, score quality | Independent (no generation bias) |
Key insight: Separating "generation" and "evaluation" into different agents with different contexts is why this architecture outperforms single-agent approaches.
Fresh Context Window
Every agent call starts from empty context. Zero conversation history. This solves "context anxiety": regardless of how long the project runs, each individual agent call is short and focused.
Context transfer happens through structured Artifact files, not conversation history.
Artifact Files (Handoff Documents)
Agents don't communicate directly. They pass context through files in .harness_artifacts/:
.harness_artifacts/
├── plan.md ← Planner output (feature specs)
├── code_feature1.md ← Generator's feature 1 code
├── feedback_feature1.md ← Evaluator's feedback and score
├── code_feature2.md ← Generator's feature 2 code
└── feedback_feature2.md ← Evaluator's feedback and score
Key Code
def long_running_loop(request, max_rounds=3):
    # Phase 1: Planner (fresh context)
    plan = call_agent(PLANNER_SYSTEM, request)
    save_artifact("plan", plan)

    for feature in parse_features(plan):
        for _ in range(max_rounds):
            # Phase 2: Generator (fresh context + artifact feedback)
            feedback = load_artifact(f"feedback_{feature}") or ""
            code = call_agent(
                GENERATOR_SYSTEM,
                f"Feature: {feature}\nFeedback: {feedback}",
                tools=CODING_TOOLS,
            )

            # Phase 3: Evaluator (fresh context, no generation bias)
            eval_result = call_agent(EVALUATOR_SYSTEM, code, tools=TESTING_TOOLS)
            score = parse_score(eval_result)
            if score >= 7:
                break  # Quality met, next feature
            save_artifact(f"feedback_{feature}", eval_result)

def call_agent(system, prompt, tools=None):
    """Fresh Context Window — every call is independent."""
    messages = [{"role": "user", "content": prompt}]  # No history!
    return client.messages.create(
        model=MODEL, system=system, messages=messages,
        tools=tools, max_tokens=8000,
    ).content[0].text
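The `parse_features` and `parse_score` helpers referenced above aren't shown. Here is one plausible implementation, assuming the Planner lists features as Markdown bullets and the Evaluator ends its reply with a line like `Score: 8/10`; both conventions are assumptions, not part of the original excerpt.

```python
import re

def parse_features(plan: str) -> list[str]:
    """Extract feature names from the Planner's Markdown bullet list."""
    return [m.group(1).strip()
            for m in re.finditer(r"^[-*]\s+(.+)$", plan, re.M)]

def parse_score(eval_result: str, default: int = 0) -> int:
    """Pull the first 'Score: N' (optionally 'N/10') from the Evaluator's reply."""
    m = re.search(r"Score:\s*(\d+)", eval_result, re.I)
    return int(m.group(1)) if m else default
```

Returning a default of 0 on a missing score is a deliberately skeptical choice: an Evaluator reply that can't be parsed is treated as a failure, not a pass.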
What's New (s04 → s16)
| Aspect | s04 (Subagents) | s16 (Long-Running) |
|---|---|---|
| Agent count | Main + sub-agents | Three specialized roles |
| Context | Sub-agent isolation | Fresh context every call |
| Quality control | None | Evaluator auto-scoring |
| Handoff | Function return values | Artifact files |
| Use case | One-off subtasks | Multi-hour large projects |
| Design inspiration | Delegation pattern | GAN adversarial training |
Deep Dive: Design Decisions
Q1: Why is Fresh Context Window better than inherited context?
In long-running tasks, inherited context has three problems:
- Context anxiety: When context approaches the limit, the model tends to rush completion
- Attention decay: Early information fades in very long conversations (Lost in the Middle)
- Bias accumulation: Failed attempts stay in context, model may repeat failed approaches
Fresh Context Window solution:
Inherited: [req][attempt1 ❌][fix1][attempt2 ❌][fix2]... → model confused
Fresh: [req][last round feedback] → model sees only the most relevant info
Tradeoff: The model loses "global vision." But Artifact files compensate — the model reads structured documents for necessary cross-step context.
Q2: How should the Evaluator's scoring criteria be calibrated?
Key principle: The Evaluator should be tuned as a "skeptic" — better to wrongly reject than to pass low-quality code. If too strict (Generator never passes), lower the threshold to 6. If too lenient, add specific checklist items: "must have error handling," "must have type hints."
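A calibration like the one described might look like this in practice. The prompt wording, the 1-10 scale, and the checklist items are illustrative assumptions, not a canonical Evaluator prompt:

```python
# Illustrative "skeptic" Evaluator prompt; wording is an assumption.
EVALUATOR_SYSTEM = """You are a skeptical code reviewer. You did NOT write this code.
Run the tests with your tools, then score the code 1-10.

Checklist (each miss costs points):
- All listed features work
- Errors are handled, not swallowed
- Public functions have type hints

When in doubt, score LOWER. End your reply with exactly: Score: N/10"""

PASS_THRESHOLD = 7  # lower to 6 if the Generator never passes

def passed(score: int, threshold: int = PASS_THRESHOLD) -> bool:
    """Gate a feature on the Evaluator's score."""
    return score >= threshold
```

Lowering the threshold and tightening the checklist are independent knobs: the first tunes how often the loop terminates, the second tunes what the Evaluator actually looks at.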
Q3: What if the Generator fails 3 consecutive rounds?
Three strategies by priority: (1) Degrade: Accept best attempt if score ≥ 5. (2) Re-plan: Return to Planner, decompose feature into smaller sub-features. (3) Human intervention: Pause loop, report specific issues to user.
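The three fallback strategies can be sketched as a wrapper around the inner loop. `generate` and `evaluate` are hypothetical stand-ins for the Generator and Evaluator calls, passed in as callables; the thresholds mirror the text (pass at 7, degrade at 5):

```python
def run_feature(feature, generate, evaluate,
                max_rounds=3, pass_at=7, degrade_at=5):
    """Try up to max_rounds; then degrade to the best attempt or escalate."""
    best_score, best_code = -1, None
    for _ in range(max_rounds):
        code = generate(feature)      # Generator (fresh context each call)
        score = evaluate(code)        # Evaluator (fresh context each call)
        if score > best_score:
            best_score, best_code = score, code
        if score >= pass_at:
            return "pass", code       # quality met
    if best_score >= degrade_at:
        return "degraded", best_code  # strategy 1: accept best attempt
    return "escalate", None          # strategy 2/3: re-plan or ask a human
```

Tracking the best attempt across rounds matters: without it, degrading would hand back whatever the last (possibly worst) round produced.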
Q4: How should Artifact file structure be designed?
Best practice: Machine-readable + Human-reviewable. Use Markdown with embedded code blocks. Artifacts serve as both agent input AND human audit points.
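One way to meet both goals is to generate artifacts from a fixed template, so the header stays machine-parseable while the body stays human-readable. The exact layout below is an assumption, chosen so the score line matches what a simple parser would look for:

```python
def render_feedback(feature: str, score: int, issues: list[str]) -> str:
    """Render Evaluator feedback as Markdown: parseable header, readable body."""
    lines = [
        f"# Feedback: {feature}",
        "",
        f"Score: {score}/10",
        "",
        "## Issues",
    ]
    lines += [f"- {issue}" for issue in issues] or ["- none"]
    return "\n".join(lines) + "\n"
```

The same file then serves both consumers: the next Generator round parses the score and issue list, while a human reviewer can skim the Markdown directly.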
Q5: How does this three-agent architecture relate to microservices?
| Microservice Concept | Three-Agent Architecture |
|---|---|
| API Gateway | Planner (routes and decomposes requests) |
| Service Worker | Generator (executes specific tasks) |
| Health Check | Evaluator (quality verification) |
| Service Contract | Artifact files (standardized interface) |
| Retry Policy | max_rounds loop |
| Circuit Breaker | Degradation strategy |
Core parallel: Separation of concerns. Each component does one thing well, communicating through standardized interfaces (Artifacts). Microservice best practices directly apply to multi-agent system design.
Try It
cd learn-claude-code
python agents/s16_long_running_harness.py
Recommended prompts:
"Build a simple calculator with add, subtract, multiply, divide"— watch Planner decompose into features"Create a user registration system with validation"— trigger multi-round Generator-Evaluator loop"Build a markdown parser that handles headings and bold text"— see Evaluator scoring and feedback
References
- Harness Design for Long-Running Applications — Anthropic, Mar 2026. Details the three-agent architecture (Planner/Generator/Evaluator), including "context anxiety" and the Fresh Context Window solution.
- Building Effective Agents — Anthropic, Dec 2025. The Evaluator-Optimizer pattern's original definition; s16 extends this for long-running tasks.
- Context Engineering — Anthropic, Sep 2025. Discusses Context Reset strategy — clearing context and transferring state through structured Artifacts.
- GAN: Generative Adversarial Networks — Goodfellow et al. Academic inspiration for s16's three-agent architecture — adversarial role separation produces higher quality output.