Learn Claude Code
s16

Long-Running Harness

Planning & Coordination

GAN-Style Three Agents

267 LOC · 5 tools · Planner/Generator/Evaluator with Fresh Context Window
Separate the generator from the evaluator; fresh context per iteration solves context anxiety

s13 > s14 > s15 > [ s16 ] s17 > s18 > s19 > s20 > s21

"Separate the generator from the evaluator." -- Adversarial role separation produces higher quality output.

Harness layer: Planning & Coordination -- GAN-inspired three-agent architecture for long-running tasks.

Problem

Agents in long-running tasks suffer from "context anxiety": as context grows, the model gradually loses focus or prematurely declares "done." A single agent also can't simultaneously generate code and objectively evaluate its quality — just as a writer can't objectively edit their own work.

Solution

                    ┌──────────────┐
                    │   Planner    │  ← Expand request into detailed spec
                    └──────┬───────┘
                           │ spec
                    ┌──────▼───────┐
             ┌─────►│  Generator   │  ← Implement features incrementally
             │      └──────┬───────┘
             │             │ code
         fix │      ┌──────▼───────┐
             │      │  Evaluator   │  ← Test and score using tools
             │      └──────┬───────┘
             │             │
             │    pass?  ──┴── fail?
             │    ✅ next     ↺ retry
             └────────────────┘

Key: Each agent uses a Fresh Context Window — completely independent context.

Core Concepts

GAN-Inspired Triangular Architecture

Inspired by GANs (Generative Adversarial Networks), three agents play adversarial roles:

| Agent | GAN Analogy | Responsibility | Context |
|---|---|---|---|
| Planner | Training Objective | Expand requirements into detailed specs + feature list | Independent |
| Generator | Generator | Incrementally implement code by feature | Independent + Artifact feedback |
| Evaluator | Discriminator | Test using tools, score quality | Independent (no generation bias) |

Key insight: Separating "generation" and "evaluation" into different agents with different contexts is why this architecture outperforms single-agent approaches.

Fresh Context Window

Every agent call starts from empty context. Zero conversation history. This solves "context anxiety": regardless of how long the project runs, each individual agent call is short and focused.

Context transfer happens through structured Artifact files, not conversation history.

Artifact Files (Handoff Documents)

Agents don't communicate directly. They pass context through files in .harness_artifacts/:

.harness_artifacts/
  ├── plan.md              ← Planner output (feature specs)
  ├── code_feature1.md     ← Generator's feature 1 code
  ├── feedback_feature1.md ← Evaluator's feedback and score
  ├── code_feature2.md     ← Generator's feature 2 code
  └── feedback_feature2.md ← Evaluator's feedback and score

Key Code

def long_running_loop(request, max_rounds=3):
    # Phase 1: Planner (fresh context)
    plan = call_agent(PLANNER_SYSTEM, request)
    save_artifact("plan", plan)

    for feature in parse_features(plan):
        for attempt in range(max_rounds):
            # Phase 2: Generator (fresh context + artifact feedback)
            feedback = load_artifact(f"feedback_{feature}") or ""
            code = call_agent(
                GENERATOR_SYSTEM,
                f"Feature: {feature}\nFeedback: {feedback}",
                tools=CODING_TOOLS,
            )
            save_artifact(f"code_{feature}", code)
            # Phase 3: Evaluator (fresh context, no generation bias)
            eval_result = call_agent(EVALUATOR_SYSTEM, code, tools=TESTING_TOOLS)
            score = parse_score(eval_result)
            if score >= 7:
                break  # Quality met, move to next feature
            save_artifact(f"feedback_{feature}", eval_result)

def call_agent(system, prompt, tools=None):
    """Fresh Context Window — every call is independent."""
    messages = [{"role": "user", "content": prompt}]  # No history!
    return client.messages.create(
        model=MODEL, system=system, messages=messages,
        tools=tools or [], max_tokens=8000,
    ).content[0].text  # simplified: assumes the first content block is text

What's New (s04 → s16)

| Aspect | s04 (Subagents) | s16 (Long-Running) |
|---|---|---|
| Agent count | Main + sub-agents | Three specialized roles |
| Context | Sub-agent isolation | Fresh context every call |
| Quality control | None | Evaluator auto-scoring |
| Handoff | Function return values | Artifact files |
| Use case | One-off subtasks | Multi-hour large projects |
| Design inspiration | Delegation pattern | GAN adversarial training |

Deep Dive: Design Decisions

Q1: Why is Fresh Context Window better than inherited context?

In long-running tasks, inherited context has three problems:

  1. Context anxiety: When context approaches the limit, the model tends to rush completion
  2. Attention decay: Early information fades in very long conversations (Lost in the Middle)
  3. Bias accumulation: Failed attempts stay in context, model may repeat failed approaches

Fresh Context Window solution:

Inherited: [req][attempt1 ❌][fix1][attempt2 ❌][fix2]... → model confused
Fresh:     [req][last round feedback] → model sees only the most relevant info

Tradeoff: The model loses "global vision." But Artifact files compensate — the model reads structured documents for necessary cross-step context.

Q2: How should the Evaluator's scoring criteria be calibrated?

Key principle: The Evaluator should be tuned as a "skeptic" — better to wrongly reject than to pass low-quality code. If too strict (Generator never passes), lower the threshold to 6. If too lenient, add specific checklist items: "must have error handling," "must have type hints."
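One way to encode this "skeptic" tuning is directly in the Evaluator's system prompt. A hypothetical `EVALUATOR_SYSTEM` (checklist items and the 1-10 scale are illustrative, not from the reference implementation):

```python
# Hypothetical Evaluator system prompt -- the checklist and scoring
# scale shown here are illustrative assumptions.
EVALUATOR_SYSTEM = """You are a skeptical code reviewer. You did NOT write this code.

Run the tests with your tools, then score the code 1-10. Deduct points unless:
- all tests pass
- errors are handled explicitly (no bare except)
- public functions have type hints and docstrings

Report your findings, then end with a single line: Score: <n>/10
When in doubt, score lower -- a wrong rejection costs one retry,
while a wrong pass ships broken code."""
```

Tightening or loosening the checklist is then a one-line prompt edit rather than a code change.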

Q3: What if the Generator fails 3 consecutive rounds?

Three strategies by priority: (1) Degrade: Accept best attempt if score ≥ 5. (2) Re-plan: Return to Planner, decompose feature into smaller sub-features. (3) Human intervention: Pause loop, report specific issues to user.
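The three strategies can be ordered into a single fallback function. A sketch — the `sub:` prefix (marking a feature that was already decomposed once), the threshold of 5, and the return codes are all hypothetical:

```python
def handle_exhausted_rounds(feature: str, best_score: int) -> str:
    """Pick a fallback after max_rounds failed Generator-Evaluator iterations.
    Thresholds, the 'sub:' marker, and return codes are illustrative."""
    if best_score >= 5:
        return "degrade"    # 1. accept the best attempt's code as-is
    if not feature.startswith("sub:"):
        return "replan"     # 2. send back to Planner to split into sub-features
    return "escalate"       # 3. already decomposed once: pause and ask the user
```

Guarding re-planning behind the "not already a sub-feature" check prevents an infinite decomposition loop.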

Q4: How should Artifact file structure be designed?

Best practice: Machine-readable + Human-reviewable. Use Markdown with embedded code blocks. Artifacts serve as both agent input AND human audit points.
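A feedback artifact following this convention might look like the following (contents illustrative):

    # Feedback: feature1 (divide operation)

    Score: 6/10

    ## Issues
    - divide() does not handle division by zero
    - public functions lack type hints

    ## Required fixes
    Raise ValueError on a zero divisor before performing the division.

The Markdown headings give the next Generator call structured context, while a human reviewer can skim the same file to audit the loop's progress.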

Q5: How does this three-agent architecture relate to microservices?

| Microservice Concept | Three-Agent Architecture |
|---|---|
| API Gateway | Planner (routes and decomposes requests) |
| Service Worker | Generator (executes specific tasks) |
| Health Check | Evaluator (quality verification) |
| Service Contract | Artifact files (standardized interface) |
| Retry Policy | max_rounds loop |
| Circuit Breaker | Degradation strategy |

Core parallel: Separation of concerns. Each component does one thing well, communicating through standardized interfaces (Artifacts). Microservice best practices directly apply to multi-agent system design.

Try It

cd learn-claude-code
python agents/s16_long_running_harness.py

Recommended prompts:

  • "Build a simple calculator with add, subtract, multiply, divide" — watch Planner decompose into features
  • "Create a user registration system with validation" — trigger multi-round Generator-Evaluator loop
  • "Build a markdown parser that handles headings and bold text" — see Evaluator scoring and feedback

References

  • Harness Design for Long-Running Applications — Anthropic, Mar 2026. Details the three-agent architecture (Planner/Generator/Evaluator), including "context anxiety" and the Fresh Context Window solution.
  • Building Effective Agents — Anthropic, Dec 2025. The Evaluator-Optimizer pattern's original definition; s16 extends this for long-running tasks.
  • Context Engineering — Anthropic, Sep 2025. Discusses Context Reset strategy — clearing context and transferring state through structured Artifacts.
  • GAN: Generative Adversarial Networks — Goodfellow et al. Academic inspiration for s16's three-agent architecture — adversarial role separation produces higher quality output.