Learn Claude Code
s13

Agent Evals

Observability & Evaluation

TDAD + LLM-as-Judge

256 LOC · 5 tools · run_tests + evaluate tools with structured results


"An agent that can't verify its own work is just guessing." -- Verifiable output is the key to agent success.

Harness layer: Observability & Evaluation -- How agents know they got it right.

Problem

An agent writes code, but how does it know the code is correct? Previously we relied on human review, but this is slow and doesn't scale. Agents can verify their own work: write tests, run tests, analyze failures, fix bugs, retry. The key is giving the agent a programmable feedback loop rather than relying on the model's "gut feeling."

Solution

            ┌─────────────┐
            │  📋 Request  │  ← User states requirement
            └──────┬──────┘
                   │
            ┌──────▼──────┐
            │ 📝 Write    │  ← TDAD: Write tests first, define "correct"
            │    Tests    │
            └──────┬──────┘
                   │
            ┌──────▼──────┐
            │ 💻 Implement │  ← Agent writes code to satisfy tests
            │    Code     │
            └──────┬──────┘
                   │
            ┌──────▼──────┐
      ┌────►│ ▶ Run Tests │  ← Run tests, get structured feedback
      │     └──────┬──────┘
      │            │
      │    pass? ──┴── fail?
      │    ✅           ↓
      │          ┌──────────┐
      └──────────│ 🔧 Fix   │  ← Fix based on test output (not guessing)
                 │ & Retry  │
                 └──────────┘
                      │
               All pass ↓
            ┌─────────────┐
            │ 🧑‍⚖️ LLM Judge │  ← Fresh context, unbiased quality eval
            └─────────────┘

Two complementary evaluation modes:

| Mode | Principle | Best For |
|---|---|---|
| TDAD (Test-Driven Agentic Development) | Agent writes tests first; tests are ground truth | Quantifiable correctness (functions, APIs, data) |
| LLM-as-Judge | Independent LLM call evaluates another LLM's output | Subjective quality (style, architecture, readability) |

How It Works

Step 1: Write Tests (TDAD Core)

The agent generates test cases from the requirement, defining what "correct" means — this is the fundamental difference from traditional coding agents:

# Agent-generated test file: test_solution.py
import pytest
from solution import add  # module name assumed; the implementation under test
def test_basic_addition():
    assert add(2, 3) == 5

def test_negative_numbers():
    assert add(-1, -2) == -3

def test_type_error():
    with pytest.raises(TypeError):
        add("a", 1)

Step 2: Implement Code

The agent writes code to pass all tests:

def add(a: float, b: float) -> float:
    if not isinstance(a, (int, float)) or not isinstance(b, (int, float)):
        raise TypeError("Arguments must be numbers")
    return a + b

Step 3: Run Tests → Get Structured Feedback

The run_tests tool returns structured JSON, not raw text logs:

import json
import subprocess

def run_tests(test_file: str) -> str:
    result = subprocess.run(
        ["python", "-m", "pytest", test_file, "-v", "--tb=short"],
        capture_output=True, text=True
    )
    passed = result.returncode == 0
    return json.dumps({
        "passed": passed,
        "exit_code": result.returncode,
        "output": result.stdout[-2000:],  # Truncate to prevent context explosion
        "summary": f"{'✅ ALL TESTS PASSED' if passed else '❌ SOME TESTS FAILED'}"
    })

Step 4: LLM-as-Judge (Independent Context Evaluation)

Key design: The Judge uses a completely fresh context window with zero conversation history:

def evaluate_code(code: str, criteria: str) -> str:
    """Independent LLM call to evaluate code quality — no generation bias.

    Returns JSON like: {"score": 8, "feedback": "Clean and well-structured", "issues": [...]}
    """
    response = client.messages.create(
        model=MODEL,
        max_tokens=1000,
        system="You are a strict, impartial code reviewer. Score 0-10.",
        messages=[{
            "role": "user",
            "content": f"Criteria: {criteria}\n\nCode:\n{code}"
        }],
    )
    return response.content[0].text

Key Code

# Complete TDAD loop — runs inside the agent loop
import json
import subprocess

def execute_tool(name, args):
    if name == "run_tests":
        result = subprocess.run(
            ["python", "-m", "pytest", args["test_file"], "-v", "--tb=short"],
            capture_output=True, text=True
        )
        passed = result.returncode == 0
        return json.dumps({
            "passed": passed,
            "output": result.stdout[-2000:],
            "summary": f"{'✅ ALL TESTS PASSED' if passed else '❌ SOME TESTS FAILED'}"
        })

    if name == "evaluate":
        # Fresh context window — avoids self-evaluation bias
        response = client.messages.create(
            model=MODEL,
            system="You are a strict code reviewer. Score 0-10.",
            messages=[{"role": "user", "content": args["code"]}],
            max_tokens=1000,
        )
        return response.content[0].text

New Tools

| Tool | Purpose | Output Format |
|---|---|---|
| run_tests | Run pytest test suite | {"passed": bool, "output": "...", "summary": "..."} |
| evaluate | LLM-as-Judge evaluation | {"score": 0-10, "feedback": "...", "issues": [...]} |
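For context, the two tools could be declared to the Claude API with input schemas along these lines. This is a sketch: the descriptions and schema details here are assumptions, not taken from the s13 source.

```python
# Hypothetical tool declarations for run_tests and evaluate (details assumed).
TOOLS = [
    {
        "name": "run_tests",
        "description": "Run a pytest test file and return structured JSON results.",
        "input_schema": {
            "type": "object",
            "properties": {
                "test_file": {"type": "string", "description": "Path to the pytest file"},
            },
            "required": ["test_file"],
        },
    },
    {
        "name": "evaluate",
        "description": "Have an independent LLM judge score code quality 0-10.",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Source code to review"},
            },
            "required": ["code"],
        },
    },
]
```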

What's New (s02 → s13)

| Aspect | s02 (Tools) | s13 (Evals) |
|---|---|---|
| Quality assurance | None — relies on human review | Agent self-verification (TDAD + LLM-as-Judge) |
| Test execution | None | run_tests tool + structured results |
| Evaluation | None | Independent LLM Judge (separating generation and evaluation) |
| Dev workflow | Write code → done | Write tests → code → run → fix → retry |
| Feedback quality | No feedback | Structured JSON (passed/failed + detailed output) |
| Iteration ability | One-shot | Auto-repair loop (up to N rounds) |

Deep Dive: Design Decisions

Q1: How does TDAD differ from traditional TDD? Why is it especially effective for agents?

In traditional TDD, humans write the tests and humans write the code. In TDAD, agents write the tests and agents write the code — but the key difference is the degree of feedback-loop automation:

| | TDD (Human) | TDAD (Agent) |
|---|---|---|
| Who writes tests | Human developer | Agent (from requirement description) |
| Who runs tests | CI / manually | Agent automatically calls run_tests |
| Who analyzes failures | Human reads logs | Agent parses structured JSON |
| Who fixes | Human | Agent, based on precise failure info |
| Loop speed | Minutes | Seconds |

TDAD is especially effective for agents because agents are bad at self-evaluation: asking an agent "is my code good?" almost always yields an optimistic answer, while "did the tests pass?" is an objective boolean judgment — no ambiguity.

Q2: Is LLM-as-Judge reliable? Won't it just praise its own work?

This is extensively discussed in Anthropic's 2026 evaluation blogs. Three key design choices prevent bias:

  1. Independent context window: The Judge API call has zero conversation history — it doesn't know who wrote the code
  2. Role separation: The system prompt explicitly requires "you are a strict code reviewer," not "help the user"
  3. Structured rubric: Don't ask "is it good?" — score specific dimensions (readability, error handling, performance)

# ❌ Unreliable: self-evaluation in the same context
messages.append({"role": "user", "content": "Rate your code 0-10"})

# ✅ Reliable: independent context + strict role + specific rubric
client.messages.create(
    system="You are a strict reviewer. Score each dimension 0-10.",
    messages=[{"role": "user", "content": f"Review: {code}"}],
)

Best practice: LLM Judges should be periodically calibrated against human annotations. Anthropic recommends using 10-20 human-scored samples to validate Judge accuracy.
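A calibration pass can be as simple as comparing Judge scores with human scores on a small labeled set. A minimal sketch, where the sample scores are made up for illustration:

```python
# Compare LLM Judge scores against human scores on a small calibration set.
# These scores are illustrative placeholders, not real data.
human_scores = [8, 6, 9, 4, 7, 5, 8, 3, 6, 7]   # 10 human-annotated samples
judge_scores = [7, 6, 8, 5, 7, 4, 9, 3, 6, 8]   # Judge's scores on the same samples

# Mean absolute error: how far the Judge drifts from human judgment on average
mae = sum(abs(h - j) for h, j in zip(human_scores, judge_scores)) / len(human_scores)

# Agreement within 1 point: a coarse but useful calibration signal
within_one = sum(abs(h - j) <= 1 for h, j in zip(human_scores, judge_scores)) / len(human_scores)

print(f"MAE: {mae:.2f}, within 1 point: {within_one:.0%}")
# If MAE is high or agreement is low, revise the Judge's rubric and re-check.
```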

Q3: Why does run_tests truncate output to 2000 characters? Won't it lose critical info?

This is a practical application of Context Engineering (s15). Test output often contains massive redundant information (stack traces from library internals, repeated fixture setup logs). Risks of not truncating:

  • A single test failure can produce 50,000 tokens of output
  • Multiple failures accumulate and can cause context window explosion
  • The agent actually gets lost in noise, missing the real error message

The truncation strategy [-2000:] keeps the last 2000 characters, which typically contain:

  • pytest's summary line (3 passed, 2 failed)
  • the last failing assertion and error message

This is exactly the information the agent needs to fix the code.
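The effect is easy to verify: simulate a long, noisy pytest log and confirm the tail slice still carries the final summary line (the log content below is fabricated for illustration):

```python
# Simulated pytest output: lots of traceback noise, then the lines that matter.
noise = '  File "/site-packages/somelib/internal.py", line 42, in _helper\n' * 500
output = noise + (
    "E   AssertionError: assert add(-1, -2) == -3\n"
    "========================= 3 passed, 2 failed =========================\n"
)

truncated = output[-2000:]  # Same strategy as run_tests

# The bulk of the noise is gone, but the assertion and summary survive in the tail.
assert "3 passed, 2 failed" in truncated
assert "AssertionError" in truncated
assert len(truncated) <= 2000
```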

Q4: Can the fix loop run forever? When should it give up?

Three stopping conditions prevent infinite loops:

import json

MAX_RETRIES = 3  # Hard upper limit

previous_output = None
for attempt in range(MAX_RETRIES):
    result = json.loads(run_tests(test_file))  # run_tests returns a JSON string
    if result["passed"]:
        break  # Condition 1: All tests pass

    if result["output"] == previous_output:
        break  # Condition 2: Same error repeated — agent is stuck

    previous_output = result["output"]

# Condition 3: Max retries reached — report to user
if not result["passed"]:
    return f"Unable to fix after {MAX_RETRIES} attempts. Details: ..."

Anthropic's recommendation: Small samples suffice — you don't need 100 retries. Research shows agents produce the largest improvements in the first 2-3 iterations ("large effect size, small sample size is fine"), after which returns diminish.

Q5: How should Code Graders and LLM Graders be combined?

Anthropic's 2026 evaluation blogs define three Grader types, recommending combining them:

| Grader Type | Strength | Weakness | Example |
|---|---|---|---|
| Code Grader | Fast, objective, zero cost | Can only check quantifiable metrics | Unit tests, type checks, lint |
| LLM Grader | Flexible, evaluates subjective quality | Has bias, has cost | Code style, architecture review |
| Human Grader | Highest quality standard | Slow, doesn't scale | Calibrating LLM Grader, spot-checks |

Best practice combination in s13:

run_tests (Code Grader)  →  Fast, objective verification of functional correctness
      ↓ After passing
evaluate (LLM Grader)    →  Evaluate code quality, readability, best practices
      ↓ If score < threshold
Human Review              →  Periodic sampling to calibrate LLM Judge accuracy
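Wired together, this pipeline might look like the sketch below. The judge call is stubbed out (`llm_judge_score` is a hypothetical placeholder for the evaluate tool, and `QUALITY_THRESHOLD` is an assumed cutoff), so only the control flow is real:

```python
import json

QUALITY_THRESHOLD = 7  # Hypothetical cutoff; tune per project

def llm_judge_score(code: str) -> int:
    """Placeholder for the evaluate tool (a real LLM call in s13)."""
    return 8

def grade(code: str, tests_passed: bool) -> dict:
    # Code Grader first: cheap and objective; failing tests short-circuit everything
    if not tests_passed:
        return {"verdict": "fail", "reason": "tests failed"}

    # LLM Grader second: subjective quality, only for functionally correct code
    score = llm_judge_score(code)
    if score < QUALITY_THRESHOLD:
        return {"verdict": "needs_human_review", "score": score}

    return {"verdict": "pass", "score": score}

print(json.dumps(grade("def add(a, b): return a + b", tests_passed=True)))
```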

Try It

cd learn-claude-code
python agents/s13_evals.py

Recommended prompts:

  • "Write a function that converts Celsius to Fahrenheit with proper error handling" — watch the TDAD loop
  • "Create a simple stack data structure with push, pop, and peek" — triggers LLM-as-Judge evaluation
  • "Build a function that validates email addresses" — see how the agent handles test failures and fixes
  • "Implement a binary search function" — multi-round fix-retry loop with complex logic

References

  • Demystifying evals for AI agents — Anthropic, Jan 2026. Defines the Capability vs Regression Evals framework, three Grader types (Code/LLM/Human), and the core principle "Grade the outcome, not the path."
  • Building AI-Resistant Evaluations — Anthropic, Jan 2026. Discusses how to design evaluation tasks that agents can't easily "game," including balanced datasets and avoiding ambiguous success criteria.
  • Eval Awareness in Claude — Anthropic, Mar 2026. Studies whether models can detect that they're being evaluated, and implications for evaluation design.
  • SWE-bench & Terminal-Bench — Anthropic. Discusses how infrastructure noise (CPU/RAM limits, tool availability) affects agent coding benchmark reliability.
  • Building Effective Agents — Anthropic, Dec 2025. The Evaluator-Optimizer workflow pattern's original definition; TDAD in s13 is a concrete implementation of this pattern.