Learn Claude Code
s13

Agent Evals

Observability & Evaluation

TDAD + LLM-as-Judge

256 LOC · 5 tools · run_tests + evaluate tools with structured results


"An agent that can't verify its own work is just guessing." -- Verifiable output is the key to agent success.

Harness layer: Observability & Evaluation -- How agents know they got it right.

Problem

An agent writes code, but how does it know the code is correct? Previously we relied on human review, but this is slow and doesn't scale. Agents can verify their own work: write tests, run tests, analyze failures, fix bugs, retry. The key is giving the agent a programmable feedback loop rather than relying on the model's "gut feeling."

Solution

            ┌─────────────┐
            │  📋 Request  │  ← User states requirement
            └──────┬──────┘
                   │
            ┌──────▼──────┐
            │ 📝 Write    │  ← TDAD: Write tests first, define "correct"
            │    Tests    │
            └──────┬──────┘
                   │
            ┌──────▼──────┐
            │ 💻 Implement │  ← Agent writes code to satisfy tests
            │    Code     │
            └──────┬──────┘
                   │
            ┌──────▼──────┐
      ┌────►│ ▶ Run Tests │  ← Run tests, get structured feedback
      │     └──────┬──────┘
      │            │
      │    pass? ──┴── fail?
      │    ✅           ↓
      │          ┌──────────┐
      └──────────│ 🔧 Fix   │  ← Fix based on test output (not guessing)
                 │ & Retry  │
                 └──────────┘
                      │
               All pass ↓
            ┌─────────────┐
            │ 🧑‍⚖️ LLM Judge │  ← Fresh context, unbiased quality eval
            └─────────────┘

Two complementary evaluation modes:

| Mode | Principle | Best For |
|---|---|---|
| TDAD (Test-Driven Agentic Development) | Agent writes tests first; tests are ground truth | Quantifiable correctness (functions, APIs, data) |
| LLM-as-Judge | Independent LLM call evaluates another LLM's output | Subjective quality (style, architecture, readability) |

How It Works

Step 1: Write Tests (TDAD Core)

The agent generates test cases from the requirement, defining what "correct" means — this is the fundamental difference from traditional coding agents:

# Agent-generated test file: test_solution.py
import pytest
from solution import add  # module name assumed; the implementation under test
def test_basic_addition():
    assert add(2, 3) == 5

def test_negative_numbers():
    assert add(-1, -2) == -3

def test_type_error():
    with pytest.raises(TypeError):
        add("a", 1)

Step 2: Implement Code

The agent writes code to pass all tests:

def add(a: float, b: float) -> float:
    if not isinstance(a, (int, float)) or not isinstance(b, (int, float)):
        raise TypeError("Arguments must be numbers")
    return a + b

Step 3: Run Tests → Get Structured Feedback

The run_tests tool returns structured JSON, not raw text logs:

import json
import subprocess

def run_tests(test_file: str) -> str:
    result = subprocess.run(
        ["python", "-m", "pytest", test_file, "-v", "--tb=short"],
        capture_output=True, text=True
    )
    passed = result.returncode == 0
    return json.dumps({
        "passed": passed,
        "exit_code": result.returncode,
        "output": result.stdout[-2000:],  # Truncate to prevent context explosion
        "summary": f"{'✅ ALL TESTS PASSED' if passed else '❌ SOME TESTS FAILED'}"
    })

Step 4: LLM-as-Judge (Independent Context Evaluation)

Key design: The Judge uses a completely fresh context window with zero conversation history:

def evaluate_code(code: str, criteria: str) -> str:
    """Independent LLM call to evaluate code quality — no generation bias.

    Returns JSON like: {"score": 8, "feedback": "Clean and well-structured", "issues": [...]}
    """
    response = client.messages.create(
        model=MODEL,
        max_tokens=1000,
        system="You are a strict, impartial code reviewer. Score 0-10.",
        messages=[{
            "role": "user",
            "content": f"Criteria: {criteria}\n\nCode:\n{code}"
        }],
    )
    return response.content[0].text

Key Code

# Complete TDAD loop — runs inside the agent loop
import json
import subprocess

def execute_tool(name, args):
    if name == "run_tests":
        result = subprocess.run(
            ["python", "-m", "pytest", args["test_file"], "-v", "--tb=short"],
            capture_output=True, text=True
        )
        passed = result.returncode == 0
        return json.dumps({
            "passed": passed,
            "output": result.stdout[-2000:],
            "summary": f"{'✅ ALL TESTS PASSED' if passed else '❌ SOME TESTS FAILED'}"
        })

    if name == "evaluate":
        # Fresh context window — avoids self-evaluation bias
        response = client.messages.create(
            model=MODEL,
            system="You are a strict code reviewer. Score 0-10.",
            messages=[{"role": "user", "content": args["code"]}],
            max_tokens=1000,
        )
        return response.content[0].text

New Tools

| Tool | Purpose | Output Format |
|---|---|---|
| run_tests | Run pytest test suite | {"passed": bool, "output": "...", "summary": "..."} |
| evaluate | LLM-as-Judge evaluation | {"score": 0-10, "feedback": "...", "issues": [...]} |
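For context, the two tools could be declared to the Claude API with input schemas along these lines. This is a sketch: the descriptions and schema details here are assumptions, not taken from the s13 source.

```python
# Hypothetical tool declarations for run_tests and evaluate (details assumed).
TOOLS = [
    {
        "name": "run_tests",
        "description": "Run a pytest test file and return structured JSON results.",
        "input_schema": {
            "type": "object",
            "properties": {
                "test_file": {"type": "string", "description": "Path to the pytest file"},
            },
            "required": ["test_file"],
        },
    },
    {
        "name": "evaluate",
        "description": "Have an independent LLM judge score code quality 0-10.",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Source code to review"},
            },
            "required": ["code"],
        },
    },
]
```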

What's New (s02 → s13)

| Aspect | s02 (Tools) | s13 (Evals) |
|---|---|---|
| Quality assurance | None — relies on human review | Agent self-verification (TDAD + LLM-as-Judge) |
| Test execution | None | run_tests tool + structured results |
| Evaluation | None | Independent LLM Judge (separating generation and evaluation) |
| Dev workflow | Write code → done | Write tests → code → run → fix → retry |
| Feedback quality | No feedback | Structured JSON (passed/failed + detailed output) |
| Iteration ability | One-shot | Auto-repair loop (up to N rounds) |

Deep Dive: Design Decisions

Q1: How does TDAD differ from traditional TDD? Why is it especially effective for agents?

In traditional TDD, humans write the tests and humans write the code. In TDAD, agents write the tests and agents write the code — but the key difference is the degree of feedback-loop automation:

| | TDD (Human) | TDAD (Agent) |
|---|---|---|
| Who writes tests | Human developer | Agent (from requirement description) |
| Who runs tests | CI / manually | Agent automatically calls run_tests |
| Who analyzes failures | Human reads logs | Agent parses structured JSON |
| Who fixes | Human | Agent, based on precise failure info |
| Loop speed | Minutes | Seconds |

TDAD is especially effective for agents because agents are bad at self-evaluation: asking an agent "is my code good?" almost always yields an optimistic answer, while "did the tests pass?" is an objective boolean judgment — no ambiguity.

Q2: Is LLM-as-Judge reliable? Won't it just praise its own work?

This is extensively discussed in Anthropic's 2026 evaluation blogs. Three key design choices prevent bias:

  1. Independent context window: The Judge API call has zero conversation history — it doesn't know who wrote the code
  2. Role separation: The system prompt explicitly requires "you are a strict code reviewer," not "help the user"
  3. Structured rubric: Don't ask "is it good?" — score specific dimensions (readability, error handling, performance)

# ❌ Unreliable: self-evaluation in the same context
messages.append({"role": "user", "content": "Rate your code 0-10"})

# ✅ Reliable: independent context + strict role + specific rubric
client.messages.create(
    system="You are a strict reviewer. Score each dimension 0-10.",
    messages=[{"role": "user", "content": f"Review: {code}"}],
)

Best practice: LLM Judges should be periodically calibrated against human annotations. Anthropic recommends using 10-20 human-scored samples to validate Judge accuracy.
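A calibration pass can be as simple as comparing Judge scores with human scores on a small labeled set. A minimal sketch, where the sample scores are made up for illustration:

```python
# Compare LLM Judge scores against human scores on a small calibration set.
# These scores are illustrative placeholders, not real data.
human_scores = [8, 6, 9, 4, 7, 5, 8, 3, 6, 7]   # 10 human-annotated samples
judge_scores = [7, 6, 8, 5, 7, 4, 9, 3, 6, 8]   # Judge's scores on the same samples

# Mean absolute error: how far the Judge drifts from human judgment on average
mae = sum(abs(h - j) for h, j in zip(human_scores, judge_scores)) / len(human_scores)

# Agreement within 1 point: a coarse but useful calibration signal
within_one = sum(abs(h - j) <= 1 for h, j in zip(human_scores, judge_scores)) / len(human_scores)

print(f"MAE: {mae:.2f}, within 1 point: {within_one:.0%}")
# If MAE is high or agreement is low, revise the Judge's rubric and re-check.
```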

Q3: Why does run_tests truncate output to 2000 characters? Won't it lose critical info?

This is a practical application of Context Engineering (s15). Test output often contains massive redundant information (stack traces from library internals, repeated fixture setup logs). Risks of not truncating:

  • A single test failure can produce 50,000 tokens of output
  • Multiple failures accumulate and can cause context window explosion
  • The agent actually gets lost in noise, missing the real error message

The truncation strategy [-2000:] keeps the last 2000 characters, which typically contain:

  • pytest's summary line (3 passed, 2 failed)
  • the last failing assertion and error message

This is exactly the information the agent needs to fix the code.
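The effect is easy to verify: simulate a long, noisy pytest log and confirm the tail slice still carries the final summary line (the log content below is fabricated for illustration):

```python
# Simulated pytest output: lots of traceback noise, then the lines that matter.
noise = '  File "/site-packages/somelib/internal.py", line 42, in _helper\n' * 500
output = noise + (
    "E   AssertionError: assert add(-1, -2) == -3\n"
    "========================= 3 passed, 2 failed =========================\n"
)

truncated = output[-2000:]  # Same strategy as run_tests

# The bulk of the noise is gone, but the assertion and summary survive in the tail.
assert "3 passed, 2 failed" in truncated
assert "AssertionError" in truncated
assert len(truncated) <= 2000
```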

Q4: Can the fix loop run forever? When should it give up?

Three stopping conditions prevent infinite loops:

import json

MAX_RETRIES = 3  # Hard upper limit

previous_output = None
for attempt in range(MAX_RETRIES):
    result = json.loads(run_tests(test_file))  # run_tests returns a JSON string
    if result["passed"]:
        break  # Condition 1: All tests pass

    if result["output"] == previous_output:
        break  # Condition 2: Same error repeated — agent is stuck

    previous_output = result["output"]

# Condition 3: Max retries reached — report to user
if not result["passed"]:
    return f"Unable to fix after {MAX_RETRIES} attempts. Details: ..."

Anthropic's recommendation: Small samples suffice — you don't need 100 retries. Research shows agents produce the largest improvements in the first 2-3 iterations ("large effect size, small sample size is fine"), after which returns diminish.

Q5: How should Code Graders and LLM Graders be combined?

Anthropic's 2026 evaluation blogs define three Grader types, recommending combining them:

| Grader Type | Strength | Weakness | Example |
|---|---|---|---|
| Code Grader | Fast, objective, zero cost | Can only check quantifiable metrics | Unit tests, type checks, lint |
| LLM Grader | Flexible, evaluates subjective quality | Has bias, has cost | Code style, architecture review |
| Human Grader | Highest quality standard | Slow, doesn't scale | Calibrating LLM Grader, spot-checks |

Best practice combination in s13:

run_tests (Code Grader)  →  Fast, objective verification of functional correctness
      ↓ After passing
evaluate (LLM Grader)    →  Evaluate code quality, readability, best practices
      ↓ If score < threshold
Human Review              →  Periodic sampling to calibrate LLM Judge accuracy
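Wired together, this pipeline might look like the sketch below. The judge call is stubbed out (`llm_judge_score` is a hypothetical placeholder for the evaluate tool, and `QUALITY_THRESHOLD` is an assumed cutoff), so only the control flow is real:

```python
import json

QUALITY_THRESHOLD = 7  # Hypothetical cutoff; tune per project

def llm_judge_score(code: str) -> int:
    """Placeholder for the evaluate tool (a real LLM call in s13)."""
    return 8

def grade(code: str, tests_passed: bool) -> dict:
    # Code Grader first: cheap and objective; failing tests short-circuit everything
    if not tests_passed:
        return {"verdict": "fail", "reason": "tests failed"}

    # LLM Grader second: subjective quality, only for functionally correct code
    score = llm_judge_score(code)
    if score < QUALITY_THRESHOLD:
        return {"verdict": "needs_human_review", "score": score}

    return {"verdict": "pass", "score": score}

print(json.dumps(grade("def add(a, b): return a + b", tests_passed=True)))
```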

Try It

cd learn-claude-code
python agents/s13_evals.py

Recommended prompts:

  • "Write a function that converts Celsius to Fahrenheit with proper error handling" — watch the TDAD loop
  • "Create a simple stack data structure with push, pop, and peek" — triggers LLM-as-Judge evaluation
  • "Build a function that validates email addresses" — see how the agent handles test failures and fixes
  • "Implement a binary search function" — multi-round fix-retry loop with complex logic

References

  • Demystifying evals for AI agents — Anthropic, Jan 2026. Defines the Capability vs Regression Evals framework, three Grader types (Code/LLM/Human), and the core principle "Grade the outcome, not the path."
  • Building AI-Resistant Evaluations — Anthropic, Jan 2026. Discusses how to design evaluation tasks that agents can't easily "game," including balanced datasets and avoiding ambiguous success criteria.
  • Eval Awareness in Claude — Anthropic, Mar 2026. Studies whether models can detect that they're being evaluated, and implications for evaluation design.
  • SWE-bench & Terminal-Bench — Anthropic. Discusses how infrastructure noise (CPU/RAM limits, tool availability) affects agent coding benchmark reliability.
  • Building Effective Agents — Anthropic, Dec 2025. The Evaluator-Optimizer workflow pattern's original definition; TDAD in s13 is a concrete implementation of this pattern.