Compact
Memory Management: Three-Layer Compression
"Context will fill up; you need a way to make room" -- a three-layer compression strategy for infinite sessions.
Harness layer: compression keeps memory clean for infinite sessions.
Problem
The context window is finite. A single read_file on a 1000-line file costs ~4000 tokens. After reading 30 files and running 20 bash commands, you hit 100,000+ tokens. The agent cannot work on large codebases without compression.
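That arithmetic can be made concrete with the common ~4 characters per token heuristic. A minimal sketch of the `estimate_tokens` helper the loop uses later (the name matches the loop code; the chars/4 heuristic is an assumption, not a real tokenizer):

```python
import json

def estimate_tokens(messages: list) -> int:
    """Rough estimate: ~4 characters per token (a heuristic, not a real tokenizer)."""
    return len(json.dumps(messages, default=str)) // 4

# A single read_file result of ~16,000 characters (a 1000-line file)
messages = [{"role": "user", "content": "x" * 16_000}]
print(estimate_tokens(messages))  # ~4000
```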
Solution
Three layers, increasing in aggressiveness. Every turn:

```
+------------------+
| Tool call result |
+------------------+
         |
         v
[Layer 1: micro_compact]  (silent, every turn)
    Replace tool_results > 3 turns old
    with "[Previous: used {tool_name}]"
         |
         v
[Check: tokens > 50000?]
     |         |
     no       yes
     |         |
     v         v
 continue  [Layer 2: auto_compact]
             Save transcript to .transcripts/
             LLM summarizes conversation.
             Replace all messages with [summary].
               |
               v
           [Layer 3: compact tool]
             Model calls compact explicitly.
             Same summarization as auto_compact.
```
How It Works
- Layer 1 -- micro_compact: Before each LLM call, replace old tool results with placeholders.
```python
KEEP_RECENT = 3  # keep the N most recent tool results intact

def micro_compact(messages: list) -> list:
    # Map tool_use_id -> tool name so the placeholder can say which tool ran
    tool_names = {}
    for msg in messages:
        if msg["role"] == "assistant" and isinstance(msg.get("content"), list):
            for part in msg["content"]:
                if isinstance(part, dict) and part.get("type") == "tool_use":
                    tool_names[part["id"]] = part["name"]

    # Collect every tool_result in order
    tool_results = []
    for i, msg in enumerate(messages):
        if msg["role"] == "user" and isinstance(msg.get("content"), list):
            for j, part in enumerate(msg["content"]):
                if isinstance(part, dict) and part.get("type") == "tool_result":
                    tool_results.append((i, j, part))

    if len(tool_results) <= KEEP_RECENT:
        return messages

    # Slim everything except the most recent results
    for _, _, part in tool_results[:-KEEP_RECENT]:
        if len(part.get("content", "")) > 100:
            tool_name = tool_names.get(part.get("tool_use_id"), "a tool")
            part["content"] = f"[Previous: used {tool_name}]"
    return messages
```
- Layer 2 -- auto_compact: When tokens exceed threshold, save full transcript to disk, then ask the LLM to summarize.
```python
import json
import time

def auto_compact(messages: list) -> list:
    # Save full transcript to disk for recovery
    transcript_path = TRANSCRIPT_DIR / f"transcript_{int(time.time())}.jsonl"
    with open(transcript_path, "w") as f:
        for msg in messages:
            f.write(json.dumps(msg, default=str) + "\n")

    # Ask the LLM to summarize the conversation
    response = client.messages.create(
        model=MODEL,
        messages=[{"role": "user", "content":
            "Summarize this conversation for continuity..."
            + json.dumps(messages, default=str)[:80000]}],
        max_tokens=2000,
    )

    # Replace the entire history with the summary plus an acknowledgment
    return [
        {"role": "user", "content": f"[Compressed]\n\n{response.content[0].text}"},
        {"role": "assistant", "content": "Understood. Continuing."},
    ]
```
- Layer 3 -- manual compact: The `compact` tool triggers the same summarization on demand.
- The loop integrates all three:
```python
def agent_loop(messages: list):
    while True:
        micro_compact(messages)                    # Layer 1: slim old tool results
        if estimate_tokens(messages) > THRESHOLD:
            messages[:] = auto_compact(messages)   # Layer 2: threshold-triggered
        response = client.messages.create(...)
        # ... tool execution ...
        if manual_compact:                         # model invoked the compact tool
            messages[:] = auto_compact(messages)   # Layer 3: on demand
```
Transcripts preserve full history on disk. Nothing is truly lost -- just moved out of active context.
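Because the transcript is plain JSONL, recovery is one `json.loads` per line. A minimal round-trip sketch (the `load_transcript` helper and the demo file name are illustrative, not part of the agent):

```python
import json
from pathlib import Path

def load_transcript(path: Path) -> list:
    """Rebuild the full message list from a saved .jsonl transcript."""
    with open(path) as f:
        return [json.loads(line) for line in f]

# Round-trip demo: save a small history, then restore it intact
messages = [
    {"role": "user", "content": "read config.json"},
    {"role": "assistant", "content": "Done."},
]
path = Path("transcript_demo.jsonl")
with open(path, "w") as f:
    for msg in messages:
        f.write(json.dumps(msg, default=str) + "\n")

assert load_transcript(path) == messages
```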
What Changed From s05
| Component | Before (s05) | After (s06) |
|---|---|---|
| Tools | 5 | 6 (5 base + compact) |
| Context mgmt | None | Three-layer compression |
| Micro-compact | None | Old results -> placeholders |
| Auto-compact | None | Token threshold trigger |
| Transcripts | None | Saved to .transcripts/ |
Deep Dive: Design Decisions
Q1: Doesn't micro_compact break things by deleting tool results?
micro_compact doesn't "delete" -- it "slims down". It replaces the content of old tool_results but preserves the full message structure:
```python
# Before
{"type": "tool_result", "tool_use_id": "abc123", "content": "#!/usr/bin/env python3\n...(thousands of chars)"}

# After -- structure intact, content shortened
{"type": "tool_result", "tool_use_id": "abc123", "content": "[Previous: used read_file]"}
```
This preserves the API contract. Anthropic's API requires every tool_use to have a matching tool_result. Deleting the entire tool_result would cause an API error. micro_compact keeps the structure and tool_use_id mapping, just shrinks the content.
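That contract can be made concrete with a hypothetical checker (not part of the agent): every `tool_use` id must have a matching `tool_result`, and vice versa.

```python
def check_tool_pairing(messages: list) -> bool:
    """True iff every tool_use id has a matching tool_result id, and vice versa."""
    use_ids, result_ids = set(), set()
    for msg in messages:
        if not isinstance(msg.get("content"), list):
            continue
        for part in msg["content"]:
            if not isinstance(part, dict):
                continue
            if part.get("type") == "tool_use":
                use_ids.add(part["id"])
            elif part.get("type") == "tool_result":
                result_ids.add(part["tool_use_id"])
    return use_ids == result_ids

# Slimming content keeps the pairing intact; deleting the result breaks it
messages = [
    {"role": "assistant", "content": [{"type": "tool_use", "id": "abc123",
                                       "name": "read_file", "input": {}}]},
    {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "abc123",
                                  "content": "[Previous: used read_file]"}]},
]
assert check_tool_pairing(messages)
messages[1]["content"] = []          # delete the tool_result entirely
assert not check_tool_pairing(messages)
```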
Q2: Why is the placeholder [Previous: used {tool_name}]?
The placeholder preserves minimal useful information:
- Square brackets `[...]` signal metadata -- the model understands this isn't real tool output
- Including the tool name tells the model "I previously used read_file" or "I ran a bash command"
- More informative than a blank placeholder like `[cleared]` -- the model can infer what type of operation was performed
Core principle: Preserve "what I did" with minimal tokens, discard "what the detailed output was".
Q3: What if the user asks about a cleared tool result?
The model re-invokes the tool. For example:
```
User: "What was in that config.json we read earlier?"

Model sees in context:
- tool_use: read_file(path="config.json")
- tool_result: "[Previous: used read_file]"  ← content was cleared

Model's reaction: "I read this file before but the content is gone. Let me read it again."
→ Calls read_file("config.json")
→ Gets current content, answers user
```
This is a feature, not a bug:
- Guarantees fresh data -- the file may have been modified since; re-reading is more reliable than recalling stale content
- Tool calls are cheap -- reading a file takes milliseconds, far cheaper than carrying thousands of tokens in every API call
For non-replayable operations (like one-time API calls), auto_compact's `.transcripts/` archive serves as the safety net for human review.
Q4: Does a small THRESHOLD hurt response quality?
Yes, significantly. The core tradeoff: compression cost = information loss.
| THRESHOLD | Effect | Issue |
|---|---|---|
| 5000 (demo) | Triggers every few turns | Model quickly forgets previous work, may repeat operations or lose key decisions |
| 50000 (default) | Supports dozens of complex turns | Good balance for general use |
| 200000 (near limit) | Rarely compresses | Highest token cost, but fullest context |
Critically: each compression loses a layer of detail. If a small THRESHOLD causes repeated compression, you get "a summary of a summary of a summary" -- information decays rapidly.
Compression is a last-resort survival mechanism, not a feature to maximize. Real Claude Code sets the threshold at ~80-90% of the model's context window, delaying compression as long as possible.
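Under that guidance, the threshold can be derived from the model's context window rather than hard-coded. A one-line sketch (the 200k-token `CONTEXT_WINDOW` and the 85% ratio are assumptions for illustration):

```python
CONTEXT_WINDOW = 200_000   # assumed model limit, in tokens
COMPACT_AT = 0.85          # compress at ~85% full, per the guidance above

THRESHOLD = int(CONTEXT_WINDOW * COMPACT_AT)
print(THRESHOLD)  # 170000
```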
Q5: What granularity does auto_compact's summary achieve?
The summary granularity is determined by the prompt and max_tokens=2000. The prompt asks the model to summarize three things:
- What was accomplished -- high-level outcomes, no code details
- Current state -- progress and remaining work
- Key decisions -- what approach was chosen and why
This produces roughly 500-800 words -- approximately "project manager daily report" granularity. A 30-turn conversation compresses to something like:
```
User requested auth module refactor. Read auth.py (320 lines),
found deprecated JWT library. Upgraded PyJWT from 1.x to 2.x,
changed decode() call (algorithms now required parameter).
Also fixed token expiry from 1h to 24h.
Code changes complete and tests passing. TODO: update requirements.txt.
```
Preserved: what was done, why, what changed, next steps. Lost: code diffs, trial-and-error process, full file contents, exact command outputs.
If the model needs specific code later, it re-reads files with tools -- the same philosophy as micro_compact, applied at a higher level.
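A sketch of how such a summarization prompt might be assembled -- the exact wording is an assumption; only the three required sections and the 80,000-character budget come from the text above:

```python
import json

SUMMARY_PROMPT = """Summarize this conversation for continuity. Cover:
1. What was accomplished (high-level outcomes, no code details)
2. Current state (progress and remaining work)
3. Key decisions (what approach was chosen and why)

Conversation:
"""

def build_summary_request(messages: list, budget: int = 80_000) -> str:
    # Truncate the serialized history so the request itself fits in context
    return SUMMARY_PROMPT + json.dumps(messages, default=str)[:budget]
```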
Try It
```shell
cd learn-claude-code
python agents/s06_context_compact.py
```

Try these prompts:
- "Read every Python file in the agents/ directory one by one" (watch micro-compact replace old results)
- "Keep reading files until compression triggers automatically"
- "Use the compact tool to manually compress the conversation"