Compact
Memory Management: Three-Layer Compression
"Context will fill up; you need a way to make room" -- a three-layer compression strategy for infinite sessions.
Harness layer: compression keeps memory clean for infinite sessions.
Problem
The context window is finite. A single read_file on a 1000-line file costs ~4000 tokens. After reading 30 files and running 20 bash commands, you hit 100,000+ tokens. The agent cannot work on large codebases without compression.
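That arithmetic can be made concrete with the common ~4 characters per token heuristic. A minimal sketch of the `estimate_tokens` helper the loop uses later (the name matches the loop code; the chars/4 heuristic is an assumption, not a real tokenizer):

```python
import json

def estimate_tokens(messages: list) -> int:
    """Rough estimate: ~4 characters per token (a heuristic, not a real tokenizer)."""
    return len(json.dumps(messages, default=str)) // 4

# A single read_file result of ~16,000 characters (a 1000-line file)
messages = [{"role": "user", "content": "x" * 16_000}]
print(estimate_tokens(messages))  # ~4000
```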
Solution
Three layers, increasing in aggressiveness. Every turn:

```
+------------------+
| Tool call result |
+------------------+
         |
         v
[Layer 1: micro_compact]  (silent, every turn)
    Replace tool_results > 3 turns old
    with "[Previous: used {tool_name}]"
         |
         v
[Check: tokens > 50000?]
     |         |
     no       yes
     |         |
     v         v
 continue  [Layer 2: auto_compact]
             Save transcript to .transcripts/
             LLM summarizes conversation.
             Replace all messages with [summary].
               |
               v
           [Layer 3: compact tool]
             Model calls compact explicitly.
             Same summarization as auto_compact.
```
How It Works
- Layer 1 -- micro_compact: Before each LLM call, replace old tool results with placeholders.
```python
KEEP_RECENT = 3  # keep the N most recent tool results intact

def micro_compact(messages: list) -> list:
    # Map tool_use_id -> tool name so the placeholder can say which tool ran
    tool_names = {}
    for msg in messages:
        if msg["role"] == "assistant" and isinstance(msg.get("content"), list):
            for part in msg["content"]:
                if isinstance(part, dict) and part.get("type") == "tool_use":
                    tool_names[part["id"]] = part["name"]

    # Collect every tool_result in order
    tool_results = []
    for i, msg in enumerate(messages):
        if msg["role"] == "user" and isinstance(msg.get("content"), list):
            for j, part in enumerate(msg["content"]):
                if isinstance(part, dict) and part.get("type") == "tool_result":
                    tool_results.append((i, j, part))

    if len(tool_results) <= KEEP_RECENT:
        return messages

    # Slim everything except the most recent results
    for _, _, part in tool_results[:-KEEP_RECENT]:
        if len(part.get("content", "")) > 100:
            tool_name = tool_names.get(part.get("tool_use_id"), "a tool")
            part["content"] = f"[Previous: used {tool_name}]"
    return messages
```
- Layer 2 -- auto_compact: When tokens exceed threshold, save full transcript to disk, then ask the LLM to summarize.
```python
import json
import time

def auto_compact(messages: list) -> list:
    # Save full transcript to disk for recovery
    transcript_path = TRANSCRIPT_DIR / f"transcript_{int(time.time())}.jsonl"
    with open(transcript_path, "w") as f:
        for msg in messages:
            f.write(json.dumps(msg, default=str) + "\n")

    # Ask the LLM to summarize the conversation
    response = client.messages.create(
        model=MODEL,
        messages=[{"role": "user", "content":
            "Summarize this conversation for continuity..."
            + json.dumps(messages, default=str)[:80000]}],
        max_tokens=2000,
    )

    # Replace the entire history with the summary plus an acknowledgment
    return [
        {"role": "user", "content": f"[Compressed]\n\n{response.content[0].text}"},
        {"role": "assistant", "content": "Understood. Continuing."},
    ]
```
- Layer 3 -- manual compact: The `compact` tool triggers the same summarization on demand.
- The loop integrates all three:
```python
def agent_loop(messages: list):
    while True:
        micro_compact(messages)                    # Layer 1: slim old tool results
        if estimate_tokens(messages) > THRESHOLD:
            messages[:] = auto_compact(messages)   # Layer 2: threshold-triggered
        response = client.messages.create(...)
        # ... tool execution ...
        if manual_compact:                         # model invoked the compact tool
            messages[:] = auto_compact(messages)   # Layer 3: on demand
```
Transcripts preserve full history on disk. Nothing is truly lost -- just moved out of active context.
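Because the transcript is plain JSONL, recovery is one `json.loads` per line. A minimal round-trip sketch (the `load_transcript` helper and the demo file name are illustrative, not part of the agent):

```python
import json
from pathlib import Path

def load_transcript(path: Path) -> list:
    """Rebuild the full message list from a saved .jsonl transcript."""
    with open(path) as f:
        return [json.loads(line) for line in f]

# Round-trip demo: save a small history, then restore it intact
messages = [
    {"role": "user", "content": "read config.json"},
    {"role": "assistant", "content": "Done."},
]
path = Path("transcript_demo.jsonl")
with open(path, "w") as f:
    for msg in messages:
        f.write(json.dumps(msg, default=str) + "\n")

assert load_transcript(path) == messages
```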
What Changed From s05
| Component | Before (s05) | After (s06) |
|---|---|---|
| Tools | 5 | 6 (5 base + compact) |
| Context mgmt | None | Three-layer compression |
| Micro-compact | None | Old results -> placeholders |
| Auto-compact | None | Token threshold trigger |
| Transcripts | None | Saved to .transcripts/ |
Deep Dive: Design Decisions
Q1: Doesn't micro_compact break things by deleting tool results?
micro_compact doesn't "delete" -- it "slims down". It replaces the content of old tool_results but preserves the full message structure:
```python
# Before
{"type": "tool_result", "tool_use_id": "abc123", "content": "#!/usr/bin/env python3\n...(thousands of chars)"}

# After -- structure intact, content shortened
{"type": "tool_result", "tool_use_id": "abc123", "content": "[Previous: used read_file]"}
```
This preserves the API contract. Anthropic's API requires every tool_use to have a matching tool_result. Deleting the entire tool_result would cause an API error. micro_compact keeps the structure and tool_use_id mapping, just shrinks the content.
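That contract can be made concrete with a hypothetical checker (not part of the agent): every `tool_use` id must have a matching `tool_result`, and vice versa.

```python
def check_tool_pairing(messages: list) -> bool:
    """True iff every tool_use id has a matching tool_result id, and vice versa."""
    use_ids, result_ids = set(), set()
    for msg in messages:
        if not isinstance(msg.get("content"), list):
            continue
        for part in msg["content"]:
            if not isinstance(part, dict):
                continue
            if part.get("type") == "tool_use":
                use_ids.add(part["id"])
            elif part.get("type") == "tool_result":
                result_ids.add(part["tool_use_id"])
    return use_ids == result_ids

# Slimming content keeps the pairing intact; deleting the result breaks it
messages = [
    {"role": "assistant", "content": [{"type": "tool_use", "id": "abc123",
                                       "name": "read_file", "input": {}}]},
    {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "abc123",
                                  "content": "[Previous: used read_file]"}]},
]
assert check_tool_pairing(messages)
messages[1]["content"] = []          # delete the tool_result entirely
assert not check_tool_pairing(messages)
```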
Q2: Why is the placeholder [Previous: used {tool_name}]?
The placeholder preserves minimal useful information:
- Square brackets `[...]` signal metadata -- the model understands this isn't real tool output
- Including the tool name tells the model "I previously used read_file" or "I ran a bash command"
- More informative than a blank placeholder like `[cleared]` -- the model can infer what type of operation was performed
Core principle: Preserve "what I did" with minimal tokens, discard "what the detailed output was".
Q3: What if the user asks about a cleared tool result?
The model re-invokes the tool. For example:
```
User: "What was in that config.json we read earlier?"

Model sees in context:
- tool_use: read_file(path="config.json")
- tool_result: "[Previous: used read_file]"  ← content was cleared

Model's reaction: "I read this file before but the content is gone. Let me read it again."
→ Calls read_file("config.json")
→ Gets current content, answers user
```
This is a feature, not a bug:
- Guarantees fresh data -- the file may have been modified since; re-reading is more reliable than recalling stale content
- Tool calls are cheap -- reading a file takes milliseconds, far cheaper than carrying thousands of tokens in every API call
For non-replayable operations (like one-time API calls), auto_compact's `.transcripts/` archive serves as the safety net for human review.
Q4: Does a small THRESHOLD hurt response quality?
Yes, significantly. The core tradeoff: compression cost = information loss.
| THRESHOLD | Effect | Issue |
|---|---|---|
| 5000 (demo) | Triggers every few turns | Model quickly forgets previous work, may repeat operations or lose key decisions |
| 50000 (default) | Supports dozens of complex turns | Good balance for general use |
| 200000 (near limit) | Rarely compresses | Highest token cost, but fullest context |
Critically: each compression loses a layer of detail. If a small THRESHOLD causes repeated compression, you get "a summary of a summary of a summary" -- information decays rapidly.
Compression is a last-resort survival mechanism, not a feature to maximize. Real Claude Code sets the threshold at ~80-90% of the model's context window, delaying compression as long as possible.
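Under that guidance, the threshold can be derived from the model's context window rather than hard-coded. A one-line sketch (the 200k-token `CONTEXT_WINDOW` and the 85% ratio are assumptions for illustration):

```python
CONTEXT_WINDOW = 200_000   # assumed model limit, in tokens
COMPACT_AT = 0.85          # compress at ~85% full, per the guidance above

THRESHOLD = int(CONTEXT_WINDOW * COMPACT_AT)
print(THRESHOLD)  # 170000
```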
Q5: What granularity does auto_compact's summary achieve?
The summary granularity is determined by the prompt and max_tokens=2000. The prompt asks the model to summarize three things:
- What was accomplished -- high-level outcomes, no code details
- Current state -- progress and remaining work
- Key decisions -- what approach was chosen and why
This produces roughly 500-800 words -- approximately "project manager daily report" granularity. A 30-turn conversation compresses to something like:
```
User requested auth module refactor. Read auth.py (320 lines),
found deprecated JWT library. Upgraded PyJWT from 1.x to 2.x,
changed decode() call (algorithms now required parameter).
Also fixed token expiry from 1h to 24h.
Code changes complete and tests passing. TODO: update requirements.txt.
```
Preserved: what was done, why, what changed, next steps. Lost: code diffs, trial-and-error process, full file contents, exact command outputs.
If the model needs specific code later, it re-reads files with tools -- the same philosophy as micro_compact, applied at a higher level.
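A sketch of how such a summarization prompt might be assembled -- the exact wording is an assumption; only the three required sections and the 80,000-character budget come from the text above:

```python
import json

SUMMARY_PROMPT = """Summarize this conversation for continuity. Cover:
1. What was accomplished (high-level outcomes, no code details)
2. Current state (progress and remaining work)
3. Key decisions (what approach was chosen and why)

Conversation:
"""

def build_summary_request(messages: list, budget: int = 80_000) -> str:
    # Truncate the serialized history so the request itself fits in context
    return SUMMARY_PROMPT + json.dumps(messages, default=str)[:budget]
```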
Try It
```shell
cd learn-claude-code
python agents/s06_context_compact.py
```

Try these prompts:
- "Read every Python file in the agents/ directory one by one" (watch micro-compact replace old results)
- "Keep reading files until compression triggers automatically"
- "Use the compact tool to manually compress the conversation"