# Context Condensing

## Overview
When working on complex tasks, conversations with Kilo Code can grow long and consume a significant portion of the AI model's context window. Context Condensing is a feature that intelligently summarizes your conversation history, reducing token usage while preserving the essential information needed to continue your work effectively.
## The Problem: Context Window Limits
Every AI model has a maximum context window — a limit on how much text it can process at once. As your conversation grows with code snippets, file contents, and back-and-forth discussions, you may approach this limit. When this happens, you might experience:
- Slower responses as the model processes more tokens
- Higher API costs due to increased token usage
- Eventually hitting the context limit and being unable to continue
## The Solution: Auto-Compaction
Kilo Code uses a Compaction system to manage context automatically. When your conversation approaches the token limit, compaction kicks in and produces an anchored summary that captures:
- The overall goal of the session
- Constraints and preferences you gave along the way
- Progress, key decisions, and next steps
- Critical context needed to continue
- Relevant files and directories
This summary replaces older conversation history while Kilo keeps the most recent turns verbatim when they fit. If a session has already been compacted, Kilo updates the previous summary instead of starting over, preserving still-relevant details and removing stale ones.
## How Compaction Triggers

### Automatic trigger
Kilo tracks the session's total token count (input, output, and cached reads and writes) and compares it to the model's context window. Compaction runs when the total reaches the window size minus a reserved buffer of headroom kept free for the next turn.
How the buffer is chosen depends on what the model declares. When the model advertises a separate input limit, the buffer defaults to 20,000 tokens (or the model's maximum output size, whichever is smaller). When the model only declares a single context window, Kilo instead reserves the model's full output cap — up to 32,000 tokens.
Custom models that do not declare a context window are not tracked, and auto-compaction does not run for them.
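A minimal sketch of this trigger logic in TypeScript; the type and function names (`ModelLimits`, `shouldCompact`) are illustrative assumptions, not Kilo Code's actual internals:

```typescript
// Illustrative sketch of the automatic trigger described above.
interface ModelLimits {
  contextWindow?: number;   // total context window, if the model declares one
  inputLimit?: number;      // separate input limit, if the model advertises one
  maxOutputTokens: number;  // the model's output cap
}

const DEFAULT_RESERVE = 20_000;
const OUTPUT_TOKEN_MAX = 32_000; // default ceiling; see environment overrides below

function shouldCompact(totalTokens: number, model: ModelLimits): boolean {
  // Custom models without a declared context window are not tracked.
  if (model.contextWindow === undefined) return false;
  const reserved = model.inputLimit !== undefined
    ? Math.min(DEFAULT_RESERVE, model.maxOutputTokens)   // separate input limit
    : Math.min(model.maxOutputTokens, OUTPUT_TOKEN_MAX); // single context window
  return totalTokens >= model.contextWindow - reserved;
}
```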
### Context Pruning
Between turns, Kilo also runs a lighter prune pass. It walks through completed tool outputs that fall outside a 40,000-token recency window and replaces their content with `[Old tool result content cleared]`. Pruning runs incrementally, so large tool outputs don't consume space forever, even before full compaction is needed.
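A rough sketch of the prune pass, assuming a simplified message shape (`ToolResult` and `prune` are hypothetical names, not Kilo Code's internals):

```typescript
// Hypothetical sketch of the prune pass described above.
const RECENCY_WINDOW = 40_000; // tokens kept untouched at the tail
const CLEARED = "[Old tool result content cleared]";

interface ToolResult { content: string; tokens: number; completed: boolean }

function prune(results: ToolResult[]): void {
  // Walk newest-to-oldest, accumulating tokens; completed tool outputs
  // that fall outside the recency window have their content cleared.
  let recentTokens = 0;
  for (let i = results.length - 1; i >= 0; i--) {
    recentTokens += results[i].tokens;
    if (recentTokens > RECENCY_WINDOW && results[i].completed) {
      results[i].content = CLEARED;
    }
  }
}
```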
### Manual Compaction
You can trigger compaction at any time:
- Slash command: type `/compact` in chat (also findable by typing `smol` or `condense`)
- Task header button: click the compact icon in the active task header
- Settings: toggle auto-compaction in Settings → Context
## Defaults
| Setting | Default | Effect |
|---|---|---|
| `compaction.auto` | `true` | Automatically compact when the usable window is reached |
| `compaction.prune` | `true` | Clear old tool outputs beyond the 40K recency window |
| `compaction.tail_turns` | `2` | Keep the most recent user turns and their responses verbatim when possible |
| `compaction.preserve_recent_tokens` | 25% of usable context, clamped between 2,000 and 8,000 tokens | Token budget for the verbatim recent tail |
| `compaction.reserved` | `min(20000, model_max_output_tokens)` | Token headroom kept free for the next turn; also defines the compaction trigger point |
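As a worked example of the `preserve_recent_tokens` default, the 25% rule clamps like this (illustrative code, not Kilo's implementation):

```typescript
// Default recent-tail budget: 25% of usable context, clamped to [2000, 8000].
function defaultPreserveRecentTokens(usableContext: number): number {
  return Math.min(8_000, Math.max(2_000, Math.floor(usableContext * 0.25)));
}

defaultPreserveRecentTokens(180_000); // => 8000 (clamped at the ceiling)
defaultPreserveRecentTokens(20_000);  // => 5000 (25% lands inside the clamp)
defaultPreserveRecentTokens(6_000);   // => 2000 (clamped at the floor)
```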
## Configuration

Compaction is configured in your `kilo.jsonc` file:
```jsonc
{
  "compaction": {
    "auto": true, // Enable or disable automatic compaction
    "prune": true, // Enable pruning of old tool outputs beyond the recency window
    "tail_turns": 2, // Recent user turns to keep verbatim during compaction
    "preserve_recent_tokens": 8000, // Maximum token budget for the recent tail
    "reserved": 20000, // Token buffer kept free; smaller = later trigger, larger = earlier trigger
  },
}
```
| Option | Type | Default | Description |
|---|---|---|---|
| `compaction.auto` | boolean | `true` | Enable or disable automatic compaction when the usable window is reached |
| `compaction.prune` | boolean | `true` | Enable pruning of old tool outputs outside the 40K-token recency window |
| `compaction.tail_turns` | number | `2` | Number of recent user turns, including the assistant and tool responses that follow them, to keep verbatim during compaction |
| `compaction.preserve_recent_tokens` | number | 25% of usable context, clamped between 2,000 and 8,000 tokens | Maximum token budget for recent turns kept verbatim after compaction |
| `compaction.reserved` | number | `min(20000, model_max_output)` | Token headroom reserved for the next turn. Applies only to models that advertise a separate input limit; models with a single context window use their full output cap as the reserve instead. |
### Use a different model for compaction
Summarization can use a cheaper or larger-context model than your main agent. Configure a dedicated compaction agent:
```jsonc
{
  "agent": {
    "compaction": {
      "model": "anthropic/claude-haiku-4-5",
    },
  },
}
```
If no compaction agent is set, the current session's model is used.
### Environment overrides
| Variable | Effect |
|---|---|
| `KILO_DISABLE_AUTOCOMPACT=1` | Forces `compaction.auto = false` |
| `KILO_DISABLE_PRUNE=1` | Forces `compaction.prune = false` |
| `KILO_EXPERIMENTAL_OUTPUT_TOKEN_MAX` | Overrides the 32,000 default output-token ceiling |
## Best Practices

### When to Compact
- Long sessions: If you've been working for an extended period on a complex task
- Before major transitions: When switching to a different aspect of your project
- When approaching limits: Run `/compact` manually before hitting the automatic trigger if you want control over when the summary is produced
### Tuning `compaction.reserved`
On models that advertise a separate input limit, the reserved value is a trade-off:
- Lower value (e.g. `10000`) → compaction triggers later, you get more turns out of the raw window, but you risk a mid-turn context overflow if a single response is larger than the buffer.
- Higher value (e.g. `40000`) → compaction triggers earlier, fewer overflow errors, but shorter effective conversations between summaries.
The default of ~20K is tuned to leave room for a full-size assistant response plus tool output. The setting has no effect on models with a single context window, which always reserve their full output cap instead.
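To make the trade-off concrete, here is how different `reserved` values shift the trigger point on a hypothetical model with a 200,000-token window and a separate input limit:

```typescript
// Illustrative numbers only; actual windows depend on your model.
const contextWindow = 200_000;
for (const reserved of [10_000, 20_000, 40_000]) {
  console.log(`reserved=${reserved}: compaction triggers at ${contextWindow - reserved} tokens`);
}
// reserved=10000: triggers at 190000 tokens (later, more overflow risk)
// reserved=20000: triggers at 180000 tokens (the default)
// reserved=40000: triggers at 160000 tokens (earlier, safer)
```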
### Maintaining Context Quality
- Be specific in your initial task: A clear task description helps create better summaries
- Use AGENTS.md: Combine with AGENTS.md for persistent project context that doesn't need to be compacted
- Review the summary: After compaction, the summary is visible in your chat history
## Related Features
- AGENTS.md - Persistent context storage across sessions
- Large Projects - Managing context for large codebases
- Codebase Indexing - Efficient code search and retrieval