Benchmarking
Summary
This document proposes a benchmarking system for Kilo Code with two primary goals:
- Compare models against one another using the same agent -- measuring task completion, token cost, and total time
- Compare agents against one another using the same model -- e.g., Kilo Code vs Claude Code, or Kilo Code v1.0 vs v1.1
The design leverages existing open source infrastructure rather than building a custom harness:
- Harbor as the evaluation framework, with Terminal-Bench and other datasets for task definitions
- ATIF (Agent Trajectory Interchange Format) for structured, per-step trace logging
- Opik for trace ingestion, step-level LLM judge evaluation, and root cause analysis
The key engineering deliverable is a Kilo Code Harbor adapter that runs Kilo CLI autonomously in containerized environments and emits ATIF-compliant trajectories.
This is separate from production observability, which monitors real user sessions via PostHog. Benchmarking is an offline evaluation system for comparing quality, cost, and performance across models and agents.
Problem Statement
As Kilo Code evolves, we need systematic answers to questions like:
- Did our latest release make the agent better or worse?
- Which model gives the best results for our users at a given price point?
- How does Kilo Code compare to Claude Code, Codex, or other agents on the same tasks?
- When a benchmark score drops, what specific step or decision caused the regression?
Today we have no structured way to answer these questions. Manual testing is not reproducible, and our existing PostHog telemetry does not capture the turn-by-turn detail that comparative analysis requires.
Goals
- Run Kilo Code against standardized benchmark datasets in a reproducible, containerized environment
- Compare model performance (same agent, different models) on task completion, token cost, and wall-clock time
- Compare agent performance (same model, different agents or Kilo versions) on the same metrics
- Capture detailed per-step traces for root cause analysis when results differ
- Make it easy to create custom task sets for targeted evaluation or marketing purposes
Non-goals:
- Production monitoring (covered by Agent Observability)
- Automated remediation based on benchmark results
Architecture
┌───────────────────────────────────────────────────────────┐
│                     Harbor Framework                      │
│                                                           │
│  ┌────────────────┐  ┌─────────────┐  ┌─────────────────┐ │
│  │ Terminal-Bench │  │  SWE-bench  │  │  Custom Tasks   │ │
│  │      2.0       │  │             │  │ (Kilo-specific) │ │
│  └───────┬────────┘  └──────┬──────┘  └────────┬────────┘ │
│          └─────────────────┼─────────────────┘           │
│                             ▼                             │
│              ┌─────────────────────────────┐              │
│              │     Containerized Trial     │              │
│              │                             │              │
│              │   ┌─────────────────────┐   │              │
│              │   │  Agent Under Test   │   │              │
│              │   │    (kilo --auto)    │   │              │
│              │   └──────────┬──────────┘   │              │
│              │              │              │              │
│              │              ▼              │              │
│              │   ┌─────────────────────┐   │              │
│              │   │      Model API      │   │              │
│              │   │    (Opus, GPT-5,    │   │              │
│              │   │    Gemini, etc.)    │   │              │
│              │   └─────────────────────┘   │              │
│              └──────────────┬──────────────┘              │
│                             │                             │
│                             ▼                             │
│              ┌─────────────────────────────┐              │
│              │       ATIF Trajectory       │              │
│              │      (per-step traces)      │              │
│              └──────────────┬──────────────┘              │
└─────────────────────────────┼─────────────────────────────┘
                              │
            ┌─────────────────┴────────────┐
            ▼                              ▼
┌────────────────────────┐  ┌────────────────────────────┐
│  tbench.ai Dashboard   │  │            Opik            │
│  - Leaderboard         │  │  - Step-level traces       │
│  - Task pass/fail      │  │  - LLM judge per step      │
│  - Asciinema replay    │  │  - Cost attribution        │
│  - Aggregate scores    │  │  - Root cause comparison   │
└────────────────────────┘  └────────────────────────────┘
Components
Harbor Framework
Harbor is the evaluation framework built by the Terminal-Bench team. It provides:
- Containerized environments for reproducible task execution
- Pre-integrated agents: Claude Code, Codex, Gemini CLI, OpenHands, Terminus-2
- A registry of benchmark datasets: Terminal-Bench, SWE-bench, LiveCodeBench, and more
- Cloud scaling via Daytona, Modal, and E2B for running trials in parallel
- Automatic ATIF trajectory generation for all integrated agents
Harbor is the standard evaluation framework used by many frontier labs. Rather than building our own harness, we write a Kilo Code adapter and plug into the existing ecosystem.
ATIF (Agent Trajectory Interchange Format)
ATIF is a standardized JSON format for logging the complete interaction history of an agent run. Each trajectory captures:
- Every step: User messages, agent responses, tool calls, observations
- Per-step metrics: Token counts (prompt, completion, cached), cost in USD, latency
- Tool call detail: Function name, arguments, and observation results
- Reasoning content: The agent's internal reasoning at each step (when available)
- Aggregate metrics: Total tokens, total cost, total steps
This granularity is what enables step-level comparison between runs -- not just "did it pass or fail" but "at step 7, Agent A chose tool X while Agent B chose tool Y."
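For orientation, a single step in such a trajectory might be serialized roughly like the sketch below. This is illustrative only: the field names are assumptions drawn from the capability list above, not the normative ATIF schema.

```python
import json

# Illustrative step record; field names are assumptions, not the ATIF spec.
step = {
    "step_id": 7,
    "actor": "agent",
    "message": "The build fails on a missing import; I'll inspect the module first.",
    "reasoning": "Locate the import before editing, to avoid guessing the path.",  # when available
    "tool_calls": [
        {
            "function": "run_command",
            "arguments": {"command": "grep -rn 'import settings' src/"},
            "observation": "src/app.py:3: import settings",
        }
    ],
    "metrics": {
        "prompt_tokens": 2381,
        "completion_tokens": 164,
        "cached_tokens": 1950,
        "cost_usd": 0.0312,
        "latency_ms": 4210,
    },
}

print(json.dumps(step, indent=2))
```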
Opik
Opik (by Comet) provides trace ingestion and analysis with a first-class Harbor integration. Running benchmarks through Opik is as simple as:
opik harbor run -d terminal-bench@head -a kilo -m anthropic/claude-opus-4
Opik adds value beyond what the tbench.ai dashboard provides:
| Capability | tbench.ai Dashboard | Opik |
|---|---|---|
| Task-level pass/fail | Yes | Yes |
| Aggregate leaderboard | Yes | No |
| Asciinema replay | Yes | No |
| Step-level trace view | No | Yes |
| Step-level LLM judge | No | Yes |
| Cost attribution per step | No | Yes |
| Side-by-side trace comparison | No | Yes |
| Root cause analysis | No | Yes |
The two dashboards are complementary: tbench.ai for high-level leaderboard comparisons, Opik for drilling into why a specific run succeeded or failed.
Datasets
Harbor's registry provides access to established benchmark datasets. The choice of dataset depends on what you are evaluating:
| Dataset | Focus | Use Case |
|---|---|---|
| Terminal-Bench 2.0 | CLI/terminal tasks (89 tasks) | General agent capability on hard, realistic tasks |
| SWE-bench | Real GitHub issues in real repos | Software engineering task completion |
| LiveCodeBench | Competitive programming problems | Code generation quality |
| Custom task sets | Whatever you define | Targeted evaluation, marketing, regression testing |
Creating Custom Task Sets
Creating a custom Harbor task set is straightforward. Each task consists of:
- A Dockerfile defining the environment (OS, installed packages, repo state)
- A task description (the prompt given to the agent)
- A verification script (tests that determine pass/fail)
- Optionally, a reference solution
This makes it easy to create task sets that target specific Kilo Code capabilities -- for example, a set of refactoring tasks, or a set of multi-file debugging scenarios. Custom sets can be published to the Harbor registry or kept private.
See the Harbor task tutorial for a step-by-step guide.
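As a concrete illustration, a verification script for a hypothetical Kilo refactoring task could be a plain Python file whose exit code determines pass/fail. The paths and expectations below are invented for this sketch, not taken from a real Harbor task.

```python
#!/usr/bin/env python3
"""Verification script for a hypothetical refactoring task (illustrative only)."""
import subprocess
import sys
from pathlib import Path


def main() -> int:
    # The task asked the agent to extract shared logic into a helper module.
    if not Path("src/payments/helpers.py").exists():
        print("FAIL: expected extracted module src/payments/helpers.py")
        return 1

    # The pre-existing test suite must still pass after the refactor.
    result = subprocess.run([sys.executable, "-m", "pytest", "-q", "tests"])
    if result.returncode != 0:
        print("FAIL: test suite broke during the refactor")
        return 1

    print("PASS")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```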
Deliverables
1. Kilo Code Harbor Adapter
The primary engineering deliverable. This adapter:
- Installs Kilo CLI in a Docker container
- Configures autonomous execution using kilo run --auto, which disables all permission prompts so the agent runs fully unattended
- Translates Harbor task prompts into Kilo CLI invocations
- Emits ATIF-compliant trajectories capturing every step, tool call, and metric
The adapter follows the same pattern as existing Harbor agents (see the OpenHands adapter for reference). The key implementation detail is the populate_context_post_run method that converts Kilo's execution log into ATIF format.
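A rough sketch of that conversion is shown below. Everything except the populate_context_post_run name is an assumption: the real base class, hooks, log location, and field names should be copied from the OpenHands adapter and Kilo's actual log format.

```python
import json
from pathlib import Path


class KiloCodeAdapter:
    """Sketch of the Kilo Code Harbor adapter's trajectory conversion.

    Only populate_context_post_run is named in this document; the log file
    path and field names below are illustrative assumptions.
    """

    def populate_context_post_run(self, trial_dir: Path) -> dict:
        # Read the execution log Kilo CLI wrote during the trial (assumed location).
        raw = json.loads((trial_dir / "kilo-log.json").read_text())

        steps = []
        for i, entry in enumerate(raw.get("entries", [])):
            steps.append(
                {
                    "step_id": i,
                    "actor": entry.get("role", "agent"),
                    "message": entry.get("message"),
                    "reasoning": entry.get("reasoning"),
                    "tool_calls": entry.get("tool_calls", []),
                    # Token counts, cost, and latency as reported by the CLI.
                    "metrics": entry.get("usage", {}),
                }
            )

        # Aggregate totals expected by ATIF-style consumers.
        return {
            "schema": "ATIF",
            "steps": steps,
            "totals": {
                "steps": len(steps),
                "cost_usd": sum(s["metrics"].get("cost_usd", 0.0) for s in steps),
            },
        }
```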
Autonomous execution is critical. Harbor runs containerized trials in parallel and expects agents to execute from start to finish without human intervention. The adapter must ensure:
- No interactive prompts for API keys (injected via environment variables)
- No permission dialogs for file writes, command execution, etc.
- Graceful timeout handling if the agent gets stuck
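A minimal sketch of enforcing these constraints when launching the CLI follows; the environment variable name, any flag other than --auto, and the timeout value are assumptions, not settled interface details.

```python
import os
import subprocess


def run_kilo_unattended(task_prompt: str, timeout_s: int = 1800) -> int:
    """Launch Kilo CLI with no chance of blocking on human input (sketch)."""
    env = dict(os.environ)
    # The API key must already be present; the agent is never prompted for it.
    assert "KILO_API_KEY" in env, "inject KILO_API_KEY via the container environment"

    try:
        # --auto disables permission dialogs for file writes, command execution, etc.
        proc = subprocess.run(
            ["kilo", "run", "--auto", task_prompt],
            env=env,
            timeout=timeout_s,  # graceful timeout if the agent gets stuck
            capture_output=True,
            text=True,
        )
        return proc.returncode
    except subprocess.TimeoutExpired:
        print(f"Agent exceeded {timeout_s}s; recording the trial as a timeout failure")
        return 124
```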
2. Custom Task Set Template
Documentation and examples for creating Kilo-specific task sets:
- Template Dockerfile and verification script
- Guidelines for writing good task descriptions
- Examples of tasks that highlight coding agent capabilities
- Instructions for publishing to Harbor's registry or running privately
This enables the team to create targeted benchmarks for marketing, regression testing, or capability evaluation.
3. Opik Integration
Configure the Opik-Harbor integration for Kilo Code benchmark runs:
- Set up opik harbor run with the Kilo Code adapter
- Define standard LLM judge criteria for step-level evaluation (sketched after this list):
- Tool choice correctness: Did the agent use the right tool at each step?
- Reasoning quality: Was the agent's reasoning at each step sound?
- Efficiency: Were there unnecessary or redundant steps?
- Create saved views for common comparison scenarios (model-vs-model, version-vs-version)
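To make the criteria concrete, a step-level judge can be expressed as a prompt template applied to each trajectory step. The sketch below keeps the LLM call behind a plain callable so it stays independent of any particular SDK; the criterion wording is a starting point for the team to refine, not a settled rubric.

```python
from typing import Callable, Dict

# Starting-point criteria; wording to be refined once real trajectories are available.
JUDGE_CRITERIA: Dict[str, str] = {
    "tool_choice": "Did the agent use the right tool at this step?",
    "reasoning_quality": "Was the agent's reasoning at this step sound and grounded in prior observations?",
    "efficiency": "Was this step necessary, or does it repeat earlier work?",
}


def judge_step(step: dict, criterion: str, llm: Callable[[str], str]) -> str:
    """Evaluate one trajectory step against one criterion.

    `step` follows the ATIF-style shape sketched earlier in this document;
    `llm` is any callable that takes a prompt and returns the judge's verdict.
    """
    prompt = (
        "You are judging a single step of a coding agent's trajectory.\n"
        f"Criterion: {JUDGE_CRITERIA[criterion]}\n"
        f"Agent message: {step.get('message')}\n"
        f"Reasoning: {step.get('reasoning')}\n"
        f"Tool calls: {step.get('tool_calls')}\n"
        "Answer PASS or FAIL, followed by a one-sentence justification."
    )
    return llm(prompt)
```

Keeping the judge behind a plain callable makes it straightforward to route the same criteria through Opik's evaluation pipeline later without rewriting them.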
4. CI Regression Detection
Lower priority. Implement after the core benchmarking system is working.
Run a small subset of benchmark tasks (10-15) on release branches to catch regressions before shipping. Harbor supports this pattern natively. The subset should be chosen for:
- Fast execution (under 5 minutes per task)
- High signal (tasks that historically differentiate good and bad agent behavior)
- Stability (deterministic verification, not flaky)
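One way the gate could be wired up once per-task results are exported is sketched below; the summary file names, JSON shape, and tolerance are placeholders, not Harbor output formats.

```python
import json
import sys

# Tolerance before CI fails; tune once baseline variance on the subset is known.
REGRESSION_TOLERANCE = 0.05


def pass_rate(summary_path: str) -> float:
    """Pass rate from a per-task summary, e.g. {"task-a": true, "task-b": false}."""
    with open(summary_path) as f:
        results = json.load(f)
    return sum(1 for passed in results.values() if passed) / len(results)


def main() -> int:
    baseline = pass_rate("baseline_summary.json")    # last release's subset run
    candidate = pass_rate("candidate_summary.json")  # this release branch's run
    if candidate + REGRESSION_TOLERANCE < baseline:
        print(f"Regression: pass rate fell from {baseline:.0%} to {candidate:.0%}")
        return 1
    print(f"OK: pass rate {candidate:.0%} (baseline {baseline:.0%})")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```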
Example Workflows
Comparing Models
Run the same Kilo Code agent against Terminal-Bench with different models:
# Run with Claude Opus
opik harbor run -d terminal-bench@2.0 -a kilo -m anthropic/claude-opus-4
# Run with GPT-5
opik harbor run -d terminal-bench@2.0 -a kilo -m openai/gpt-5
# Run with Gemini 3 Pro
opik harbor run -d terminal-bench@2.0 -a kilo -m google/gemini-3-pro
Compare results in tbench.ai for aggregate scores and in Opik for step-level analysis of where models diverge.
Comparing Agents
Run different agents against the same dataset with the same model:
# Run Kilo Code
opik harbor run -d terminal-bench@2.0 -a kilo -m anthropic/claude-opus-4
# Run Claude Code
opik harbor run -d terminal-bench@2.0 -a claude-code -m anthropic/claude-opus-4
Comparing Kilo Versions
Test a new release against the previous version:
# Run current release
opik harbor run -d terminal-bench@2.0 -a kilo@v2.0 -m anthropic/claude-opus-4
# Run candidate release
opik harbor run -d terminal-bench@2.0 -a kilo@v2.1-rc1 -m anthropic/claude-opus-4
Use Opik's trace comparison view to identify specific steps where the new version regressed or improved.
Running a Custom Task Set
# Run against a custom Kilo-specific dataset
opik harbor run -d kilo-refactoring@1.0 -a kilo -m anthropic/claude-opus-4
LLM Judge: Two Levels
Harbor provides task-level judging (did the agent solve the task?). Opik adds step-level evaluation:
| Level | Tool | What It Tells You |
|---|---|---|
| Task-level | Harbor | Pass/fail, score, total time, total cost |
| Step-level | Opik | At step N, the agent chose tool X when it should have used tool Y. The reasoning was flawed because of Z. This step cost $0.03 and took 4 seconds. |
Step-level evaluation is where root cause debugging happens. When a benchmark score drops between versions, you can trace back to the exact decision point that caused the regression.
Relationship to Production Observability
This benchmarking system is complementary to, but separate from, the Agent Observability system:
| Concern | Benchmarking | Production Observability |
|---|---|---|
| Purpose | Offline evaluation of agent quality | Real-time monitoring of user sessions |
| Data source | Controlled benchmark tasks | Real user interactions |
| Tools | Harbor, Opik, tbench.ai | PostHog, custom metrics |
| When | Before release, on-demand | Continuously in production |
| Output | Leaderboard scores, trace comparisons | Alerts, dashboards, SLO tracking |