KiloBench
Performance data for AI coding models officially evaluated by the Kilo team on Terminal Bench 2.0. We run evaluations using our actual agent harness to measure harness-specific pass rates and true production costs — because your benchmark score doesn't pay the API bills.
Cost vs performance across the most capable coding models
Top 10 Most Capable Models
| Rank | Model | Completion | Cost per attempt |
|---|---|---|---|
| 1 | | 74.2% | $72.63 |
| 2 | | 70.1% | $100.51 |
| 3 | | 67.6% | $85.19 |
| 4 | | 64.7% | $104.49 |
| 5 | | 60.7% | $32.94 |
| 6 | | 55.1% | $53.37 |
| 7 | | 54.6% | $20.65 |
| 8 | | 54.4% | $24.84 |
| 9 | | 53.0% | $26.21 |
| 10 | | 50.6% | $30.70 |
Official Kilo eval results on Terminal Bench 2.0. Cost and token usage are averaged per complete benchmark attempt.
Official KiloBench Results
Officially published KiloBench scores.
| Rank | Model | Completion | Cost per attempt |
|---|---|---|---|
| 1 | | 74.2% | $72.63 |
| 2 | | 70.1% | $100.51 |
| 3 | | 67.6% | $85.19 |
| 4 | | 64.7% | $104.49 |
| 5 | | 60.7% | $32.94 |
| 6 | | 55.1% | $53.37 |
| 7 | | 54.6% | $20.65 |
| 8 | | 54.4% | $24.84 |
| 9 | | 53.0% | $26.21 |
| 10 | | 50.6% | $30.70 |
| 11 | | 49.4% | $23.98 |
| 12 | | 47.6% | $4.92 |
| 13 | | 47.6% | $10.35 |
| 14 | | 46.7% | $19.60 |
| 15 | | 44.0% | $15.91 |
| 16 | | 28.1% | $30.82 |
| 17 | | 25.4% | $0.00 |
| 18 | | 19.1% | $101.82 |
| 19 | | 15.5% | $0.00 |
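To read the cost-versus-performance trade-off off the table, one option is to compute which entries are not beaten on both axes at once (a Pareto frontier). The sketch below is an illustration, not part of KiloBench: the function name `pareto_frontier` is hypothetical, entries are identified by rank because model names are not present in this copy of the table, and the `$0.00` costs are taken at face value.

```python
# (rank, completion %, cost per attempt in USD), copied from the table above.
RESULTS = [
    (1, 74.2, 72.63), (2, 70.1, 100.51), (3, 67.6, 85.19), (4, 64.7, 104.49),
    (5, 60.7, 32.94), (6, 55.1, 53.37), (7, 54.6, 20.65), (8, 54.4, 24.84),
    (9, 53.0, 26.21), (10, 50.6, 30.70), (11, 49.4, 23.98), (12, 47.6, 4.92),
    (13, 47.6, 10.35), (14, 46.7, 19.60), (15, 44.0, 15.91), (16, 28.1, 30.82),
    (17, 25.4, 0.00), (18, 19.1, 101.82), (19, 15.5, 0.00),
]


def pareto_frontier(rows):
    """Return ranks of entries that no cheaper-or-equal entry matches or beats.

    Hypothetical helper: walk the rows from cheapest to most expensive and keep
    an entry only if it completes strictly more than everything cheaper.
    """
    frontier = []
    best_completion = float("-inf")
    # Sort by cost ascending; on equal cost, higher completion comes first.
    for rank, completion, cost in sorted(rows, key=lambda r: (r[2], -r[1])):
        if completion > best_completion:
            frontier.append(rank)
            best_completion = completion
    return frontier


frontier_ranks = pareto_frontier(RESULTS)
print(frontier_ranks)  # -> [17, 12, 7, 5, 1]
```

Every entry outside this list has at least one alternative that costs no more and completes at least as much, under the table's own numbers.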
Results at a glance
How to read KiloBench results
- Eval suite: Terminal Bench 2.0
- Harness: Kilo's actual agent framework
- Tracked metrics: Completion %, cost per attempt
Completion % is the primary signal
Higher is better. It measures the fraction of benchmark tasks the model completed end-to-end through Kilo's harness — not a generic scaffold.
Cost per attempt reflects real bills
Sticker per-token pricing tells you almost nothing. These costs include reasoning tokens, cumulative context re-sends, and agent loop overhead from Kilo's actual pipeline.
Bubble size = Kilo popularity
Larger bubbles mean more real-world developer token usage in Kilo Code last week, so you can see what the community actually reaches for.
Tested end-to-end, not per-task
Each model is run as a full agent — planning, tool use, multi-step execution, and self-correction — across all 89 Terminal Bench 2.0 tasks per trial.
The benchmark ceiling
Generic benchmarks answer generic questions. We needed more.
Industry leaderboards like SWE-bench Verified are approaching saturation: the top six models are separated by 1.3 percentage points. Scaffold inflation, where vendors tune agent orchestration layers solely to game benchmarks, makes raw scores misleading. OpenAI stopped reporting SWE-bench Verified entirely.
No standard leaderboard measures cost per task or harness fit. A model scoring 80% at $2/attempt may be worse value than one scoring 75% at $0.20/attempt: five extra points of completion for ten times the cost. KiloBench exists to answer that question for the framework that matters: yours.
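One way to make that comparison concrete is to divide cost per attempt by completion rate, which gives the expected spend per successful completion. This is a minimal sketch under an assumption the page does not state: failed attempts are simply retried, and each attempt succeeds independently with the same probability. The function name `cost_per_success` is hypothetical.

```python
def cost_per_success(cost_per_attempt: float, completion_rate: float) -> float:
    """Expected USD spent per successful completion.

    Assumes independent retries with a fixed success probability, so the
    expected number of attempts is 1 / completion_rate.
    """
    if not 0 < completion_rate <= 1:
        raise ValueError("completion_rate must be in (0, 1]")
    return cost_per_attempt / completion_rate


# The two models from the paragraph above.
expensive = cost_per_success(2.00, 0.80)
cheap = cost_per_success(0.20, 0.75)

print(f"80% at $2.00/attempt: ${expensive:.2f} per success")
print(f"75% at $0.20/attempt: ${cheap:.2f} per success")
```

Under these assumptions the 80% model works out to $2.50 per success and the 75% model to about $0.27.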
What KiloBench measures that others don't
Harness-specific pass rates
The same model scores differently depending on which agent framework wraps it. KiloBench measures performance under Kilo's real tools, context pipeline, and retry logic.
True cost per attempt
Reasoning tokens are billed at output rates but are often hidden from the visible output. Agent loops re-send the same context 20+ times. KiloBench captures all of it: the actual bill, not the sticker price.
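The effect of re-sent context and hidden reasoning tokens can be sketched with a toy cost model. Everything here is an assumption for illustration: the function name `agent_loop_cost`, the token counts, and the prices are made up, the context is assumed to grow by a fixed amount each step, and prompt caching is ignored. It is not how Kilo computes the figures on this page.

```python
def agent_loop_cost(
    steps: int,
    base_context_tokens: int,
    tokens_added_per_step: int,
    visible_output_tokens: int,
    reasoning_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
):
    """Return (input tokens billed, output tokens billed, cost in USD).

    Each step re-sends the whole context so far as input, then the context
    grows. Reasoning tokens are billed at the output rate.
    """
    context = base_context_tokens
    input_total = 0
    output_total = 0
    for _ in range(steps):
        input_total += context  # full context re-sent on every step
        output_total += visible_output_tokens + reasoning_tokens
        context += tokens_added_per_step  # tool results and replies accumulate
    cost = (
        input_total * input_price_per_mtok
        + output_total * output_price_per_mtok
    ) / 1_000_000
    return input_total, output_total, cost


# Hypothetical numbers: 20 steps, $3 / $15 per million input / output tokens.
input_total, output_total, cost = agent_loop_cost(
    steps=20,
    base_context_tokens=10_000,
    tokens_added_per_step=2_000,
    visible_output_tokens=300,
    reasoning_tokens=700,
    input_price_per_mtok=3.0,
    output_price_per_mtok=15.0,
)
print(input_total, output_total, round(cost, 2))
```

With these made-up numbers, 580,000 input tokens are billed across 20 steps even though the largest single context sent is 48,000 tokens, and reasoning accounts for 70% of the billed output.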
Cost to complete
A model that is cheap per attempt but needs five attempts can cost more than one that is pricier per attempt but resolves the task on the first run. KiloBench tracks both.
Behavioral fingerprints
Models differ in how they work — some read extensively before writing, others sustain 1,000+ tool calls. These behavior patterns matter for cost and only emerge under harness testing.
Auto Model Routing
Don't want to choose? Let Kilo choose for you.
With Auto Model Routing, you never have to pick a model manually again. Kilo automatically routes each request to the right model for the task — balancing quality, speed, and cost on the fly.
Under the hood, the routing engine is powered by two data sources: KiloBench benchmark scores (the data on this page) and real-world Leaderboard usage signals from hundreds of thousands of developers. Together they give the router a full picture of each model's capabilities and popularity before it makes a decision.
- KiloBench data: Completion %, cost, speed
- Leaderboard signals: Real developer usage
- Auto Model Routing: Best model for each request
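To illustrate how benchmark scores and usage signals could be combined into a single choice, here is a deliberately simple sketch. It is not Kilo's router: the `Candidate` fields, the `route` function, the weighting scheme, the model names, and the usage shares are all hypothetical. Only the completion and cost values are borrowed from the results table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str                # placeholder name, not a real model
    completion: float        # benchmark completion rate, 0..1
    cost_per_attempt: float  # USD
    usage_share: float       # hypothetical share of real-world usage, 0..1


def route(candidates, max_cost, usage_weight=0.2):
    """Pick the best-scoring candidate that fits the cost budget.

    Score is a weighted blend of completion rate and usage share; the
    weighting is an assumption made for this example.
    """
    affordable = [c for c in candidates if c.cost_per_attempt <= max_cost]
    if not affordable:
        raise ValueError("no candidate fits the cost budget")
    return max(
        affordable,
        key=lambda c: (1 - usage_weight) * c.completion
        + usage_weight * c.usage_share,
    )


candidates = [
    Candidate("model-a", 0.742, 72.63, 0.10),
    Candidate("model-b", 0.546, 20.65, 0.30),
    Candidate("model-c", 0.476, 4.92, 0.05),
]

budget_pick = route(candidates, max_cost=25.0)
premium_pick = route(candidates, max_cost=100.0)
print(budget_pick.name, premium_pick.name)
```

A real router would also need a speed signal and per-request context, neither of which this sketch models.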