Proprietary benchmark suite

KiloBench

Performance data for AI coding models officially evaluated by the Kilo team on Terminal Bench 2.0. We run evaluations using our actual agent harness to measure harness-specific pass rates and true production costs — because your benchmark score doesn't pay the API bills.

Cost vs performance across the most capable coding models

Top 10 Most Capable Models

Rank  Model  Completion  Cost per attempt
1            74.2%       $72.63
2            70.1%       $100.51
3            67.6%       $85.19
4            64.7%       $104.49
5            60.7%       $32.94
6            55.1%       $53.37
7            54.6%       $20.65
8            54.4%       $24.84
9            53.0%       $26.21
10           50.6%       $30.70

Official Kilo eval results on Terminal Bench 2.0. Cost and token usage are averaged per complete benchmark attempt.

Official KiloBench Results

Officially published KiloBench scores

Rank  Model  Completion  Cost per attempt
1            74.2%       $72.63
2            70.1%       $100.51
3            67.6%       $85.19
4            64.7%       $104.49
5            60.7%       $32.94
6            55.1%       $53.37
7            54.6%       $20.65
8            54.4%       $24.84
9            53.0%       $26.21
10           50.6%       $30.70
11           49.4%       $23.98
12           47.6%       $4.92
13           47.6%       $10.35
14           46.7%       $19.60
15           44.0%       $15.91
16           28.1%       $30.82
17           25.4%       $0.00
18           19.1%       $101.82
19           15.5%       $0.00
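
One way to read cost against performance in these results is to look for the rows that no other row beats on both completion and cost. The Python sketch below is an illustration only, not part of Kilo's tooling: it copies the published figures, identifies each row by rank, and uses a hypothetical helper named pareto_frontier.

```python
# (rank, completion %, cost per attempt in USD), copied from the results above.
RESULTS = [
    (1, 74.2, 72.63), (2, 70.1, 100.51), (3, 67.6, 85.19), (4, 64.7, 104.49),
    (5, 60.7, 32.94), (6, 55.1, 53.37), (7, 54.6, 20.65), (8, 54.4, 24.84),
    (9, 53.0, 26.21), (10, 50.6, 30.70), (11, 49.4, 23.98), (12, 47.6, 4.92),
    (13, 47.6, 10.35), (14, 46.7, 19.60), (15, 44.0, 15.91), (16, 28.1, 30.82),
    (17, 25.4, 0.00), (18, 19.1, 101.82), (19, 15.5, 0.00),
]


def pareto_frontier(rows):
    """Return the rows that no other row beats on both completion and cost."""
    frontier = []
    best_cost = float("inf")
    # Walk from highest completion down; ties are ordered by lower cost first.
    for rank, completion, cost in sorted(rows, key=lambda r: (-r[1], r[2])):
        # Keep a row only if it is strictly cheaper than every row seen so far,
        # all of which score at least as high.
        if cost < best_cost:
            frontier.append((rank, completion, cost))
            best_cost = cost
    return frontier


for rank, completion, cost in pareto_frontier(RESULTS):
    print(f"rank {rank}: {completion:.1f}% at ${cost:.2f}")
```

Every row outside this frontier is matched or beaten on completion by a row that costs no more, which is one possible starting point for shortlisting. It ignores anything the two columns do not capture, such as speed.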

Results at a glance

How to read KiloBench results

Eval suite: Terminal Bench 2.0
Harness: Kilo's actual agent framework
Tracked metrics: Completion %, cost per attempt

  1. Completion % is the primary signal

    Higher is better. It measures the fraction of benchmark tasks the model completed end-to-end through Kilo's harness — not a generic scaffold.

  2. Cost per attempt reflects real bills

    Sticker per-token pricing tells you almost nothing. These costs include reasoning tokens, cumulative context re-sends, and agent loop overhead from Kilo's actual pipeline.

  3. Bubble size = Kilo popularity

    Larger bubbles mean more real-world developer token usage in Kilo Code last week, so you can see what the community actually reaches for.

  4. Tested end-to-end, not per-task

    Each model is run as a full agent — planning, tool use, multi-step execution, and self-correction — across all 89 Terminal Bench 2.0 tasks per trial.

The benchmark ceiling

Generic benchmarks answer generic questions. We needed more.

Industry leaderboards like SWE-bench Verified are approaching saturation: the top six models are separated by 1.3 percentage points. Scaffold inflation, where vendors tune agent orchestration layers solely to game benchmarks, makes raw scores highly misleading. OpenAI stopped reporting SWE-bench Verified entirely.

No standard leaderboard measures cost per task or harness fit. A model scoring 80% at $2/attempt may be worse value than one scoring 75% at $0.20/attempt. KiloBench exists to settle that trade-off for the framework that matters: yours.
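
The comparison above can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that each attempt succeeds independently with probability equal to the pass rate, so the expected number of attempts per success is 1 / pass rate. The function name is hypothetical, and this is not necessarily how KiloBench computes its own figures.

```python
def expected_cost_per_success(cost_per_attempt: float, pass_rate: float) -> float:
    """Expected spend to get one success, assuming independent retries.

    With success probability p per attempt, the expected number of attempts
    until the first success is 1 / p (geometric distribution).
    """
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_attempt / pass_rate


# The two models from the example in the text.
pricey = expected_cost_per_success(2.00, 0.80)
cheap = expected_cost_per_success(0.20, 0.75)

print(f"80% at $2.00/attempt -> ${pricey:.2f} per success")
print(f"75% at $0.20/attempt -> ${cheap:.2f} per success")
```

Under that assumption the higher-scoring model costs roughly nine times more per successful result, which is why a five-point lead in pass rate does not settle the choice by itself.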

Eval harness: Terminal Bench 2.0

What KiloBench measures that others don't

  1. Harness-specific pass rates

    The same model scores differently depending on which agent framework wraps it. KiloBench measures performance under Kilo's real tools, context pipeline, and retry logic.

  2. True cost per attempt

    Reasoning tokens are billed at output rates but never shown. Agent loops re-send the same context 20+ times. KiloBench captures all of it: the actual bill, not the sticker price.

  3. Cost to complete

    A model that is cheap per trial but needs five attempts is more expensive than one that costs more but resolves the task on the first run. KiloBench tracks both.

  4. Behavioral fingerprints

    Models differ in how they work — some read extensively before writing, others sustain 1,000+ tool calls. These behavior patterns matter for cost and only emerge under harness testing.
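
To see why sticker pricing understates the bill, here is a rough Python sketch of a single agent run. Every number in it is a made-up assumption (token counts, the per-million-token prices, 20 turns), as are the function names. It also assumes no prompt caching and that reasoning tokens are billed as output but not fed back into the context. It is not a model of Kilo's pipeline.

```python
def agent_loop_tokens(system_tokens, turns, output_per_turn,
                      tool_result_per_turn, reasoning_per_turn=0):
    """Return (billed_input_tokens, billed_output_tokens) for one agent run."""
    context = system_tokens
    billed_input = 0
    billed_output = 0
    for _ in range(turns):
        # Each turn re-sends the whole conversation so far as input.
        billed_input += context
        # Visible output and hidden reasoning are both billed at the output rate.
        billed_output += output_per_turn + reasoning_per_turn
        # Assumption: only visible output and tool results rejoin the context.
        context += output_per_turn + tool_result_per_turn
    return billed_input, billed_output


def cost_usd(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Price a token count using per-million-token rates."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


# Hypothetical run: 10k-token starting context, 20 turns.
billed_in, billed_out = agent_loop_tokens(
    system_tokens=10_000, turns=20, output_per_turn=500,
    tool_result_per_turn=1_500, reasoning_per_turn=1_000,
)
actual = cost_usd(billed_in, billed_out, usd_per_m_input=3.0, usd_per_m_output=15.0)

# "Sticker" view: every distinct token counted once, reasoning ignored.
sticker_in = 10_000 + 20 * 1_500
sticker_out = 20 * 500
sticker = cost_usd(sticker_in, sticker_out, usd_per_m_input=3.0, usd_per_m_output=15.0)

print(f"billed: {billed_in} input + {billed_out} output tokens = ${actual:.2f}")
print(f"sticker: {sticker_in} input + {sticker_out} output tokens = ${sticker:.2f}")
```

With these assumed numbers the run bills 580,000 input tokens against 40,000 distinct ones, so the gap comes mostly from re-sent context rather than from the per-token price.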

Auto Model Routing

Don't want to choose? Let Kilo choose for you.

With Auto Model Routing, you never have to pick a model manually again. Kilo automatically routes each request to the right model for the task — balancing quality, speed, and cost on the fly.

Under the hood, the routing engine is powered by two data sources: KiloBench benchmark scores (the data on this page) and real-world Leaderboard usage signals from hundreds of thousands of developers. Together they give the router a full picture of each model's capabilities and popularity before it makes a decision.
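
As a purely illustrative sketch of how benchmark scores and usage signals could be combined, the Python below ranks candidates with a weighted score. This is not Kilo's routing logic: the Candidate fields, the weights, the model names, and the usage shares are all hypothetical, and a real router would also look at the request itself. Only the completion and cost figures are borrowed from the results on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str                # hypothetical model label
    completion: float        # benchmark completion rate, 0..1
    cost_per_attempt: float  # USD
    usage_share: float       # hypothetical share of developer usage, 0..1


def route(candidates, w_quality=0.6, w_cost=0.3, w_usage=0.1):
    """Pick the candidate with the best weighted score (hypothetical weights)."""
    # Guard against division by zero when every candidate is free.
    max_cost = max(c.cost_per_attempt for c in candidates) or 1.0

    def score(c):
        cheapness = 1.0 - c.cost_per_attempt / max_cost  # 1.0 means free
        return (w_quality * c.completion
                + w_cost * cheapness
                + w_usage * c.usage_share)

    return max(candidates, key=score)


candidates = [
    Candidate("model-a", completion=0.742, cost_per_attempt=72.63, usage_share=0.3),
    Candidate("model-b", completion=0.607, cost_per_attempt=32.94, usage_share=0.5),
    Candidate("model-c", completion=0.476, cost_per_attempt=4.92, usage_share=0.2),
]

balanced = route(candidates)
quality_only = route(candidates, w_quality=1.0, w_cost=0.0, w_usage=0.0)
print(balanced.name, quality_only.name)
```

Shifting the weights changes the winner, which is the point of the sketch: the same benchmark data can support different routing decisions depending on how quality, cost, and popularity are traded off.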

KiloBench data: Completion %, cost, speed

Leaderboard signals: Real developer usage

Auto Model Routing: Best model for each request