Proprietary benchmark suite

KiloBench

Performance data for AI coding models officially evaluated by the Kilo team on Terminal Bench 2.0. We run evaluations using our actual agent harness to measure harness-specific pass rates and true production costs — because your benchmark score doesn't pay the API bills.

Anthropic, OpenAI, Google, xAI, DeepSeek, Qwen, Moonshot, Mistral, MiniMax, Tencent Hunyuan, Z.ai, InclusionAI, Kwaipilot, Poolside, StepFun, NVIDIA

Kilo Benchmark

Cost vs performance across the best coding models

Cost vs Performance

Top 10 Most Capable Models

Rank	Model	Completion	Cost per attempt
1	GPT-5.5	74.2%	$72.63
2	Claude Fable 5 ($$$$)	71.0%	$87.52
3	Grok 4.5	70.8%	$27.29
4	GPT-5.6 Sol	70.3%	$58.99
5	Claude Opus 4.7	70.1%	$100.51
6	Claude Opus 4.8	67.6%	$85.19
7	Gemini 3.5 Flash	64.7%	$104.49
8	Kimi K2.7 Code	60.7%	$32.94
9	Muse Spark 1.1	59.8%	$30.15
10	Claude Sonnet 5	59.6%	$36.19

Official Kilo eval results on Terminal Bench 2.0. Cost and token usage are averaged per complete benchmark attempt.

Official KiloBench Results

Officially published KiloBench scores

Rank	Model	Completion	Cost per attempt
1	GPT-5.5	74.2%	$72.63
2	Claude Fable 5 ($$$$)	71.0%	$87.52
3	Grok 4.5	70.8%	$27.29
4	GPT-5.6 Sol	70.3%	$58.99
5	Claude Opus 4.7	70.1%	$100.51
6	Claude Opus 4.8	67.6%	$85.19
7	Gemini 3.5 Flash	64.7%	$104.49
8	Kimi K2.7 Code	60.7%	$32.94
9	Muse Spark 1.1	59.8%	$30.15
10	Claude Sonnet 5	59.6%	$36.19
11	Claude Sonnet 4.6	55.1%	$53.37
12	Qwen3.7 Max (50% off)	54.6%	$20.65
13	Kimi K2.6	54.4%	$24.84
14	GLM 5.2	53.0%	$26.21
15	Grok Build 0.1	50.6%	$30.70
16	KAT-Coder-Pro V2.5	50.3%	$36.16
17	GLM 5.1	49.4%	$23.98
18	Hy3 (free)	47.6%	$0.00
19	MiMo-V2.5-Pro	47.6%	$4.92
20	MiniMax M3	47.6%	$10.35
21	kilo-auto/efficient	46.7%	$19.60
22	DeepSeek V4 Pro	44.0%	$15.91
23	Laguna M.1 (retires Jul 28)	32.6%	$40.55
24	Ling-2.6-1T	28.1%	$30.82
25	Laguna XS 2.1	26.7%	$12.03
26	Laguna M.1 (free)	25.4%	$0.00
27	Nemotron 3 Ultra	19.1%	$101.82
28	Nemotron 3 Super (free)	15.5%	$0.00

Results at a glance

How to read KiloBench results

Eval suite: Terminal Bench 2.0
Harness: Kilo's actual agent framework
Tracked metrics: Completion %, cost per attempt

Completion % is the primary signal
Higher is better. It measures the fraction of benchmark tasks the model completed end-to-end through Kilo's harness — not a generic scaffold.
Cost per attempt reflects real bills
Sticker per-token pricing tells you almost nothing. These costs include reasoning tokens, cumulative context re-sends, and agent loop overhead from Kilo's actual pipeline.
Bubble size = Kilo popularity
Larger bubbles mean more real-world developer token usage in Kilo Code last week, so you can see what the community actually reaches for.
Tested end-to-end, not per-task
Each model is run as a full agent — planning, tool use, multi-step execution, and self-correction — across all 89 Terminal Bench 2.0 tasks per trial.
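
One way to read completion and cost together is to look for models that no other model beats on both axes at once. The sketch below is illustrative only: it copies a handful of rows from the results table rather than the full list, and the `Result` type and `pareto_frontier` helper are hypothetical names, not part of any Kilo tooling.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    completion_pct: float  # Completion column
    cost_usd: float        # Cost per attempt column


# A subset of rows copied from the results table (not the full list).
ROWS = [
    Result("GPT-5.5", 74.2, 72.63),
    Result("Grok 4.5", 70.8, 27.29),
    Result("GPT-5.6 Sol", 70.3, 58.99),
    Result("Claude Opus 4.7", 70.1, 100.51),
    Result("Kimi K2.7 Code", 60.7, 32.94),
    Result("Qwen3.7 Max (50% off)", 54.6, 20.65),
    Result("Hy3 (free)", 47.6, 0.00),
    Result("MiMo-V2.5-Pro", 47.6, 4.92),
    Result("DeepSeek V4 Pro", 44.0, 15.91),
]


def pareto_frontier(rows: list[Result]) -> list[Result]:
    """Keep rows that no cheaper-or-equal row matches or beats on completion."""
    frontier: list[Result] = []
    best_completion = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, best completion first.
    for row in sorted(rows, key=lambda r: (r.cost_usd, -r.completion_pct)):
        if row.completion_pct > best_completion:
            frontier.append(row)
            best_completion = row.completion_pct
    return frontier


for row in pareto_frontier(ROWS):
    print(f"{row.model}: {row.completion_pct}% at ${row.cost_usd:.2f}")
```

Every row left off the frontier has a cheaper or equally priced alternative in this subset that scores at least as well, which is a reasonable first filter before weighing other factors such as speed.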

The benchmark ceiling

Generic benchmarks answer generic questions. We needed more.

Industry leaderboards like SWE-bench Verified are approaching saturation: the top six models are separated by 1.3 percentage points. Scaffold inflation, where vendors tune agent orchestration layers solely to game benchmarks, makes raw scores highly misleading. OpenAI stopped reporting SWE-bench Verified entirely.

No standard leaderboard measures cost per task or harness fit. A model scoring 80% at $2/attempt may be a worse choice than one scoring 75% at $0.20/attempt, since the extra five points cost ten times as much. KiloBench exists to settle that trade-off for the framework that matters: yours.
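
The trade-off above can be made concrete by dividing spend by the number of tasks actually solved. This is a minimal sketch, assuming an "attempt" is one full pass over the 89 Terminal Bench 2.0 tasks and that completion is the fraction of those tasks solved; `cost_per_solved_task` is a hypothetical helper, not a metric KiloBench publishes.

```python
TASKS_PER_ATTEMPT = 89  # Terminal Bench 2.0 task count stated on this page


def cost_per_solved_task(cost_per_attempt: float, completion: float,
                         tasks: int = TASKS_PER_ATTEMPT) -> float:
    """Average spend for each task the model actually completed."""
    if not 0 < completion <= 1:
        raise ValueError("completion must be a fraction in (0, 1]")
    return cost_per_attempt / (completion * tasks)


# The two hypothetical models from the paragraph above.
pricey = cost_per_solved_task(2.00, 0.80)
cheap = cost_per_solved_task(0.20, 0.75)
print(f"pricey/cheap cost ratio per solved task: {pricey / cheap:.1f}x")

# The same calculation applied to two rows of the results table.
print(f"GPT-5.5:  ${cost_per_solved_task(72.63, 0.742):.2f} per solved task")
print(f"Grok 4.5: ${cost_per_solved_task(27.29, 0.708):.2f} per solved task")
```

The ratio between two models does not depend on the task count, so the comparison holds even if an attempt is defined differently.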

Eval suite: Terminal Bench 2.0

What KiloBench measures that others don't

Harness-specific pass rates
The same model scores differently depending on which agent framework wraps it. KiloBench measures performance under Kilo's real tools, context pipeline, and retry logic.
True cost per attempt
Reasoning tokens are billed at output rates but never shown. Agent loops re-send the same context 20+ times. KiloBench captures all of it: the actual bill, not the sticker price.
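
To see why re-sent context dominates the bill, consider a rough model of an agent loop. Everything here is an assumption made for illustration: each turn re-sends the whole context, the context grows by a fixed amount per turn, there is no prompt caching, and the token counts and per-million-token prices are invented rather than taken from any provider or from Kilo's pipeline.

```python
def agent_loop_cost(turns: int,
                    base_context_tokens: int,
                    new_tokens_per_turn: int,
                    visible_output_tokens: int,
                    reasoning_tokens: int,
                    input_usd_per_mtok: float,
                    output_usd_per_mtok: float) -> tuple[int, int, float]:
    """Return (input tokens billed, output tokens billed, total USD)."""
    # Turn i re-sends the base context plus everything accumulated so far.
    input_tokens = sum(base_context_tokens + i * new_tokens_per_turn
                       for i in range(turns))
    # Hidden reasoning tokens are billed at the output rate alongside visible output.
    output_tokens = turns * (visible_output_tokens + reasoning_tokens)
    cost = (input_tokens * input_usd_per_mtok
            + output_tokens * output_usd_per_mtok) / 1_000_000
    return input_tokens, output_tokens, cost


# Hypothetical 20-turn loop at $3 / $15 per million input / output tokens.
input_tokens, output_tokens, cost = agent_loop_cost(
    turns=20,
    base_context_tokens=10_000,
    new_tokens_per_turn=2_000,
    visible_output_tokens=500,
    reasoning_tokens=1_500,
    input_usd_per_mtok=3.0,
    output_usd_per_mtok=15.0,
)
print(f"input tokens billed:  {input_tokens:,}")
print(f"output tokens billed: {output_tokens:,}")
print(f"total cost:           ${cost:.2f}")
```

In this toy setup the final context is 48,000 tokens, yet 580,000 input tokens are billed, because input volume grows roughly with the square of the turn count.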
Cost to complete
A model that is cheap per trial but needs five attempts is more expensive than one that costs more but resolves the task on the first run. KiloBench tracks both.
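
The retry effect can be estimated with a simple probability model. This is a sketch under stated assumptions, not how KiloBench computes its figures: trials are treated as independent with a fixed success probability, retries stop at the first success or at a cap, and the prices and probabilities below are invented.

```python
def expected_cost_to_complete(cost_per_trial: float,
                              success_prob: float,
                              max_trials: int) -> tuple[float, float]:
    """Return (expected spend in USD, probability of finishing within the cap)."""
    if not 0 < success_prob <= 1:
        raise ValueError("success_prob must be in (0, 1]")
    fail = 1.0 - success_prob
    # Trial k+1 only runs if the first k trials all failed.
    expected_trials = sum(fail ** k for k in range(max_trials))
    prob_completed = 1.0 - fail ** max_trials
    return cost_per_trial * expected_trials, prob_completed


# Hypothetical: a $1 model that succeeds one time in five vs. a $3 model at 90%.
cheap_cost, cheap_done = expected_cost_to_complete(1.00, 0.20, max_trials=10)
pricey_cost, pricey_done = expected_cost_to_complete(3.00, 0.90, max_trials=10)
print(f"cheap model:  ${cheap_cost:.2f} expected, {cheap_done:.0%} finish rate")
print(f"pricey model: ${pricey_cost:.2f} expected, {pricey_done:.0%} finish rate")
```

Under these numbers the model with the lower per-trial price ends up with the higher expected spend and still fails to finish more often within the cap.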
Behavioral fingerprints
Models differ in how they work — some read extensively before writing, others sustain 1,000+ tool calls. These behavior patterns matter for cost and only emerge under harness testing.

Auto Model Routing

Don't want to choose? Let Kilo choose for you.

With Auto Model Routing, you never have to pick a model manually again. Kilo automatically routes each request to the right model for the task — balancing quality, speed, and cost on the fly.

Under the hood, the routing engine is powered by two data sources: KiloBench benchmark scores (the data on this page) and real-world Leaderboard usage signals from hundreds of thousands of developers. Together they give the router a full picture of each model's capabilities and popularity before it makes a decision.

KiloBench data: Completion %, cost, speed
Leaderboard signals: Real developer usage
Auto Model Routing: Best model for each request
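
As a purely hypothetical illustration of combining benchmark scores with usage signals, the toy router below ranks candidates with a weighted score under a cost ceiling. It is not Kilo's routing algorithm: the `Candidate` type, the `route` function, the weights, and the usage shares are all invented for this sketch. Only the completion and cost figures come from the results table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    model: str
    completion: float   # fraction of tasks completed (from the results table)
    cost_usd: float     # cost per attempt (from the results table)
    usage_share: float  # invented popularity signal in [0, 1]


def route(candidates: list[Candidate], max_cost_usd: float,
          w_quality: float = 0.7, w_cost: float = 0.2,
          w_usage: float = 0.1) -> Candidate:
    """Pick the best-scoring candidate whose cost fits under the ceiling."""
    eligible = [c for c in candidates if c.cost_usd <= max_cost_usd]
    if not eligible:
        raise ValueError("no candidate fits the cost ceiling")
    # Normalise cost against the priciest eligible model; guard against all-free pools.
    top_cost = max(c.cost_usd for c in eligible) or 1.0

    def score(c: Candidate) -> float:
        return (w_quality * c.completion
                - w_cost * (c.cost_usd / top_cost)
                + w_usage * c.usage_share)

    return max(eligible, key=score)


CANDIDATES = [
    Candidate("GPT-5.5", 0.742, 72.63, usage_share=0.10),
    Candidate("Grok 4.5", 0.708, 27.29, usage_share=0.15),
    Candidate("Hy3 (free)", 0.476, 0.00, usage_share=0.05),
]

print(route(CANDIDATES, max_cost_usd=100.0).model)
print(route(CANDIDATES, max_cost_usd=10.0).model)
```

A real router would also need per-request signals such as task type and latency targets, which this sketch leaves out.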
