Live data from PinchBench · updated daily

Best Models for OpenClaw

Real-world benchmark data from PinchBench: 23 standardised OpenClaw tasks, scored on success rate, speed, and cost.

Official runs only · 41 models benchmarked

23 real-world agent tasks · Automated + LLM judge grading · Official runs via PinchBench, made by the Kilo team

Full rankings by average score

Sorted by average success rate across all official PinchBench runs.

| Rank | Model | Provider | Best | Avg | Avg time |
|------|-------|----------|------|-----|----------|
| 1 🥇 | mimo-v2.5 | xiaomi | 89.5% | 88.7% | 204m 14s |
| 2 🥈 | mimo-v2.5-pro | xiaomi | 89.5% | 87.7% | 252m 3s |
| 3 🥉 | ling-2.6-1t | inclusionai | 82.6% | 82.6% | 239m 57s |
| 4 | gemini-3.1-pro-preview | google | 82.9% | 80.8% | 163m 38s |
| 5 | gemini-3.1-flash-lite | google | 80.5% | 80.5% | 93m 54s |
| 6 | kimi-k2.6 | moonshotai | 79.8% | 79.8% | 31m 10s |
| 7 | grok-4.20 | x-ai | 84.6% | 79.7% | 210m 48s |
| 8 | step-3.5-flash | stepfun | 84.7% | 79.5% | 238m 57s |
| 9 | gpt-5.4-mini | openai | 82.4% | 78.8% | 197m 43s |
| 10 | deepseek-v4-flash | deepseek | 84.6% | 77.7% | 263m 21s |
| 11 | gpt-5.5 | openai | 100.0% | 75.7% | 190m 53s |
| 12 | gpt-5.4 | openai | 86.0% | 75.6% | 270m 15s |
| 13 | claude-opus-4.7 | anthropic | 91.6% | 74.6% | 259m 39s |
| 14 | glm-5.1 | z-ai | 82.8% | 73.1% | 293m 30s |
| 15 | grok-4.3 | x-ai | 83.1% | 72.2% | 201m 37s |
| 16 | gemini-3-flash-preview | google | 79.3% | 71.4% | 207m 51s |
| 17 | seed-2.0-lite | bytedance-seed | 86.2% | 71.1% | 262m 23s |
| 18 | claude-opus-4.6 | anthropic | 88.9% | 71.0% | 288m 2s |
| 19 | glm-5-turbo | z-ai | 86.3% | 70.0% | 296m 25s |
| 20 | qwen3.6-max-preview | qwen | 69.3% | 69.3% | 19m 39s |

How to choose

What matters for always-on agents

Average score over best score

A model that scores 90% once and 60% the next time isn't production-ready. For 24/7 agents, the average score across many runs is the metric that matters, not a lucky peak.
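As a rough illustration of why the average tells you more than the peak, the sketch below summarises two sets of run scores. The score lists are hypothetical, built around the 90%-then-60% example above, and `summarize_runs` is an illustrative helper, not part of PinchBench.

```python
from statistics import mean

def summarize_runs(scores):
    """Return the best and the average score for a list of run scores (percent)."""
    return {"best": max(scores), "avg": mean(scores)}

# Hypothetical run scores, not PinchBench data.
spiky = summarize_runs([90.0, 60.0])   # one lucky peak, one poor run
steady = summarize_runs([76.0, 74.0])  # lower peak, similar runs

print(spiky)   # {'best': 90.0, 'avg': 75.0}
print(steady)  # {'best': 76.0, 'avg': 75.0}
```

Both hypothetical models average 75%, yet ranking by best score alone would put the spiky one well ahead.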

Execution time per task

Slow models create bottlenecks when your agent handles hundreds of daily automations. Under 5 minutes per task is a practical target for interactive workflows.

Cost per run vs cost per month

An agent that fires 50 tasks a day accumulates real inference spend. Weigh per-run cost against your task volume; free or cheap models might trade quality for savings.
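The per-run versus per-month arithmetic can be sketched in a few lines. The 50 tasks a day comes from the paragraph above; the price per task and the 30-day month are hypothetical values picked for illustration, not PinchBench figures.

```python
def monthly_cost(tasks_per_day, cost_per_task, days=30):
    """Estimate monthly inference spend in dollars for a fixed daily task volume."""
    return tasks_per_day * days * cost_per_task

# Assumption: $0.03 per task is a made-up price, used only to show the calculation.
estimate = monthly_cost(tasks_per_day=50, cost_per_task=0.03)
print(f"${estimate:.2f} per month")
```

Swapping in your own volume and the per-task price you actually observe shows quickly whether a cheaper model's savings are large enough to justify a lower success rate.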

Model spotlights

Best model for your use case

Always-on agents have different needs. Pick the right model for your workload.

Best overall
Top pick

Highest average success rate

88.7% avg score
mimo-v2.5
xiaomi

The most reliable choice for complex, multi-step agent tasks that run 24/7. Handles ambiguous instructions, tool-calling chains, and long-horizon planning consistently well across many runs, not just in a lucky single attempt.

Fastest

Lowest average execution time

31m 10s avg per task
kimi-k2.6
moonshotai

When your agent fires dozens of automations per day (triaging email, checking CI, queuing reminders), response latency compounds fast. This model delivers strong accuracy without the wait, keeping your agent snappy for real-time workflows.

Best value

Highest score-per-dollar

79.5% avg score · $0.788/run
step-3.5-flash
stepfun

Running an always-on agent means paying for inference every time it acts. This model gives you the best success rate per dollar spent, which is ideal when you're running high-volume automation pipelines or keeping costs predictable on a budget.
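One plausible reading of "score-per-dollar" is the average score divided by the cost per run; the page does not spell out the exact formula, so treat this as an assumption. The step-3.5-flash figures below are the ones quoted above, while `example-premium-model` and its numbers are hypothetical, included only to give the comparison a second entry.

```python
def score_per_dollar(avg_score_pct, cost_per_run_usd):
    """Average success rate (percent) earned per dollar of inference spend."""
    return avg_score_pct / cost_per_run_usd

candidates = {
    "step-3.5-flash": (79.5, 0.788),        # figures quoted on this page
    "example-premium-model": (88.0, 2.00),  # hypothetical, for comparison only
}

value = {name: score_per_dollar(score, cost) for name, (score, cost) in candidates.items()}
best_value = max(value, key=value.get)
print(best_value, round(value[best_value], 1))
```

Under this definition a model with a lower average score can still win on value when its cost per run is low enough.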

Most consistent

Smallest best-vs-average gap

0.8-point gap between best and average score
mimo-v2.5
xiaomi

An agent you can set and forget needs to perform the same every time, not just occasionally. This model's scores stay tight across many independent runs, making it the safest choice when reliability matters more than chasing a peak score.
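The best-vs-average gap is simple to reproduce from the leaderboard figures. The sketch below uses four (best, avg) pairs copied from the rankings on this page; the helper name is illustrative, and the gap is a difference in percentage points rather than a statistical variance.

```python
# (best %, avg %) pairs copied from the rankings on this page.
leaderboard = {
    "mimo-v2.5": (89.5, 88.7),
    "mimo-v2.5-pro": (89.5, 87.7),
    "gpt-5.5": (100.0, 75.7),
    "claude-opus-4.7": (91.6, 74.6),
}

def best_avg_gap(best, avg):
    """Gap in percentage points between the best run and the average run."""
    return round(best - avg, 1)

gaps = {name: best_avg_gap(best, avg) for name, (best, avg) in leaderboard.items()}
most_consistent = min(gaps, key=gaps.get)
print(most_consistent, gaps[most_consistent])  # mimo-v2.5 0.8
```

A gap of zero is not automatically a sign of consistency: it is also what a model with only a single recorded run would show, so the number of runs behind each score is worth checking.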

🦀

Try any of these models in your KiloClaw agent in one click

KiloClaw connects to 500+ models through Kilo Gateway at 0% markup. You can swap your OpenClaw model without touching config files.

Deploy your agent now

Firecracker VM, 500+ models, shell access, and a headless browser, all provisioned in under 5 minutes.

Start Free Trial