Live data from PinchBench · updated daily

Best Models for OpenClaw

Real-world benchmark data from PinchBench — 23 standardised OpenClaw tasks, scored on success rate, speed, and cost.

Official runs only · 50 models benchmarked

23 real-world agent tasks · Automated + LLM judge grading · Official runs via PinchBench, made by the Kilo team


Full rankings by average score

Sorted by average success rate across all official PinchBench runs.

| Rank | Model | Provider | Best | Avg | Avg time |
|------|-------|----------|------|-----|----------|
| 1 🥇 | claude-opus-4.8-fast | anthropic | 94.5% | 93.5% | 152m 49s |
| 2 🥈 | qwen3.7-max | qwen | 93.4% | 92.5% | 200m 5s |
| 3 🥉 | kimi-k2.6 | moonshotai | 100.0% | 91.0% | 1m 19s |
| 4 | claude-opus-4.8 | anthropic | 91.8% | 90.5% | 244m 50s |
| 5 | nemotron-3-ultra-550b-a55b | nvidia | 90.6% | 89.9% | 150m 34s |
| 6 | mimo-v2.5 | xiaomi | 91.9% | 89.7% | 197m 30s |
| 7 | grok-build-0.1 | x-ai | 92.1% | 88.9% | 220m 40s |
| 8 | qwen3.6-flash | qwen | 89.1% | 88.1% | 222m 53s |
| 9 | mimo-v2.5-pro | xiaomi | 89.5% | 87.5% | 251m 19s |
| 10 | qwen3.6-plus-preview | qwen | 88.6% | 84.0% | 22m 36s |
| 11 | gemini-3-pro-preview | google | 95.1% | 83.4% | 14m 54s |
| 12 | qwen3-coder-next | qwen | 86.0% | 83.3% | 6m 49s |
| 13 | claude-sonnet-4.5 | anthropic | 94.7% | 83.1% | 17m 14s |
| 14 | minimax-m2.1 | minimax | 95.1% | 82.7% | 19m 51s |
| 15 | deepseek-v4-flash | deepseek | 91.5% | 81.7% | 281m 38s |
| 16 | claude-opus-4.5 | anthropic | 94.7% | 81.1% | 17m 7s |
| 17 | qwen3.5-397b-a17b | qwen | 89.1% | 80.5% | 18m 16s |
| 18 | mimo-v2-pro | xiaomi | 87.4% | 80.0% | 21m 44s |
| 19 | qwen3.5-27b | qwen | 90.0% | 79.2% | 19m 30s |
| 20 | glm-4.5-air | z-ai | 87.3% | 78.2% | 26m 16s |

How to choose

What matters for always-on agents

Average score over best score

A model that scores 90% once and 60% next time isn't production-ready. For 24/7 agents, average score across many runs is the metric that matters — not a lucky peak.
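To make the difference between a peak and an average concrete, here is a minimal Python sketch. The per-run scores are hypothetical illustrations, not PinchBench data, and the helper name `summarize` is an assumption introduced only for this example.

```python
from statistics import mean


def summarize(scores):
    """Return (best, average, gap) for a list of per-run success rates in percent."""
    best = max(scores)
    avg = mean(scores)
    return best, avg, best - avg


# Hypothetical per-run scores (percent), invented for illustration only.
spiky = [90.0, 60.0, 75.0]   # one strong run, then much weaker ones
steady = [80.0, 78.0, 79.0]  # lower peak, but little run-to-run drift

print(summarize(spiky))   # (90.0, 75.0, 15.0)
print(summarize(steady))  # (80.0, 79.0, 1.0)
```

The spiky model has the higher best score, yet the steady one has the higher average and a far smaller gap, which is the property an always-on agent depends on.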

Execution time per task

Slow models create bottlenecks when your agent handles hundreds of daily automations. Under 5 minutes per task is a practical target for interactive workflows.

Cost per run vs cost per month

An agent that fires 50 tasks a day accumulates real inference spend. Weigh per-run cost against your task volume — free or cheap models might trade quality for savings.
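As a rough back-of-the-envelope sketch, monthly spend can be estimated from task volume and per-run cost. The function below is hypothetical and assumes that each task is billed as one run and that a month has 30 days; the 50 tasks a day and $0.010/run figures are the ones quoted on this page.

```python
def monthly_cost(tasks_per_day, cost_per_run, days=30):
    """Estimate monthly inference spend in dollars.

    Assumes one billed run per task and a flat 30-day month by default.
    """
    return tasks_per_day * cost_per_run * days


# 50 tasks a day at $0.010 per run, both figures taken from this page.
estimate = monthly_cost(50, 0.010)
print(f"${estimate:.2f} per month")  # $15.00 per month
```

The same function makes it easy to see how the total scales: ten times the per-run cost or ten times the volume gives ten times the monthly bill.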

Model spotlights

Best model for your use case

Always-on workloads differ in what they need. Pick the model that fits yours.

Best overall
Top pick

Highest average success rate

93.5% avg score
claude-opus-4.8-fast
anthropic

The most reliable choice for complex, multi-step agent tasks that run 24/7. Handles ambiguous instructions, tool-calling chains, and long-horizon planning consistently well across many runs — not just in a lucky single attempt.

Fastest

Lowest average execution time

1m 19s avg per task
kimi-k2.6
moonshotai

When your agent fires dozens of automations per day — triaging email, checking CI, queuing reminders — response latency compounds fast. This model delivers strong accuracy without the wait, keeping your agent snappy for real-time workflows.

Best value

Highest score-per-dollar

91.0% avg score · $0.010/run
kimi-k2.6
moonshotai

Running an always-on agent means paying for inference every time it acts. This model gives you the best success rate per dollar spent — ideal when you're running high-volume automation pipelines or keeping costs predictable on a budget.

Most consistent

Smallest best-vs-average gap

0.7 percentage points between best and average score
nemotron-3-ultra-550b-a55b
nvidia

An agent you can set and forget needs to perform the same every time, not just occasionally. This model's scores stay tight across many independent runs — making it the safest choice when reliability matters more than chasing a peak score.
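The best-vs-average gap can be recomputed from the ranking figures. The sketch below copies the best and average scores of the top five models from the ranking above; the helper names are hypothetical, and this is only one plausible way to derive the number, not necessarily how PinchBench calculates it.

```python
# (model, best %, average %) copied from the top five rows of the ranking.
rows = [
    ("claude-opus-4.8-fast", 94.5, 93.5),
    ("qwen3.7-max", 93.4, 92.5),
    ("kimi-k2.6", 100.0, 91.0),
    ("claude-opus-4.8", 91.8, 90.5),
    ("nemotron-3-ultra-550b-a55b", 90.6, 89.9),
]


def best_avg_gap(best, avg):
    """Gap between best and average score, in percentage points (1 decimal)."""
    return round(best - avg, 1)


def most_consistent(rows):
    """Return the row with the smallest best-vs-average gap."""
    return min(rows, key=lambda r: r[1] - r[2])


name, best, avg = most_consistent(rows)
print(name, best_avg_gap(best, avg))  # nemotron-3-ultra-550b-a55b 0.7
```

Note that kimi-k2.6 has the highest best score in this sample but a 9.0-point gap, so a high peak says little about consistency.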

🦀

Try any of these models in your KiloClaw agent — in one click

KiloClaw connects to 500+ models through Kilo Gateway at 0% markup. You can swap your OpenClaw model without touching config files.

Deploy your agent now

Firecracker VM, 500+ models, shell access, headless browser — provisioned in under 5 minutes.
