Live data from PinchBench · updated daily

Best Models for OpenClaw

Real-world benchmark data from PinchBench: 23 standardised OpenClaw tasks, scored on success rate, speed, and cost.

Official runs only · 12 models benchmarked

23 real-world agent tasks · Automated + LLM judge grading · Official runs via PinchBench, made by the Kilo team


Full rankings by average score

Sorted by average success rate across all official PinchBench runs.

| Rank | Model | Provider | Best score | Avg score | Avg time | Source |
|------|-------|----------|-----------|-----------|----------|--------|
| 🥇 | trinity-large-thinking | arcee-ai | 91.9% | 91.9% | 11m 18s | PinchBench |
| 🥈 | claude-opus-4-6 | anthropic | 100.0% | 89.6% | 12m 11s | PinchBench |
| 🥉 | grok-4.20-beta-0309-non-reasoning | xai | 100.0% | 86.7% | 10m 31s | PinchBench |
| 4 | claude-sonnet-4-6 | anthropic | 95.0% | 84.9% | 5m 8s | PinchBench |
| 5 | claude-opus-4.6 | anthropic | 96.0% | 80.7% | 25m 54s | PinchBench |
| 6 | qwen3.5-27b | qwen | 92.3% | 80.1% | 19m 34s | PinchBench |
| 7 | minimax-m2.7 | minimax | 91.9% | 79.7% | 26m 53s | PinchBench |
| 8 | auto | openrouter | 100.0% | 67.0% | 4m 31s | PinchBench |
| 9 | gemini-3-flash-preview | google | 97.8% | 65.9% | 24m 46s | PinchBench |
| 10 | claude-haiku-4-5 | anthropic | 98.3% | 64.8% | 4m 45s | PinchBench |
| 11 | gpt-5.4 | openai | 95.4% | 59.5% | 34m 15s | PinchBench |
| 12 | qwen3-235b-a22b-thinking-2507 | qwen | 100.0% | 54.6% | 25m 34s | PinchBench |

How to choose

What matters for always-on agents

Average score over best score

A model that scores 90% once and 60% next time isn't production-ready. For 24/7 agents, average score across many runs is the metric that matters, not a lucky peak.
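
As a quick sanity check, both statistics can be computed from a model's repeated runs. The per-run scores below are invented for illustration, not PinchBench data.

```python
# Illustrative sketch: best vs. average success rate over repeated runs.
# The run scores are made-up numbers, not PinchBench results.
run_scores = [0.90, 0.60, 0.85, 0.72, 0.88]  # one entry per independent run

best = max(run_scores)
avg = sum(run_scores) / len(run_scores)

print(f"best run: {best:.1%}")        # 90.0% -- looks production-ready
print(f"average:  {avg:.1%}")         # 79.0% -- what a 24/7 agent actually gets
print(f"gap:      {best - avg:.1%}")  # 11.0%
```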

Execution time per task

Slow models create bottlenecks when your agent handles hundreds of daily automations. Under 5 minutes per task is a practical target for interactive workflows.
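
To see how latency compounds, here is a rough back-of-the-envelope calculation; the daily task volume is an assumed workload, not a benchmark figure.

```python
# Rough sketch: sequential agent-hours per day at different per-task latencies.
# tasks_per_day is an assumption, not a PinchBench measurement.
tasks_per_day = 200

for avg_minutes in (5, 12, 25):
    agent_hours = tasks_per_day * avg_minutes / 60
    print(f"{avg_minutes:>2} min/task -> {agent_hours:5.1f} agent-hours/day")
# 5 min/task stays under a day of sequential runtime (~16.7 h);
# 25 min/task needs ~83 h, so work queues up unless you parallelise.
```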

Cost per run vs cost per month

An agent that fires 50 tasks a day accumulates real inference spend. Weigh per-run cost against your task volume; free or cheap models might trade quality for savings.
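
A minimal sketch of that trade-off, assuming the 50-tasks-a-day volume above and the $0.111/run figure quoted for minimax-m2.7 further down:

```python
# Rough sketch: per-run cost scaled to monthly spend.
# 50 tasks/day is the example volume above; $0.111/run is minimax-m2.7's quoted cost.
tasks_per_day = 50
cost_per_run = 0.111  # USD

daily_cost = tasks_per_day * cost_per_run
monthly_cost = daily_cost * 30
print(f"${daily_cost:.2f}/day -> ${monthly_cost:.2f}/month")  # $5.55/day -> $166.50/month
```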

Model spotlights

Best model for your use case

Always-on agents have different needs. Pick the right model for your workload.

Best overall
Top pick

Highest average success rate

91.9% avg score
trinity-large-thinking
arcee-ai

The most reliable choice for complex, multi-step agent tasks that run 24/7. Handles ambiguous instructions, tool-calling chains, and long-horizon planning consistently well across many runs, not just in a lucky single attempt.

Try in KiloClaw
Fastest

Lowest average execution time

5m 8s avg per task
claude-sonnet-4-6
anthropic

When your agent fires dozens of automations per day (triaging email, checking CI, queuing reminders), response latency compounds fast. This model delivers strong accuracy without the wait, keeping your agent snappy for real-time workflows.

Try in KiloClaw
Best value

Highest score-per-dollar

79.7% avg score · $0.111/run
minimax-m2.7
minimax

Running an always-on agent means paying for inference every time it acts. This model gives you the best success rate per dollar spent, ideal when you're running high-volume automation pipelines or keeping costs predictable on a budget.

Try in KiloClaw
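
Score-per-dollar is simply average score divided by cost per run. In the sketch below, only minimax-m2.7's figures come from this page; the other two entries and their costs are invented placeholders for comparison.

```python
# Rough sketch: ranking models by score-per-dollar.
# Only minimax-m2.7's numbers are from this page; the others are invented placeholders.
candidates = {
    "minimax-m2.7": (79.7, 0.111),        # (avg score %, USD per run)
    "placeholder-premium": (85.0, 0.60),  # scores higher, costs far more
    "placeholder-budget": (65.0, 0.10),   # cheaper, but weaker average score
}

for name, (score, cost) in sorted(candidates.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True):
    print(f"{name:22s} {score / cost:7.1f} score points per dollar")
# minimax-m2.7 leads at ~718 points per dollar in this comparison.
```
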
Most consistent

Smallest best-vs-average gap

10.1% best-vs-average gap
claude-sonnet-4-6
anthropic

An agent you can set and forget needs to perform the same every time, not just occasionally. This model's scores stay tight across many independent runs, making it the safest choice when reliability matters more than chasing a peak score.

Try in KiloClaw
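
The figure on this card is just best score minus average score. A quick check, using two rows from the rankings table above:

```python
# Quick check: best-vs-average gap, computed from the rankings table above.
gaps = {
    "claude-sonnet-4-6": 95.0 - 84.9,               # 10.1 points -- tight
    "qwen3-235b-a22b-thinking-2507": 100.0 - 54.6,  # 45.4 points -- one perfect run, weak average
}
for model, gap in gaps.items():
    print(f"{model}: {gap:.1f}-point gap")
```
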
🦀

Try any of these models in your KiloClaw agent, in one click

KiloClaw connects to 500+ models through Kilo Gateway at 0% markup. You can swap your OpenClaw model without touching config files.

Deploy your agent now

Firecracker VM, 500+ models, shell access, headless browser; provisioned in under 5 minutes.

Start Free Trial