Real-world benchmark data from PinchBench — 23 standardised OpenClaw tasks, scored on success rate, speed, and cost.
Official runs only · 50 models benchmarked
Sorted by average success rate across all official PinchBench runs.
claude-opus-4.8-fast (anthropic)
qwen3.7-max (qwen)
kimi-k2.6 (moonshotai)
claude-opus-4.8 (anthropic)
nemotron-3-ultra-550b-a55b (nvidia)
mimo-v2.5 (xiaomi)
grok-build-0.1 (x-ai)
qwen3.6-flash (qwen)
mimo-v2.5-pro (xiaomi)
qwen3.6-plus-preview (qwen)
gemini-3-pro-preview
qwen3-coder-next (qwen)
claude-sonnet-4.5 (anthropic)
minimax-m2.1 (minimax)
deepseek-v4-flash (deepseek)
claude-opus-4.5 (anthropic)
qwen3.5-397b-a17b (qwen)
mimo-v2-pro (xiaomi)
qwen3.5-27b (qwen)
glm-4.5-air (z-ai)
How to choose
A model that scores 90% once and 60% next time isn't production-ready. For 24/7 agents, average score across many runs is the metric that matters — not a lucky peak.
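To make the average-versus-peak point concrete, here is a minimal Python sketch. The model names and run scores are invented for illustration and are not PinchBench results; the only assumption is that each run yields a success rate between 0 and 1.

```python
from statistics import mean, pstdev

# Hypothetical success rates over five runs each.
# Illustrative numbers only, not PinchBench data.
runs = {
    "model-a": [0.90, 0.60, 0.85, 0.65, 0.75],  # high peak, wide swings
    "model-b": [0.80, 0.78, 0.82, 0.79, 0.81],  # lower peak, tight spread
}

def summarise(scores):
    """Return the average, the best single run, and the run-to-run spread."""
    return {
        "average": round(mean(scores), 3),
        "peak": max(scores),
        "spread": round(pstdev(scores), 3),  # population standard deviation
    }

summary = {name: summarise(scores) for name, scores in runs.items()}

# Ranking by peak and ranking by average can disagree.
best_by_peak = max(summary, key=lambda name: summary[name]["peak"])
best_by_average = max(summary, key=lambda name: summary[name]["average"])

for name, stats in summary.items():
    print(name, stats)
```

In this made-up data, ranking by peak favours model-a while ranking by average favours model-b, which also has the smaller spread. That is the pattern to look for when an agent has to behave the same way every day.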
Slow models create bottlenecks when your agent handles hundreds of daily automations. Under 5 minutes per task is a practical target for interactive workflows.
An agent that fires 50 tasks a day accumulates real inference spend. Weigh per-run cost against your task volume — free or cheap models might trade quality for savings.
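A rough way to weigh per-run cost against task volume is to project it over a month. The sketch below uses the 50 tasks a day from the paragraph above; the per-run prices and the 30-day month are assumptions chosen for illustration, not PinchBench figures.

```python
def monthly_spend(tasks_per_day, cost_per_run_usd, days=30):
    """Projected inference spend in USD over `days` days."""
    return tasks_per_day * cost_per_run_usd * days

TASKS_PER_DAY = 50  # volume taken from the text above

# Hypothetical per-run costs in USD (illustrative only).
cheap_run = 0.02
premium_run = 0.40

cheap_month = monthly_spend(TASKS_PER_DAY, cheap_run)
premium_month = monthly_spend(TASKS_PER_DAY, premium_run)

print(f"cheap:   ${cheap_month:.2f} per month")
print(f"premium: ${premium_month:.2f} per month")
```

The gap between the two totals is what a lower success rate has to justify, so it is worth rerunning the numbers with your own volume and prices.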
Model spotlights
Always-on agents have different needs. Pick the right model for your workload.
The most reliable choice for complex, multi-step agent tasks that run 24/7. Handles ambiguous instructions, tool-calling chains, and long-horizon planning consistently well across many runs — not just in a lucky single attempt.
When your agent fires dozens of automations per day (triaging email, checking CI, queuing reminders), response latency compounds fast. This model delivers strong accuracy without the wait, keeping your agent snappy for real-time workflows.
Running an always-on agent means paying for inference every time it acts. This model gives you the best success rate per dollar spent, which is ideal when you're running high-volume automation pipelines or keeping costs predictable on a budget.
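"Success rate per dollar" can be sketched as a simple ratio. The page does not state how the spotlight is actually computed, so the formula, the model names, and the numbers below are all assumptions for illustration. The optional quality floor reflects the earlier caution that very cheap models may trade quality for savings.

```python
# Hypothetical (average success rate, average cost per run in USD) pairs.
# Illustrative only, not PinchBench data.
models = {
    "premium-model": (0.90, 0.50),
    "mid-model": (0.85, 0.10),
    "budget-model": (0.60, 0.02),
}

def success_per_dollar(success_rate, cost_per_run_usd):
    """Success rate bought per dollar; free models rank as unbounded value."""
    if cost_per_run_usd == 0:
        return float("inf")
    return success_rate / cost_per_run_usd

def best_value(candidates, min_success=0.0):
    """Pick the best ratio among models that meet the quality floor."""
    eligible = {
        name: stats
        for name, stats in candidates.items()
        if stats[0] >= min_success
    }
    if not eligible:
        raise ValueError("no model meets the minimum success rate")
    return max(eligible, key=lambda name: success_per_dollar(*eligible[name]))

print(best_value(models))                    # no quality floor
print(best_value(models, min_success=0.80))  # require at least 80% success
```

With no floor the cheapest model wins on ratio alone. Setting a minimum success rate first, then comparing value, keeps a low-quality bargain from topping the list.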
An agent you can set and forget needs to perform the same every time, not just occasionally. This model's scores stay tight across many independent runs, making it the safest choice when reliability matters more than chasing a peak score.
KiloClaw connects to 500+ models through Kilo Gateway at 0% markup. You can swap your OpenClaw model without touching config files.
Firecracker VM, 500+ models, shell access, headless browser — provisioned in under 5 minutes.