Real-world benchmark data from PinchBench: 23 standardised OpenClaw tasks, scored on success rate, speed, and cost.
Official runs only · 12 models benchmarked
Sorted by average success rate across all official PinchBench runs.
trinity-large-thinking (arcee-ai)
claude-opus-4-6 (anthropic)
grok-4.20-beta-0309-non-reasoning (xai)
claude-sonnet-4-6 (anthropic)
claude-opus-4.6 (anthropic)
qwen3.5-27b (qwen)
minimax-m2.7 (minimax)
auto (openrouter)
gemini-3-flash-preview
claude-haiku-4-5 (anthropic)
gpt-5.4 (openai)
qwen3-235b-a22b-thinking-2507 (qwen)
How to choose
A model that scores 90% once and 60% the next time isn't production-ready. For 24/7 agents, the average score across many runs is the metric that matters, not a lucky peak.
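To make the difference between average and peak concrete, here is a minimal sketch that summarises a set of run scores. The function name `summarize_runs` and the score lists are hypothetical illustrations, not PinchBench data or part of any PinchBench tooling.

```python
from statistics import mean, pstdev


def summarize_runs(scores):
    """Return the mean, best single run, and spread of a list of success rates."""
    if not scores:
        raise ValueError("need at least one run")
    return {
        "mean": mean(scores),      # what you can expect on a typical run
        "peak": max(scores),       # the lucky best case
        "spread": pstdev(scores),  # population standard deviation across runs
    }


# Hypothetical scores for two imaginary models (fractions of tasks passed).
steady = [0.78, 0.80, 0.76, 0.79]
streaky = [0.90, 0.60, 0.88, 0.62]

for name, runs in (("steady", steady), ("streaky", streaky)):
    s = summarize_runs(runs)
    print(f"{name}: mean={s['mean']:.3f} peak={s['peak']:.2f} spread={s['spread']:.3f}")
```

In this made-up example the streaky model has the higher peak, but the steady model has the higher mean and a much smaller spread, which is the property an always-on agent depends on.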
Slow models create bottlenecks when your agent handles hundreds of daily automations. Under 5 minutes per task is a practical target for interactive workflows.
An agent that fires 50 tasks a day accumulates real inference spend. Weigh per-run cost against your task volume: free or cheap models might trade quality for savings.
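The cost trade-off can be sketched as simple arithmetic. The helpers and prices below are hypothetical placeholders (no real model pricing is implied); the 50 tasks per day figure is the one used above.

```python
def daily_spend(tasks_per_day, cost_per_run):
    """Total inference spend per day, in the same currency as cost_per_run."""
    return tasks_per_day * cost_per_run


def cost_per_success(cost_per_run, success_rate):
    """Average spend needed to get one successful task.

    success_rate is a fraction in (0, 1]. A cheap model with a low success
    rate can end up costing more per completed task than a pricier one.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_run / success_rate


# Hypothetical numbers: 50 tasks a day at an assumed $0.04 per run.
print(f"daily spend: ${daily_spend(50, 0.04):.2f}")

# Assumed price/success pairs for two imaginary models.
print(f"cheap model:   ${cost_per_success(0.01, 0.40):.4f} per success")
print(f"pricier model: ${cost_per_success(0.04, 0.80):.4f} per success")
```

Dividing per-run cost by success rate is only one way to compare models; it ignores retries that fail repeatedly and the cost of acting on a wrong result, so treat it as a first-pass estimate.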
Model spotlights
Always-on agents have different needs. Pick the right model for your workload.
The most reliable choice for complex, multi-step agent tasks that run 24/7. Handles ambiguous instructions, tool-calling chains, and long-horizon planning consistently well across many runs, not just in a lucky single attempt.
When your agent fires dozens of automations per day (triaging email, checking CI, queuing reminders), response latency compounds fast. This model delivers strong accuracy without the wait, keeping your agent snappy for real-time workflows.
Running an always-on agent means paying for inference every time it acts. This model gives you the best success rate per dollar spent, ideal when you're running high-volume automation pipelines or keeping costs predictable on a budget.
An agent you can set and forget needs to perform the same every time, not just occasionally. This model's scores stay tight across many independent runs, making it the safest choice when reliability matters more than chasing a peak score.
KiloClaw connects to 500+ models through Kilo Gateway at 0% markup. You can swap your OpenClaw model without touching config files.
Firecracker VM, 500+ models, shell access, headless browser: provisioned in under 5 minutes.