Reducing OpenClaw API costs starts with a diagnostic most teams skip — and the fix is usually mechanical. One operator recently reported burning 135M OpenClaw tokens in about 24 hours, with nothing to show for it by morning: the instance had forgotten its own configuration, forgotten its tasks, and spent the night talking to itself.
That is not an edge case. It is the default shape of an unmonitored OpenClaw install.
The good news: almost all of it is recoverable. Most bills are 2–3× larger than they need to be. This guide walks through the fixes in order of impact — starting with the diagnostic, then the runaway loops, then the structural work.
Daily token spend before and after an OpenClaw cost audit: 112M/day before, 17M/day after — an 85% reduction.
TL;DR — the fast fixes (save ~80%)
- Pin heartbeats to a cheap model (Gemini Flash, DeepSeek V3, or a local Ollama model) instead of whatever Sonnet/Opus your agent defaults to.
- Bind the gateway to `127.0.0.1` so nothing on the internet can trigger an OpenClaw runaway loop from outside the host.
- Cap `MEMORY.md` under 80 lines — `MEMORY.md` size compounds on every turn because the file is re-injected into every single call.
- Disable the managed browser in `openclaw.json` and replace it with a lightweight browser skill invoked only when you need it.
- Set per-task spend caps and require human confirmation before retries so a failed tool call can't loop into a four-figure bill.
- Run `/context detail` weekly to catch tools, skills, and workspace files that crept into your prompt since last Monday.
Full step-by-step below, with config snippets.
Where your OpenClaw spend actually goes
Before changing anything, build the mental model. OpenClaw costs are almost entirely LLM API calls, and each call is billed as (input tokens × the input rate) + (output tokens × the output rate), with rates quoted per million tokens. The input token count on any given call is not just your message — it is your message plus the full conversation history, plus every workspace file injected on that turn (AGENTS.md, TOOLS.md, IDENTITY.md, SOUL.md, MEMORY.md, HEARTBEAT.md), plus every enabled tool's JSON schema, plus the system prompt.
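As a concrete sketch of that arithmetic — all prices and token counts below are illustrative assumptions, not OpenClaw's actual figures:

```python
# Per-call cost model: input and output tokens are billed at separate
# per-million rates. All numbers here are illustrative assumptions.

def call_cost(input_tokens: int, output_tokens: int,
              in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of a single LLM call."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

# The input side of one turn: your message is only a small fraction.
message = 200
history = 12_000          # full conversation so far, re-sent every turn
workspace_files = 6_000   # AGENTS.md, MEMORY.md, etc., injected each turn
tool_schemas = 15_000     # JSON schema for every enabled tool
system_prompt = 2_000

input_tokens = message + history + workspace_files + tool_schemas + system_prompt
print(input_tokens)  # 35200, i.e. 176x the message itself
print(round(call_cost(input_tokens, 500, 3.0, 15.0), 4))  # 0.1131
```

The point of the sketch: the message you typed is a rounding error next to the context riding along with it.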
Four things drive almost all OpenClaw cost:
- Model choice. The biggest lever by far. Claude Opus runs roughly 50× more per input token than Gemini Flash. Using a premium model for a heartbeat that reads a status file and replies `OK` is burning money. See the Kilo pricing page for how the per-token economics stack up across providers.
- Context accumulation. Every turn re-sends the full conversation history. A 40-turn session is paying to reprocess early messages 40 times.
- Unbounded loops. Heartbeats firing every few minutes, retries stacking on 429s, orphaned subagents, gateways reachable from outside the host. Any of these will produce the classic OpenClaw token burn pattern where spend triples overnight with zero output to show for it.
- Coordination overhead. Multi-agent setups duplicate context across specialists on every handoff. A verbose coordinator multiplies cost with every delegation.
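The context-accumulation driver is worth seeing numerically. A sketch, assuming a flat per-message size (real turns vary):

```python
# Each turn re-sends the entire history, so total input tokens across a
# session grow quadratically with turn count. Message size is an assumption.

def session_input_tokens(turns: int, tokens_per_message: int) -> int:
    """Total input tokens billed across a session where every turn
    re-sends all previous messages plus the new one."""
    return sum(turn * tokens_per_message for turn in range(1, turns + 1))

print(session_input_tokens(10, 500))   # 27500
print(session_input_tokens(40, 500))   # 410000: 4x the turns, ~15x the tokens
```

Quadrupling session length costs roughly fifteen times as much, which is why `/reset` discipline (Step 5) pays off so quickly.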
Before touching anything else, run:
```shell
/context detail
```
That gives you a per-file and per-tool token breakdown so you can see which of the four drivers is actually hurting on your instance. (OpenClaw docs: /context detail.) Do this first. Everything below is more effective when you know where the tokens are going.
Step 1: Kill the runaway loops
The cheapest wins are the loops you did not mean to start. An OpenClaw runaway loop rarely announces itself — it just shows up on the next invoice.
Lock down the gateway. If the gateway is bound to 0.0.0.0 or exposed behind a reverse proxy, anything on the internet can hit it. It should be loopback-only unless you have a deliberate reason otherwise. On the VPS:
```shell
ss -ltnp | grep 18789 || netstat -an | grep 18789
```
You want `127.0.0.1:18789`, not `0.0.0.0:18789`. Then run:
```shell
openclaw security audit --deep
```
and fix any "gateway exposed" findings it surfaces. (See the OpenClaw security audit docs for the full list of checks.)
Cap heartbeats. Default heartbeat configuration fires on whatever cadence it was installed with, using whatever model your agent defaults to. If that is Sonnet every five minutes, you are paying Sonnet prices for a check that only needs to read a status file and return `HEARTBEAT_OK`. Move heartbeats into their own `HEARTBEAT.md` with a strict checklist — two to four checks per day, not continuous — and pin them to a cheap model:
```shell
openclaw cron add --every 6h --session isolated --model google/gemini-flash \
  "Read HEARTBEAT.md; if all systems normal reply HEARTBEAT_OK. If anomaly found, alert coordinator."
```
Or disable the default entirely and replace it with explicit cron jobs you control:
```yaml
agents:
  defaults:
    heartbeat:
      every: "0"  # disabled
```
Set hard caps. OpenClaw does not have a native hard spend cap that stops API calls at a dollar threshold. What it does have is enough config to stop runaway retries cold. Set a per-task token budget with prices configured, disable re-reasoning on failure, and require human confirmation before any retry. Daily spend caps at the provider level (Anthropic, OpenAI, OpenRouter all support these) are the belt-and-braces layer.
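The same retry discipline can be enforced in a thin wrapper outside OpenClaw. A minimal sketch of a budget-and-confirmation gate around any task runner — the guard class, budget numbers, and runner interface are all hypothetical, not an OpenClaw API:

```python
# Hypothetical guard: stop once a per-task token budget is spent, and
# never retry a failed call without an explicit human go-ahead.

class BudgetExceeded(Exception):
    pass

class TaskGuard:
    def __init__(self, token_budget: int, confirm_retry=lambda: False):
        self.token_budget = token_budget
        self.spent = 0
        self.confirm_retry = confirm_retry  # human-in-the-loop hook

    def run(self, call, max_attempts: int = 3):
        for attempt in range(1, max_attempts + 1):
            if self.spent >= self.token_budget:
                raise BudgetExceeded(f"spent {self.spent} tokens")
            try:
                tokens_used, result = call()  # runner returns (tokens, result)
                self.spent += tokens_used
                return result
            except RuntimeError:
                # A failed call only retries if a human confirms.
                if attempt == max_attempts or not self.confirm_retry():
                    raise

guard = TaskGuard(token_budget=50_000)
print(guard.run(lambda: (1_200, "ok")))  # ok
```

The default `confirm_retry` returns False, so a failing tool call surfaces immediately instead of looping — the four-figure-bill failure mode dies at attempt one.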
Clean up orphaned subagents. Any subagent spawned without cleanup=delete can keep running in the background after its parent task ends. Set cleanup explicitly and cap subagent depth to prevent runaway fan-out.
Step 2: Tier your models
The single most impactful change most setups can make is to stop using the same model for everything. Match model to task:
- Free or local — Ollama-hosted Qwen 2.5, Llama 3.2, or Mistral running on the VPS for heartbeats, simple routing, and status checks. Zero per-call cost beyond compute you are already paying for. A 7B model in 4-bit quantisation needs roughly 4–5 GB of RAM.
- Budget ($0.10–$0.50/M input) — Gemini Flash, Claude Haiku, DeepSeek V3. Fast and cheap, capable enough for classification, summarisation, and light research. DeepSeek V3 via OpenRouter is the standout for research and summarisation work — cheap per token, no daily cap.
- Mid ($1–$5/M input) — Claude Sonnet, GPT-4o Mini, Gemini Pro. Use for tasks that need real reasoning: code review, multi-step planning, research synthesis.
- Premium ($10+/M input) — Claude Opus, GPT-4o, Gemini Ultra. Reserve for the specific tasks that genuinely need them.
Define an allowlist in agents.defaults.models so cheap models are selectable and premium models have to be asked for explicitly:
```yaml
agents:
  defaults:
    models:
      - "google/gemini-1.5-flash-latest"
      - "deepseek/deepseek-chat-v3"
      - "anthropic/claude-haiku-3-5"
      - "anthropic/claude-sonnet-4"
```
Then assign models per cron job and per agent rather than defaulting:
```json
{
  "name": "daily-research-digest",
  "schedule": "0 9 * * *",
  "model": "deepseek/deepseek-chat-v3",
  "session": "isolated"
}
```
For overnight cron work specifically, route to the cheapest model on your allowlist. Operators consistently report around 90% savings just from that one change. For a deeper look at the split between OpenClaw defaults and managed alternatives, see our breakdown of managed OpenClaw hosts.
Step 3: Clean your workspace files
Every file in `~/.openclaw/workspace/` gets injected on every single turn. Most were set up months ago and have not been opened since.
Open each of them — `AGENTS.md`, `TOOLS.md`, `IDENTITY.md`, `SOUL.md`, `MEMORY.md`, `HEARTBEAT.md`. If a line is not doing real work, delete it.
Two specific patterns that pay back fast:
- Keep `MEMORY.md` under 80 lines. `MEMORY.md` size is the variable most operators underestimate — because the file is re-injected on every turn, doubling the file doubles the cost of every call for the rest of the session. Only the content the agent genuinely needs in more than half your sessions belongs there. Everything else goes into daily logs — searchable, but not auto-loaded. When a session accumulates useful state, summarise it into `MEMORY.md` before you `/reset`, so the next session starts with the compressed version rather than the full history.
- Consolidate identity files into one. Point `MEMORY.md`, `SOUL.md`, `AGENTS.md`, `USER.md`, `TOOLS.md`, and `IDENTITY.md` all at a single file named for your agent. One operator reported 3–5K tokens saved per turn from this alone, and it future-proofs against OpenClaw updates that rename or restructure the individual files.
Step 4: Prune tool schemas
Tool schemas are the silent tax. Every enabled tool ships its full JSON schema on every message. If your agent has 20 tools defined and uses three, the other 17 definitions are inflating every call.
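You can rough-size the schema tax yourself with the common ~4-characters-per-token heuristic. Both the heuristic and the toy schemas below are assumptions for the sketch:

```python
import json

# Rough schema-overhead estimate: every enabled tool's JSON schema rides
# along on every call. ~4 chars per token is a heuristic, not exact.

def schema_tokens(schema: dict) -> int:
    return len(json.dumps(schema)) // 4

tools = {
    "web_search": {"name": "web_search",
                   "parameters": {"query": {"type": "string"},
                                  "maxResults": {"type": "integer"}}},
    "tts": {"name": "tts",
            "parameters": {"text": {"type": "string"},
                           "voice": {"type": "string"}}},
    "video_generate": {"name": "video_generate",
                       "parameters": {"prompt": {"type": "string"},
                                      "duration": {"type": "integer"}}},
}

per_call = sum(schema_tokens(s) for s in tools.values())
used = {"web_search"}
wasted = sum(schema_tokens(s) for name, s in tools.items() if name not in used)
print(per_call, wasted)  # the wasted share rides along on every message
```

Scale those toy schemas up to 20 real tools with full parameter descriptions and the per-message overhead lands in the thousands of tokens.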
In the Control UI, go to Agents → Tool Access and drop the "Full" preset. Select tools explicitly. Disable the generators you are not actually using — video_generate, music_generate, tts, canvas, apply_patch, x_search. They all ship schemas whether you call them or not.
The OpenClaw-managed browser deserves specific attention. Operators auditing their token spend consistently name it as the biggest single consumer. Disable it in openclaw.json:
```json
{
  "browser": {
    "enabled": false
  }
}
```
See the OpenClaw browser config reference for the full schema. Then build a lightweight browser skill using an external agent-browser library when you actually need browsing. The heavy tokens only burn when you invoke browsing, instead of on every single message whether you browse or not.
If you have half a dozen MCP connections — Gmail, Calendar, Drive, Slack, Notion, Linear, GitHub — each one ships its full tool schema on every turn. That is often 10–20K tokens of JSON before the agent has done anything. Consolidating to a single Composio connection (or equivalent) replaces the stack with one schema.
For web_search specifically, set maxResults: 3 and configure the agent to check MEMORY.md before searching. A meaningful share of search calls are the agent re-fetching something it already knows.
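The memory-before-search rule can live in a thin wrapper rather than in prompt instructions alone. A sketch — the memory store, matching rule, and `web_search` callable are hypothetical stand-ins, not OpenClaw's search tool:

```python
# Hypothetical wrapper: answer from memory when possible, only fall
# through to a paid web search on a miss, and cap results at 3.

def search_with_memory(query: str, memory: dict[str, str],
                       web_search, max_results: int = 3) -> dict:
    """memory maps known topics to stored answers; web_search is the
    paid call we are trying to avoid."""
    for topic, answer in memory.items():
        if topic.lower() in query.lower():
            return {"source": "memory", "results": [answer]}
    return {"source": "web", "results": web_search(query)[:max_results]}

memory = {"gateway port": "Gateway listens on 127.0.0.1:18789"}
fake_web = lambda q: [f"result {i} for {q}" for i in range(10)]

print(search_with_memory("what is the gateway port?", memory, fake_web)["source"])
print(len(search_with_memory("latest qdrant release", memory, fake_web)["results"]))
```

Substring matching is crude; the point is the control flow — the paid call only fires after the free lookup misses.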
Step 5: Fix session hygiene
Three ongoing habits:
- `/reset` regularly. Once a task is complete, summarise the key points into `MEMORY.md`, then `/reset` and start a new session for the next task. The cost of forgetting old context is almost always lower than the cost of re-sending it on every turn.
- Front-load the full task context. Short vague prompts burn tokens because the agent spends turns asking clarifying questions. One operator compared the same build handled two ways — a conversational back-and-forth that ran to 164K tokens, and the same work with the full brief loaded upfront that ran to 8K. Specificity compounds.
- Always use `--session isolated` for cron jobs. Isolated sessions start clean with no history, run the task, and terminate. Both cheaper and cleaner than cron jobs that accumulate state across runs.
Step 6: Architect for cost when the basics are not enough
Once the low-hanging fruit is gone, four structural moves genuinely change the cost curve:
Split one bloated agent into focused ones. If your main agent handles email, code, research, and chat, every turn carries context for all four domains. Run two or three focused agents instead — personal ops, engineering, research — each with a lean workspace and only the tools it needs.
Run heavy tasks in throwaway sub-agents. Use sessions_spawn for research, refactors, and scraping. The sub-agent runs in an isolated session with minimal context, returns a result, and terminates. Your main session stays clean. You pay for the task, not for dragging its history around forever.
Add embeddings and a vector DB. Instead of re-sending raw memory text on every turn, push memory chunks through an embedding endpoint, store the vectors locally, and only send the nearest-neighbour slices into the expensive model. A typical memory chunk goes from 200–500 KB of raw text to 10–20 KB of matched snippets — operators report roughly 90% token reduction on memory-heavy workloads. Qdrant runs in a single container:
```shell
docker run -d --name qdrant -p 6333:6333 qdrant/qdrant
```
Milvus, Chroma, Weaviate, and RedisVector are all reasonable alternatives depending on what you already run. Then set the embedding endpoint in your OpenClaw config:
```shell
EMBEDDING_ENDPOINT=https://your-emb-api.com
EMBEDDING_MODEL=text-embedding-3-small
```
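The retrieval step itself is just nearest-neighbour search over embedding vectors. A dependency-free sketch with toy 3-dimensional vectors standing in for real embeddings (which have hundreds of dimensions):

```python
import math

# Toy nearest-neighbour retrieval: send only the top-k matching memory
# snippets to the expensive model, not the whole memory file.
# The 3-d vectors are stand-ins for real embedding output.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

memory = [
    ("gateway binds to 127.0.0.1:18789", [0.9, 0.1, 0.0]),
    ("heartbeats run every 6h on Flash", [0.1, 0.9, 0.1]),
    ("MEMORY.md capped at 80 lines",     [0.0, 0.2, 0.9]),
]

def retrieve(query_vec, k=2):
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

print(retrieve([0.95, 0.05, 0.0], k=1))
```

A vector DB replaces the `sorted` call with an index that scales past a few thousand chunks, but the billing logic is identical: only the matched snippets reach the expensive model.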
Route through a proxy layer. LiteLLM earns its complexity at scale by adding prompt caching (for deterministic calls like scheduled heartbeats), rate-limit handling with token-bucket bursts, and multi-provider fallbacks. A practical fallback chain: primary Anthropic, fall back to OpenAI on 5xx, fall back to local Ollama on sustained failures. You stay responsive during provider outages without burning tokens on retries.
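The fallback chain is worth sketching, because the failure-handling order matters: fail over on server-side errors, not on every exception. The provider callables below are hypothetical stand-ins, not the LiteLLM API:

```python
# Hypothetical fallback chain: try providers in order, moving down only
# when one signals a server-side failure. Stand-ins, not LiteLLM calls.

class ProviderDown(Exception):
    """Stand-in for a 5xx / sustained-failure signal."""

def with_fallback(providers, prompt):
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderDown as e:
            errors.append((name, e))  # record and move down the chain
    raise RuntimeError(f"all providers failed: {errors}")

def flaky(prompt):
    raise ProviderDown("503")

chain = [
    ("anthropic", flaky),                              # primary
    ("openai", flaky),                                 # fallback on 5xx
    ("ollama-local", lambda p: f"local answer to {p!r}"),  # last resort
]

name, answer = with_fallback(chain, "status?")
print(name)  # ollama-local
```

Note what is not caught: a bad request or auth error raises immediately rather than burning tokens retrying across three providers.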
Step 7: Build the feedback loop
OpenClaw does not surface a live cost dashboard, but it surfaces enough state to build one.
`session_status` returns token counts per run along with the model used. (See the session_status API reference.) Multiply by per-token price and you have cost per session. A daily cron that aggregates across sessions and writes a summary to a log (or Telegram, or Slack) is enough to catch a runaway before it eats the month's budget.
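Turning those token counts into dollars is a few lines. A sketch — the record shape and the prices are assumptions, so adapt the field names to whatever `session_status` actually returns on your install:

```python
# Sketch: aggregate per-session token counts into daily cost. Record
# shape and per-million prices are illustrative assumptions.

PRICES_PER_M = {  # (input, output) dollars per million tokens
    "google/gemini-1.5-flash-latest": (0.15, 0.60),
    "anthropic/claude-sonnet-4":      (3.00, 15.00),
}

def session_cost(record: dict) -> float:
    in_price, out_price = PRICES_PER_M[record["model"]]
    return (record["input_tokens"] * in_price
            + record["output_tokens"] * out_price) / 1_000_000

sessions = [
    {"model": "google/gemini-1.5-flash-latest",
     "input_tokens": 2_000_000, "output_tokens": 100_000},
    {"model": "anthropic/claude-sonnet-4",
     "input_tokens": 500_000, "output_tokens": 40_000},
]

daily = sum(session_cost(s) for s in sessions)
print(round(daily, 2))  # 2.46
```

Note the asymmetry in the example: the Sonnet session used a quarter of the tokens but cost nearly six times as much, which is exactly the signal a daily summary should surface.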
Write thresholds into MEMORY.md so the monitoring agent can read and apply them:
```markdown
## Budget thresholds
- Daily limit: $5.00 (alert at $4.00)
- Weekly limit: $20.00 (alert at $14.00)
- Per-agent daily: $2.00
- Alert channel: Telegram
```
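A monitoring cron can parse that threshold format directly. A simplified sketch — the parsing assumes the exact line shape shown above, and the alert decision is reduced to three states:

```python
import re

# Parse dollar thresholds out of a MEMORY.md-style budget block and
# decide whether today's spend warrants an alert. Simplified sketch.

BUDGET_BLOCK = """\
## Budget thresholds
- Daily limit: $5.00 (alert at $4.00)
- Weekly limit: $20.00 (alert at $14.00)
"""

def parse_thresholds(text: str) -> dict:
    pattern = r"- (\w+) limit: \$([\d.]+) \(alert at \$([\d.]+)\)"
    return {
        period.lower(): {"limit": float(limit), "alert": float(alert)}
        for period, limit, alert in re.findall(pattern, text)
    }

def check(spend: float, period: str, thresholds: dict) -> str:
    t = thresholds[period]
    if spend >= t["limit"]:
        return "over-limit"
    if spend >= t["alert"]:
        return "alert"
    return "ok"

thresholds = parse_thresholds(BUDGET_BLOCK)
print(check(4.20, "daily", thresholds))  # alert
```

Keeping the thresholds in `MEMORY.md` means both the monitoring cron and the agent itself read the same numbers, so there is one place to change them.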
Set maxConcurrentRuns on the Gateway as a soft spend control. Capping concurrency to three or four does not prevent expensive calls — it just prevents six of them from landing simultaneously, which smooths rate-limit exposure and makes daily spend more predictable.
The weekly 10-minute audit
Run this every Monday:
- `/context detail` — what is in context this week that was not last week?
- `MEMORY.md` line count under 80?
- Tool access list still tight, or have new tools crept in?
- Heartbeats still pointed at a cheap model?
- Per-task and daily spend caps still active?
New skills, tools, and files sneak in over time. Ten minutes of auditing keeps the drift under control.
If any of this sounds like a weekly chore you'd rather not own, a managed host handles heartbeat defaults, gateway binding, and runtime caps out of the box — see our comparison of managed OpenClaw platforms or spin up KiloClaw in under five minutes.
FAQ
How much can I realistically save on OpenClaw API costs?
Operators who move heartbeats and cron jobs to cheap models while reserving premium models for sessions that need them consistently report 80–95% cost reduction. Exact savings depend on workload mix — setups heavy on scheduled automation save more than setups heavy on interactive sessions.
Does OpenClaw context compaction lose important information?
Sometimes, yes. Compaction produces summaries, and summaries lose detail. The mitigation is to write anything genuinely important into MEMORY.md explicitly before a session gets long enough for compaction to get aggressive. Treat MEMORY.md as a deliberate write target, not a place compaction will happen to preserve things.
Can I set a hard spend cap that automatically stops OpenClaw API calls?
Not natively. The budget-monitoring pattern (cron + session_status + alerts) gives visibility and soft limits. For a hard stop, you need a proxy layer with budget-aware virtual keys (LiteLLM supports this) or a custom skill that disables the gateway when spend crosses a threshold.
Is running OpenClaw through LiteLLM worth it for a small setup?
For a single agent with one provider, probably not — the operational overhead outweighs the caching benefit. LiteLLM earns its place when you have multiple agents, repetitive scheduled calls (where caching kicks in), or need multi-provider fallbacks for reliability.
Should I self-host OpenClaw or use a managed host?
Self-hosting gives maximum control and is usually cheapest at steady state, but it means you own the audit — everything in this guide becomes your responsibility. Managed hosts like KiloClaw handle heartbeat defaults, gateway exposure, and runtime caps out of the box, which is the right trade-off for teams who do not want a weekly audit on the calendar. The calculus is mostly about where you want to spend engineering time.