How to Cut AI Coding Spend in Half

Every week, engineering leaders ask for the same thing: a spending cap on AI coding tools.

They want a clean dial—"$10 per user per day," "freeze usage at X dollars per month"—something that sounds like responsible budget governance. The instinct is understandable. When costs are new and feel unpredictable, a hard limit feels like control.

But that instinct is wrong. And increasingly, the most sophisticated engineering organizations in the world are proving it.

The real answer to AI coding spend isn't friction. It's infrastructure.

The Cap Trap

Hard spending caps break the exact thing you're paying for.

Picture a developer deep in the middle of refactoring a critical subsystem. Their agent is drafting a migration plan, modifying schemas across dozens of files, running parallel agents to validate compatibility. Mid-task—right as they're about to apply the changes—the workflow stops. The org hit its daily cap.

Now their workspace is in a half-applied state. Context is fragmented. The cost to productivity dwarfs the few dollars "saved" by the cap. The cap interrupted the work precisely when the work mattered most.

This is why Kilo holds to a product principle that doesn't bend: Never hard-block users with spending limits.
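One way to keep that principle while still watching spend is a soft threshold that raises an alert but never rejects a request. The sketch below is a minimal illustration, not Kilo's implementation: `check_budget`, `BudgetDecision`, and the 80% warning ratio are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BudgetDecision:
    allow: bool            # always True in this sketch: requests are never blocked
    alert: Optional[str]   # message to surface, or None when well under the limit


def check_budget(spent_usd: float, soft_limit_usd: float) -> BudgetDecision:
    """Hypothetical soft-limit check: notify instead of blocking."""
    if soft_limit_usd <= 0:
        raise ValueError("soft_limit_usd must be positive")
    ratio = spent_usd / soft_limit_usd
    if ratio >= 1.0:
        return BudgetDecision(True, f"Over soft limit: ${spent_usd:.2f} of ${soft_limit_usd:.2f}")
    if ratio >= 0.8:  # assumed warning threshold
        return BudgetDecision(True, f"Approaching soft limit: {ratio:.0%} used")
    return BudgetDecision(True, None)
```

The design choice is that `allow` stays true on every path, so a mid-task workflow keeps running while the alert reaches the developer or an admin.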

But "don't use caps" isn't a cost strategy on its own. Teams need a real strategy. And the most instructive example of what that actually looks like comes from Coinbase.

What Coinbase Actually Did

In June 2026, Coinbase CEO Brian Armstrong shared how his team was managing AI infrastructure costs—not by throttling developers, but by systematically improving the infrastructure underneath them.

The result: AI spend cut nearly in half, even as token usage continued to grow.

Here's what they actually did:

1. Better Defaults

Coinbase discovered that 91% of employees were never hitting usage caps in the first place, which suggests the default model, not the cap, shapes most of the spend. Their team started experimenting with open-weight models (GLM-5.2 and Kimi 2.7) as defaults in their LLM gateway. For most developers doing routine tasks, open-weight coding models can be sufficient. Defaulting to a capable, cheaper model means the upgrade to a frontier model only happens when genuinely needed.

2. Smarter Routing

Rather than requiring developers to manually select models for each task, Coinbase built preprocessing logic to route prompts to the best model based on cache hits, pricing, and task type. Armstrong summarized the logic plainly: "Frontier model for planning, cheaper models for execution. Humans shouldn't be choosing models—AI can automate this task."

This is the key insight. Model selection at scale shouldn't be a human decision. It should be an automated policy.
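As a rough illustration of routing as policy rather than a per-request human choice, here is a minimal sketch. It covers only the task-type signal (not cache hits or pricing), and every name in it is an assumption: the tier names are placeholders, and `Request`, `route`, and `FRONTIER_TASKS` are hypothetical, not Coinbase's or Kilo's code.

```python
from dataclasses import dataclass

# Placeholder tier names for illustration; not real model IDs.
FRONTIER = "frontier-model"
EFFICIENT = "efficient-open-weight-model"

# Task types this sketch assumes need the strongest reasoning.
FRONTIER_TASKS = {"planning", "architecture", "security"}


@dataclass
class Request:
    task_type: str          # e.g. "planning", "test_generation"
    escalate: bool = False  # developer explicitly asked for the frontier tier


def route(request: Request) -> str:
    """Return the model tier for a request; the cheaper tier is the default."""
    if request.escalate or request.task_type in FRONTIER_TASKS:
        return FRONTIER
    return EFFICIENT


print(route(Request("planning")))         # frontier-model
print(route(Request("test_generation")))  # efficient-open-weight-model
```

The cheaper tier is the fall-through case, so an unrecognized task type never silently lands on the most expensive model, while the `escalate` flag keeps the upgrade path one step away.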

3. Real Prompt Caching

Perhaps the most striking data point: cache hit rate went from 5% to 60% once LibreChat was properly configured. That's a 12x improvement in hit rate. Because providers typically bill cached input tokens at a discount, a large share of the input workload now costs a fraction of the uncached price, simply because work already computed is not recomputed.
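To see why hit rate matters so much, a back-of-the-envelope calculation helps. The sketch below assumes cached input tokens cost 10% of the uncached price. That ratio is an illustrative assumption, not a figure from Coinbase or any specific provider, and the model ignores output tokens and any cache-write surcharge.

```python
def relative_input_cost(hit_rate: float, cached_price_ratio: float) -> float:
    """Input-token cost relative to a no-cache baseline of 1.0.

    hit_rate: fraction of input tokens served from cache (0..1).
    cached_price_ratio: cached token price / uncached token price
        (an assumed value; check your provider's pricing).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return (1.0 - hit_rate) + hit_rate * cached_price_ratio


before = relative_input_cost(0.05, 0.10)  # 5% hit rate
after = relative_input_cost(0.60, 0.10)   # 60% hit rate
savings = 1.0 - after / before
print(f"before={before:.3f} after={after:.3f} savings={savings:.0%}")
# prints: before=0.955 after=0.460 savings=52%
```

Under that assumed price ratio, moving from a 5% to a 60% hit rate roughly halves input-token cost. A different ratio shifts the result, which is why the parameter is explicit.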

4. Leaner Context

The team also adopted simple habits: starting fresh sessions when switching tasks, scoping file context narrowly, and disconnecting unused tools. These aren't dramatic engineering changes. They're defaults and practices.
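The effect of narrow scoping can be estimated before a request is ever sent. This sketch uses a crude four-characters-per-token heuristic, which is an assumption (real tokenizers vary by model and language), and the helper names `estimate_tokens` and `context_tokens` are hypothetical.

```python
from typing import Optional


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; real tokenizers vary by model and language."""
    return int(len(text) / chars_per_token)


def context_tokens(files: dict[str, str], selected: Optional[set[str]] = None) -> int:
    """Estimated tokens for the attached files.

    files: mapping of path -> file contents.
    selected: paths to attach; None means attach everything.
    """
    paths = files.keys() if selected is None else selected & files.keys()
    return sum(estimate_tokens(files[p]) for p in paths)


# Toy repository: one small file and one large one.
repo = {"a.py": "x" * 400, "b.py": "y" * 4000}
print(context_tokens(repo))            # whole folder attached
print(context_tokens(repo, {"a.py"}))  # only the file the task needs
```

In this toy case, attaching only the relevant file cuts the estimated context from 1,100 tokens to 100, and that difference is paid again on every turn of the session.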

5. Making Usage Visible

Perhaps underrated: Coinbase made usage visible to employees. Armstrong noted that the expectation is clear: the more you spend on AI, the more impact is expected. Visibility creates accountability without creating friction.
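Making usage visible can start with a simple per-user rollup of gateway logs. The sketch below assumes each log record is a dict with `user`, `tokens`, and `cost_usd` keys. That record shape and the `summarize_usage` helper are hypothetical, not a real gateway's schema.

```python
from collections import defaultdict


def summarize_usage(records: list[dict]) -> dict[str, dict]:
    """Aggregate request count, tokens, and cost per user."""
    totals = defaultdict(lambda: {"requests": 0, "tokens": 0, "cost_usd": 0.0})
    for record in records:
        entry = totals[record["user"]]
        entry["requests"] += 1
        entry["tokens"] += record["tokens"]
        entry["cost_usd"] += record["cost_usd"]
    return dict(totals)


# Made-up sample records for illustration only.
sample = [
    {"user": "ana", "tokens": 1200, "cost_usd": 0.50},
    {"user": "ana", "tokens": 800, "cost_usd": 0.25},
    {"user": "raj", "tokens": 300, "cost_usd": 0.125},
]
summary = summarize_usage(sample)
print(summary["ana"])  # {'requests': 2, 'tokens': 2000, 'cost_usd': 0.75}
```

A rollup like this can feed a dashboard that developers see themselves, which is the point of the practice: the numbers inform rather than block.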

A Procurement Story Playing Out in Real Time

Coinbase's shift to open-weight models like GLM-5.2 and Kimi 2.7 isn't unique. It reflects a broader pattern that's accelerating across the industry.

In June 2026, developer @yuhasbeentaken compiled a list of Western companies that have moved significant AI workloads to Chinese open-weight models:

Company	Model
Coinbase	GLM-5.2 + Kimi 2.7
Cursor	Kimi K2.5
Lindy	DeepSeek v4
Shopify	Qwen
Airbnb	Qwen
Uber Eats	Qwen2
Siemens	DeepSeek + Qwen
Microsoft	Testing DeepSeek v4

As the post put it: "It's becoming a procurement story."

This is not about ideology or geopolitics. These companies are optimizing for the same thing every engineering organization should optimize for: the best output per dollar for each type of task. When a Chinese open-weight model delivers equivalent quality to a frontier model at a fraction of the cost for a given class of work, procurement departments notice.

The practical implication: a single-vendor, single-model AI coding strategy is increasingly expensive by default. Teams that can route across a diverse model ecosystem—including open-weight alternatives—have access to a cost lever that single-vendor shops do not. For a tactical breakdown of where these savings come from, see how to reduce AI coding costs for your engineering team.

The Five Infrastructure Levers

Taken together, the Coinbase story and the broader market shift point to five infrastructure levers that actually move the needle on AI coding costs:

1. Task-aware routing. Not every task requires a frontier model. Documentation, test generation, bounded refactors, and summarization often work fine on efficient or open-weight models. Architecture decisions, security-sensitive changes, and subtle debugging may genuinely require the strongest available reasoning. The question is whether routing is a manual decision (it shouldn't be) or an automated policy (it should be).

2. Prompt caching. Cache hit rates below 20% usually signal wasted spend in engineering workflows, because system prompts, repository context, and recurring patterns can all be cached. The Coinbase example (5% to 60%) shows how much room typically exists to improve this.

3. Sensible defaults. If most developers never hit usage caps, what model are they defaulting to? If it's a frontier model because that's what the UI opened with, that's a pure infrastructure cost with no benefit. Setting an open-weight or efficient model as the default—with easy escalation—is a structural saving that requires no developer behavior change.

4. Context discipline. Long context windows are expensive. The practice of starting fresh sessions when tasks change, attaching specific files rather than entire folders, and trimming noisy output isn't just good engineering hygiene—it's meaningful cost management.

5. Visibility, not restrictions. Usage data should be visible to developers, not just to finance. When engineers can see what they're spending and what value they're getting, behavior improves on its own. The goal is accountability, not a speedbump.

What This Means for Your Team

The cap instinct makes sense for expenses with no upside—extra seats, redundant SaaS tools, forgotten subscriptions. It makes no sense for developer productivity tooling where the cost is directly tied to output.

Armstrong's framing captures the entire principle: token usage growing while spend falls is the goal. That means developers are doing more, faster, and the infrastructure is smart enough to deliver that work efficiently.

At Kilo, this is the foundation of how cost management works. Kilo gives teams:

500+ hosted models including open-weight models, so you're never locked into one cost tier—all at zero-markup pricing
Auto Model routing that intelligently selects the right model for the task complexity without requiring developer decision-making
BYOK support so teams can use existing provider agreements and credits
Local model support via Ollama and LM Studio for privacy-sensitive or high-volume bounded work
Organization dashboards with request-level visibility into model, tokens, cache activity, and cost—before finance sends a surprise. See the full enterprise platform for team controls.

The most sophisticated engineering orgs aren't asking "how do we limit AI usage?" They're asking "how do we remove every obstacle between our developers and shipping?"

That's the question worth spending your time on.
