Usage & Billing
The Kilo AI Gateway tracks usage and costs for every request with microdollar precision (1 USD = 1,000,000 microdollars). This enables accurate billing even for very low-cost requests.
How billing works
Every request to the gateway follows this flow:
- Balance check: Before proxying the request, the gateway verifies you have sufficient balance
- Request execution: The request is sent to the upstream provider
- Usage tracking: Token counts and costs are extracted from the response
- Balance update: Your balance is atomically updated with the request cost
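The four steps above can be sketched in Python. All names here (`handle_request`, the `account` dict, the `provider` callable) are illustrative, not the gateway's actual internals:

```python
class InsufficientBalanceError(Exception):
    """Corresponds to the HTTP 402 response returned by the gateway."""

def handle_request(account, request, provider):
    # 1. Balance check: verify sufficient balance before proxying
    if account["balance_microdollars"] <= 0:
        raise InsufficientBalanceError(
            "Insufficient balance. Please add credits to continue.")
    # 2. Request execution: send the request to the upstream provider
    response = provider(request)
    # 3. Usage tracking: extract the cost from the response
    cost = response["usage"]["cost_microdollars"]
    # 4. Balance update: atomically deduct the request cost
    account["balance_microdollars"] -= cost
    return response
```

In the real gateway the balance update is atomic with respect to concurrent requests; the sketch only shows the ordering of the steps.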
Cost calculation
Costs are determined by the upstream provider's pricing based on token usage:
- Input tokens: Tokens in your prompt (system message, user messages, tool definitions)
- Output tokens: Tokens generated by the model
- Cache write tokens: Tokens written to the provider's prompt cache
- Cache hit tokens: Tokens served from the provider's prompt cache (typically discounted)
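Putting the four token types together, a cost calculation looks like the following sketch. The per-million-token rates are hypothetical placeholders; real pricing comes from the upstream provider:

```python
MICRODOLLARS_PER_USD = 1_000_000

def request_cost_microdollars(usage, rates):
    """Sum per-token-type costs. `rates` are USD per million tokens
    (hypothetical numbers, not any provider's real price list)."""
    total_usd = sum(
        usage.get(kind, 0) * rates[kind] / 1_000_000
        for kind in ("input_tokens", "output_tokens",
                     "cache_write_tokens", "cache_hit_tokens")
    )
    return round(total_usd * MICRODOLLARS_PER_USD)

# Example rates: cache hits billed at a steep discount vs. input tokens
rates = {"input_tokens": 3.00, "output_tokens": 15.00,
         "cache_write_tokens": 3.75, "cache_hit_tokens": 0.30}
usage = {"input_tokens": 1_000, "output_tokens": 500, "cache_hit_tokens": 2_000}
cost = request_cost_microdollars(usage, rates)  # 3000 + 7500 + 600 = 11100 microdollars
```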
Free and BYOK requests
- Free models: Models tagged with `:free` have zero cost -- usage is tracked but not billed
- BYOK requests: When using your own API key, the cost is set to $0 on Kilo's side. You pay the provider directly based on your agreement with them
Balance management
Individual accounts
Your account balance is the difference between total credits purchased and total usage. Check your balance in the Kilo dashboard.
When your balance reaches zero, requests to paid models will return HTTP 402 with a link to add credits:
```json
{
  "error": {
    "message": "Insufficient balance. Please add credits to continue.",
    "code": 402,
    "metadata": {
      "buyCreditsUrl": "https://app.kilo.ai/credits"
    }
  }
}
```
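A client can detect this condition and surface the top-up link to the user. A minimal sketch, assuming the response body has the shape shown above (`credits_url_from_402` is an illustrative helper, not part of any SDK):

```python
import json

def credits_url_from_402(body_text):
    """Extract the buyCreditsUrl from a 402 error body, if present."""
    try:
        err = json.loads(body_text).get("error", {})
    except json.JSONDecodeError:
        return None
    if err.get("code") == 402:
        return err.get("metadata", {}).get("buyCreditsUrl")
    return None
```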
Organization accounts
Organizations have their own balance pool that members draw from. Organization billing supports:
- Shared balance: All members use a common credit pool
- Per-user daily limits: Cap individual member spending (e.g., $5/day per user)
- Auto top-up: Automatically replenish credits when the balance drops below a threshold
- Minimum balance alerts: Email notifications when the balance drops below a configured amount
Organization controls
Organizations can enforce policies on gateway usage for their members.
Model allow lists
Restrict which models organization members can use:
```
# Examples of allow list entries
anthropic/claude-sonnet-4.5    # Specific model
anthropic/*                    # All Anthropic models
openai/gpt-5.2                 # Specific OpenAI model
```
The allow list supports exact matches and wildcard patterns. Requests for models not on the list return HTTP 403.
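The matching semantics (exact match plus wildcards) can be sketched with Python's `fnmatch` module. This is an illustration of the behavior, not the gateway's implementation:

```python
from fnmatch import fnmatchcase

def model_allowed(model_id, allow_list):
    """True if model_id exactly matches an entry or matches a wildcard
    pattern; a non-match corresponds to the gateway's HTTP 403."""
    return any(fnmatchcase(model_id, pattern) for pattern in allow_list)

allow = ["anthropic/claude-sonnet-4.5", "anthropic/*", "openai/gpt-5.2"]
```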
Provider allow lists
Restrict which inference providers can be used for routing. This is passed to the upstream router and affects which backends serve the request.
Data collection controls
Organizations can set a data collection policy (allow or deny) that is applied to all requests from their members. Some free models require data collection to be allowed.
Per-user daily spending limits
Set a maximum daily spend per organization member. When a member reaches their daily limit, subsequent requests return a balance error. The daily limit resets at midnight UTC.
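The limit check and the midnight-UTC reset can be sketched as follows (both helpers are illustrative; the gateway enforces this server-side):

```python
from datetime import datetime, timedelta, timezone

def within_daily_limit(spent_today_microdollars, limit_usd):
    """True while a member's spend today is below their daily cap."""
    return spent_today_microdollars < limit_usd * 1_000_000

def seconds_until_reset(now=None):
    """Seconds until the daily limit resets at midnight UTC."""
    now = now or datetime.now(timezone.utc)
    midnight = (now.replace(hour=0, minute=0, second=0, microsecond=0)
                + timedelta(days=1))
    return int((midnight - now).total_seconds())
```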
Rate limiting
Free model rate limits
All free model requests (both anonymous and authenticated) are rate-limited by IP address:
| Scope | Limit |
|---|---|
| Free models per IP | 200 requests per hour |
When rate-limited, you receive HTTP 429:
```json
{
  "error": {
    "message": "Rate limit exceeded for free models. Please try again later.",
    "code": 429
  }
}
```
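A common client-side response to HTTP 429 is exponential backoff with jitter. A minimal sketch, assuming `send` is any callable returning an object with a `status` attribute (not a specific HTTP library):

```python
import random
import time

def call_with_backoff(send, max_attempts=5, base_delay=1.0):
    """Retry on HTTP 429, doubling the delay each attempt and adding
    a small random jitter to avoid synchronized retries."""
    for attempt in range(max_attempts):
        response = send()
        if response.status != 429:
            return response
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    return response  # last response, still 429 after max_attempts
```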
Paid model limits
Paid model requests are not rate-limited by the gateway itself, but may be rate-limited by upstream providers. Organization per-user daily spending limits provide an additional layer of cost control.
Usage data
Usage data is tracked per request and includes:
| Field | Description |
|---|---|
| `model` | Model ID used |
| `provider` | Inference provider that served the request |
| `input_tokens` | Number of input/prompt tokens |
| `output_tokens` | Number of output/completion tokens |
| `cache_write_tokens` | Tokens written to cache |
| `cache_hit_tokens` | Tokens served from cache |
| `cost_microdollars` | Cost in microdollars (1 USD = 1,000,000) |
| `time_to_first_token` | Latency to first token (streaming only) |
| `is_byok` | Whether a BYOK key was used |
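Per-request rows with these fields are easy to aggregate. A small sketch that totals tokens and converts microdollars back to USD (the record dicts mirror the field names in the table; the helper itself is illustrative):

```python
def summarize_usage(records):
    """Aggregate per-request usage rows into totals."""
    return {
        "requests": len(records),
        "input_tokens": sum(r["input_tokens"] for r in records),
        "output_tokens": sum(r["output_tokens"] for r in records),
        "cost_usd": sum(r["cost_microdollars"] for r in records) / 1_000_000,
    }
```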
Token counting
Token counts are provided by the upstream model and are based on the model's native tokenizer. The gateway does not re-tokenize content. Usage data is available:
- Non-streaming: In the `usage` field of the response body
- Streaming: In the final SSE chunk before `[DONE]`
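For the streaming case, a client can scan `data:` lines and keep the last `usage` object seen before the `[DONE]` sentinel. A minimal sketch, assuming each SSE chunk is a JSON object that may carry a `usage` field:

```python
import json

def usage_from_sse(lines):
    """Return the last `usage` object seen in an SSE stream
    before the [DONE] sentinel, or None if the stream had none."""
    usage = None
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip comments, blank keep-alives, etc.
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage") is not None:
            usage = chunk["usage"]
    return usage
```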