API Reference
The Kilo AI Gateway provides an OpenAI-compatible API. Unless otherwise noted, endpoint paths below are relative to the base URL:
https://api.kilo.ai/api/gateway
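Because the API is OpenAI-compatible, existing OpenAI client libraries can generally be pointed at the gateway by overriding the base URL. A minimal sketch using the official openai Node package, assuming full compatibility for the endpoints you call:
import OpenAI from "openai"

// Point the standard OpenAI client at the Kilo AI Gateway.
const client = new OpenAI({
  baseURL: "https://api.kilo.ai/api/gateway",
  apiKey: process.env.KILO_API_KEY,
})

const completion = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4.5",
  messages: [{ role: "user", content: "Hello!" }],
})
console.log(completion.choices[0].message.content)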
Chat completions
Create a chat completion. This is the primary endpoint for interacting with AI models.
POST /chat/completions
Request body
type ChatCompletionRequest = {
// Required
model: string // Model ID (e.g., "anthropic/claude-sonnet-4.5")
messages: Message[] // Array of conversation messages
// Streaming
stream?: boolean // Enable SSE streaming (default: false)
// Generation parameters
max_tokens?: number // Maximum tokens to generate
temperature?: number // Sampling temperature (0-2)
top_p?: number // Nucleus sampling (0-1)
stop?: string | string[] // Stop sequences
frequency_penalty?: number // Frequency penalty (-2 to 2)
presence_penalty?: number // Presence penalty (-2 to 2)
// Tool calling
tools?: Tool[] // Available tools/functions
tool_choice?: ToolChoice // Tool selection strategy
// Structured output
response_format?: ResponseFormat
// Other
user?: string // End-user identifier for safety
seed?: number // Deterministic sampling seed
}
Message types
type Message =
| { role: "system"; content: string }
| { role: "user"; content: string | ContentPart[] }
| { role: "assistant"; content: string | null; tool_calls?: ToolCall[] }
| { role: "tool"; content: string; tool_call_id: string }
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string; detail?: string } }
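For example, a user message can combine text and image parts (the image URL below is a placeholder):
const visionMessage: Message = {
  role: "user",
  content: [
    { type: "text", text: "What is shown in this image?" },
    // "detail" is optional; supported values depend on the provider.
    { type: "image_url", image_url: { url: "https://example.com/chart.png", detail: "high" } },
  ],
}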
type Tool = {
type: "function"
function: {
name: string
description?: string
parameters: object // JSON Schema
}
}
type ToolChoice = "none" | "auto" | "required" | { type: "function"; function: { name: string } }
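The ToolCall object referenced above is not defined inline; its shape matches the tool call response example later on this page:
type ToolCall = {
  id: string // Unique call ID; echo it back as tool_call_id in the tool result message
  type: "function"
  function: {
    name: string // Name of the function to invoke
    arguments: string // JSON-encoded argument object (a string, not an object)
  }
}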
Response (non-streaming)
type ChatCompletionResponse = {
id: string
object: "chat.completion"
created: number
model: string
choices: Array<{
index: number
message: {
role: "assistant"
content: string | null
tool_calls?: ToolCall[]
}
finish_reason: "stop" | "length" | "tool_calls" | "content_filter"
}>
usage: {
prompt_tokens: number
completion_tokens: number
total_tokens: number
}
}
Response (streaming)
When stream: true, the response is delivered as a stream of server-sent events (SSE):
type ChatCompletionChunk = {
id: string
object: "chat.completion.chunk"
created: number
model: string
choices: Array<{
index: number
delta: {
role?: "assistant"
content?: string
tool_calls?: ToolCall[]
}
finish_reason: string | null
}>
// Only in the final chunk
usage?: {
prompt_tokens: number
completion_tokens: number
total_tokens: number
}
}
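A minimal sketch of consuming the stream in TypeScript with fetch. It assumes the usual OpenAI SSE framing (each event is a data: line carrying a JSON chunk, with a final data: [DONE] sentinel); verify against the live stream if your client depends on the exact framing.
const res = await fetch("https://api.kilo.ai/api/gateway/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.KILO_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "anthropic/claude-sonnet-4.5",
    messages: [{ role: "user", content: "Write a haiku about the sea." }],
    stream: true,
  }),
})

const reader = res.body!.getReader()
const decoder = new TextDecoder()
let buffer = ""

while (true) {
  const { done, value } = await reader.read()
  if (done) break
  buffer += decoder.decode(value, { stream: true })

  // SSE events are separated by blank lines; keep any partial event in the buffer.
  const events = buffer.split("\n\n")
  buffer = events.pop() ?? ""

  for (const event of events) {
    for (const line of event.split("\n")) {
      if (!line.startsWith("data: ")) continue
      const data = line.slice("data: ".length)
      if (data === "[DONE]") continue
      const chunk = JSON.parse(data) // ChatCompletionChunk
      process.stdout.write(chunk.choices[0]?.delta?.content ?? "")
    }
  }
}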
Example request
curl -X POST "https://api.kilo.ai/api/gateway/chat/completions" \
-H "Authorization: Bearer $KILO_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "anthropic/claude-sonnet-4.5",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is quantum computing?"}
],
"max_tokens": 500,
"temperature": 0.7
}'
Example response
{
"id": "gen-abc123",
"object": "chat.completion",
"created": 1739000000,
"model": "anthropic/claude-sonnet-4.5",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Quantum computing is a type of computation that uses quantum mechanics..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 150,
"total_tokens": 175
}
}
Tool calling
The gateway supports function/tool calling and automatically repairs common issues such as duplicate tool calls and orphaned tool results.
Request with tools
{
"model": "anthropic/claude-sonnet-4.5",
"messages": [{ "role": "user", "content": "What's the weather in San Francisco?" }],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name"
}
},
"required": ["location"]
}
}
}
],
"tool_choice": "auto"
}
Tool call response
{
"choices": [
{
"message": {
"role": "assistant",
"content": null,
"tool_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\":\"San Francisco\"}"
}
}
]
},
"finish_reason": "tool_calls"
}
]
}
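After executing the tool, append the assistant turn and a tool result message (linked by tool_call_id) to the conversation and call the endpoint again. A sketch of the follow-up request body; the local getWeather implementation is hypothetical:
// Hypothetical local implementation of the tool.
const result = await getWeather("San Francisco")

const followUp = {
  model: "anthropic/claude-sonnet-4.5",
  messages: [
    { role: "user", content: "What's the weather in San Francisco?" },
    // Echo the assistant turn that requested the tool call.
    {
      role: "assistant",
      content: null,
      tool_calls: [
        {
          id: "call_abc123",
          type: "function",
          function: { name: "get_weather", arguments: '{"location":"San Francisco"}' },
        },
      ],
    },
    // Provide the tool output, linked by tool_call_id.
    { role: "tool", tool_call_id: "call_abc123", content: JSON.stringify(result) },
  ],
  tools: [/* same tool definitions as the original request */],
}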
Tool call repair
The gateway automatically handles common tool calling issues:
- Deduplication: Removes duplicate tool calls with the same ID
- Orphan cleanup: Removes tool result messages without matching tool calls
- Missing results: Inserts placeholder results for tool calls without responses
- ID normalization: Normalizes tool call IDs per provider requirements (Anthropic, Mistral)
FIM completions
Fill-in-the-middle (FIM) completions for code generation, powered by Mistral Codestral.
POST /api/fim/completions
Request body
type FIMRequest = {
model: string // Must be a Mistral model (e.g., "mistralai/codestral-2508")
prompt: string // Code before the cursor
suffix?: string // Code after the cursor
max_tokens?: number // Maximum tokens (capped at 1000)
temperature?: number
stop?: string[]
stream?: boolean
}
Example request
curl -X POST "https://api.kilo.ai/api/fim/completions" \
-H "Authorization: Bearer $KILO_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/codestral-2508",
"prompt": "def fibonacci(n):\n if n <= 1:\n return n\n ",
"suffix": "\n\nprint(fibonacci(10))",
"max_tokens": 200,
"stream": false
}'
FIM completions are limited to Mistral models (model IDs starting with mistralai/). Bring your own key (BYOK) is supported with the codestral key type.
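In an editor integration, prompt and suffix typically come from splitting the current buffer at the cursor. A minimal sketch; the buffer and cursorOffset variables are illustrative, not part of the API, and the response shape is not asserted here:
const buffer =
  "def fibonacci(n):\n    if n <= 1:\n        return n\n    \n\nprint(fibonacci(10))"
const cursorOffset = buffer.indexOf("\n\nprint") // cursor sits just before the trailing code

const prompt = buffer.slice(0, cursorOffset) // code before the cursor
const suffix = buffer.slice(cursorOffset)    // code after the cursor

const res = await fetch("https://api.kilo.ai/api/fim/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.KILO_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "mistralai/codestral-2508",
    prompt,
    suffix,
    max_tokens: 200,
    stop: ["\n\n"], // stop at the next blank line to keep completions short
  }),
})

const completion = await res.json()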
List models
Retrieve the list of available models.
GET /models
No authentication required.
Response
Returns an OpenAI-compatible model list:
{
"data": [
{
"id": "anthropic/claude-sonnet-4.5",
"object": "model",
"created": 1739000000,
"owned_by": "anthropic",
"name": "Claude Sonnet 4.5",
"context_length": 200000,
"pricing": {
"prompt": "0.000003",
"completion": "0.000015"
}
}
]
}
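Since the endpoint needs no authentication, model metadata can be fetched directly. A sketch that selects models with a large context window, using the field names shown in the response above:
const res = await fetch("https://api.kilo.ai/api/gateway/models")
const { data: models } = await res.json()

// Models with at least 100k tokens of context, cheapest prompt price first.
const longContext = models
  .filter((m: any) => m.context_length >= 100_000)
  .sort((a: any, b: any) => Number(a.pricing.prompt) - Number(b.pricing.prompt))

for (const m of longContext) {
  console.log(`${m.id}: ${m.context_length} tokens, $${m.pricing.prompt}/prompt token`)
}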
List providers
Retrieve the list of available providers.
GET /providers
No authentication required.
Error codes
| HTTP Status | Description |
|---|---|
| 400 | Bad request -- invalid parameters or model ID |
| 401 | Unauthorized -- invalid or missing API key |
| 402 | Insufficient balance -- add credits to continue |
| 403 | Forbidden -- model not allowed by organization policy |
| 429 | Rate limited -- too many requests |
| 500 | Internal server error |
| 502 | Provider error -- upstream provider returned an error |
| 503 | Service unavailable -- provider temporarily unavailable |
Error response format
{
"error": {
"message": "Human-readable error description",
"code": 400
}
}
When the gateway receives a 402 (Payment Required) from an upstream provider, it returns 503 to the client to avoid exposing internal billing details.
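A sketch of client-side error handling that surfaces the message field and retries transient failures (429, 502, 503) with exponential backoff; the retry policy is an example, not a gateway requirement:
async function completeWithRetry(body: object, maxAttempts = 3): Promise<any> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await fetch("https://api.kilo.ai/api/gateway/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.KILO_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    })

    if (res.ok) return res.json()

    const { error } = await res.json()

    // Retry only transient errors; other 4xx responses will not succeed on retry.
    const transient = [429, 502, 503].includes(res.status)
    if (!transient || attempt === maxAttempts) {
      throw new Error(`Gateway error ${res.status}: ${error.message}`)
    }
    await new Promise((r) => setTimeout(r, 2 ** attempt * 1000)) // exponential backoff: 2s, 4s, ...
  }
  throw new Error("unreachable")
}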
Context length errors
If your request exceeds the model's context window, you'll receive a descriptive error:
{
"error": {
"message": "This request exceeds the model's context window of 200000 tokens. Your request contains approximately 250000 tokens.",
"code": 400
}
}
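One common mitigation is to drop the oldest non-system messages and retry. A rough sketch that assumes the first message is the system prompt; the token estimate is a heuristic, not an exact count:
// Heuristic: roughly 4 characters per token for English text.
const estimateTokens = (messages: { content: string | null }[]) =>
  Math.ceil(messages.reduce((n, m) => n + (m.content?.length ?? 0), 0) / 4)

function trimToFit(messages: any[], contextWindow: number, reservedForOutput = 1000) {
  const [system, ...rest] = messages
  // Drop the oldest non-system messages until the estimate fits.
  while (rest.length > 1 && estimateTokens([system, ...rest]) > contextWindow - reservedForOutput) {
    rest.shift()
  }
  return [system, ...rest]
}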