Kilo Code - Agent Observability
Problem Statement
Agentic coding systems like Kilo Code operate with significant autonomy, executing multi-step tasks that involve LLM inference, tool execution, file manipulation, and external API calls. These systems mix traditional request/response interactions (covered by standard systems observability) with agentic behavior (planning, reasoning, and tool use).
At the lower level, we can observe the system as a traditional API, but at the higher level, we need to observe the agent's behavior and the quality of its outputs.
Some examples of customer-facing error modes:
- Model API calls may be slow or fail due to rate limits, network issues, or model unavailability
- Model API calls may produce invalid JSON or malformed responses
- An agent may get stuck in a loop, repeatedly attempting the same failing operation
- Sessions may degrade gradually as context windows fill up
- The agent may complete a task technically but produce incorrect or unhelpful output
- Users may abandon sessions out of frustration without explicit error signals
All of these failure modes degrade the overall reliability and user experience of the system.
Goals
- Detect and alert on acute incidents within minutes
- Surface slow-burn degradations within hours
- Facilitate root cause analysis when issues occur
- Track quality and efficiency trends over time
- Build a foundation for continuous improvement of the agent
Non-goals for this proposal:
- Automated remediation
- A/B testing infrastructure
Proposed Approach
Focus on lower-level systems observability first, then build up to higher-level observability of agentic behavior.
Phase 1: Systems Observability
Objective: Establish awareness and alerting for hard failures.
This phase focuses on systems metrics we can capture with minimal changes, providing immediate operational visibility.
Phase 1a: LLM observability and alerting
Metrics to Capture
Capture these metrics per LLM API call (a record sketch follows the list):
- Provider
- Model
- Tool
- Latency
- Success / Failure
- Error type and message (if failed)
- Token counts
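The sketch below shows one possible shape for this per-call record; the interface and field names are assumptions for illustration, not an existing Kilo Code type.

```typescript
// Illustrative shape for a single LLM API call record.
// Field names are assumptions for this sketch, not an existing Kilo Code type.
export interface LlmCallMetric {
  provider: string;      // e.g. the upstream provider or the Kilo Gateway
  model: string;         // model identifier as reported by the provider
  tool: string;          // tool (or task step) that triggered the call
  latencyMs: number;     // wall-clock time from request to final token
  success: boolean;      // false on HTTP errors, timeouts, or malformed responses
  errorType?: string;    // e.g. "rate_limit", "network", "invalid_json"
  errorMessage?: string; // provider-supplied message, truncated for storage
  inputTokens: number;
  outputTokens: number;
  timestamp: string;     // ISO 8601, for windowed aggregation
}
```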
Dashboards
Common dashboards, each filterable by provider, model, and tool:
- Error rate
- Latency
- Token usage
Alerting
Implement multi-window, multi-burn-rate alerting against error budgets (a worked evaluation sketch follows the table):
| Window | Burn Rate | Action | Use Case |
|---|---|---|---|
| 5 min | 14.4x | Page | Major Outage |
| 30 min | 6x | Page | Incident |
| 6 hr | 1x | Ticket | Change in behavior |
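The sketch below shows how these rules could be evaluated, assuming a 99.9% availability SLO (0.1% error budget); the SLO value, types, and function names are illustrative rather than an existing implementation.

```typescript
// Evaluate the burn-rate table above against a 99.9% availability SLO.
type AlertAction = "page" | "ticket" | "none";

interface BurnRateRule {
  windowMinutes: number; // trailing evaluation window
  burnRate: number;      // multiple of the allowed error-budget consumption rate
  action: Exclude<AlertAction, "none">;
}

const RULES: BurnRateRule[] = [
  { windowMinutes: 5, burnRate: 14.4, action: "page" },   // major outage
  { windowMinutes: 30, burnRate: 6, action: "page" },     // incident
  { windowMinutes: 360, burnRate: 1, action: "ticket" },  // change in behavior
];

const ERROR_BUDGET = 0.001; // 1 - 0.999 SLO: 0.1% of requests may fail

// errorRateInWindow(minutes) stands in for a query against the metrics backend
// returning the observed error rate over the trailing window.
function evaluate(errorRateInWindow: (minutes: number) => number): AlertAction {
  for (const rule of RULES) {
    const observed = errorRateInWindow(rule.windowMinutes);
    if (observed / ERROR_BUDGET >= rule.burnRate) {
      return rule.action; // paging rules are listed first, so they take precedence
    }
  }
  return "none";
}
```

For intuition: at a 99.9% SLO over a 30-day period, a 14.4x burn rate corresponds to roughly a 1.44% observed error rate and would exhaust the monthly error budget in about two days, which is why it pages immediately.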
Paging should occur only for Recommended Models accessed via the Kilo Gateway. All other alerts should open tickets, and some may be configured to be ignored.
Initial alert conditions (a per-dimension SLO configuration sketch follows this list):
- LLM API error rate exceeds SLO (per tool/model/provider)
- Tool error rate exceeds SLO (per tool/model/provider)
- p50/p90 latency exceeds SLO (per tool/model/provider)
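A sketch of how per-dimension SLO targets and the paging policy above could be expressed in configuration; the dimension values (e.g. a `kilo-gateway` provider key), thresholds, and type names are all assumptions for illustration.

```typescript
// Hypothetical per-dimension SLO targets; more specific entries are checked first.
interface SloTarget {
  provider?: string;          // omit to match any provider
  model?: string;
  tool?: string;
  maxErrorRate: number;       // e.g. 0.01 for 1%
  latencyMs: { p50: number; p90: number };
  onBreach: "page" | "ticket" | "ignore";
}

const SLO_TARGETS: SloTarget[] = [
  // Recommended Models on the Kilo Gateway: the only paging targets.
  {
    provider: "kilo-gateway", model: "recommended",
    maxErrorRate: 0.01, latencyMs: { p50: 2_000, p90: 8_000 }, onBreach: "page",
  },
  // Everything else files tickets by default (specific entries may set "ignore").
  { maxErrorRate: 0.05, latencyMs: { p50: 5_000, p90: 20_000 }, onBreach: "ticket" },
];
```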
Phase 1b: Session metrics
Metrics to Capture
Per-session metrics, aggregated at session close or timeout (a record sketch follows the list):
- Session duration
- Time from user input to first model response
- Total turns/steps
- Total tool calls by tool type
- Total errors by error type
- Agent-stuck errors (repetitive tool calls, etc.)
- Tool call errors
- Total tokens consumed
- Context condensing frequency
- Termination reason (user closed, timeout, explicit completion, error)
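One possible shape for the aggregated session record; the interface, field names, and example tool identifiers are illustrative, not an existing Kilo Code type.

```typescript
// Illustrative per-session record, emitted at session close or timeout.
export interface SessionMetric {
  sessionId: string;
  durationMs: number;
  timeToFirstModelResponseMs: number;
  totalTurns: number;
  toolCallsByType: Record<string, number>;  // e.g. { read_file: 12, execute_command: 3 }
  errorsByType: Record<string, number>;
  agentStuckErrors: number;                 // repetitive tool calls, oscillation, etc.
  toolCallErrors: number;
  totalTokens: number;
  contextCondensingCount: number;
  terminationReason: "user_closed" | "timeout" | "explicit_completion" | "error";
}
```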
Alerting
None.
Phase 2: Agent Tool Usage
Objective: Understand how agents use tools within a session and detect problematic usage patterns such as loops.
Metrics to Capture
Loop and repetition detection (see the sketch after the progress indicators):
- Count of identical tool calls within a session (same tool + same arguments)
- Count of identical failing tool calls (same tool + same arguments + same error)
- Detection of oscillation patterns (alternating between two states)
Progress indicators:
- Unique files touched per session
- Unique tools used per session
- Ratio of repeated to unique operations
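A minimal sketch of the loop and repetition counters above; the types and helper names are assumptions, and a production version would canonicalize tool arguments more carefully before comparing them.

```typescript
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
  error?: string; // present when the call failed
}

// Key a call by tool name plus serialized arguments. This assumes argument
// objects are built with a stable key order; a canonical serializer would be
// more robust.
function callKey(call: ToolCall): string {
  return `${call.tool}:${JSON.stringify(call.args)}`;
}

function repetitionStats(calls: ToolCall[]) {
  const identical = new Map<string, number>();
  const identicalFailing = new Map<string, number>();

  for (const call of calls) {
    const key = callKey(call);
    identical.set(key, (identical.get(key) ?? 0) + 1);
    if (call.error !== undefined) {
      // Same tool + same arguments + same error.
      const failKey = `${key}:${call.error}`;
      identicalFailing.set(failKey, (identicalFailing.get(failKey) ?? 0) + 1);
    }
  }

  // Oscillation: the last four calls alternate between exactly two keys (A, B, A, B).
  const recent = calls.slice(-4).map(callKey);
  const oscillating =
    recent.length === 4 &&
    recent[0] === recent[2] &&
    recent[1] === recent[3] &&
    recent[0] !== recent[1];

  return {
    maxIdenticalCalls: Math.max(0, ...identical.values()),
    maxIdenticalFailingCalls: Math.max(0, ...identicalFailing.values()),
    oscillating,
  };
}
```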
Alerting
None to start; we will first learn from the data these metrics produce.
Phase 3: Session Outcome Tracking
Objective: Understand whether sessions are successful from the user's perspective.
Hard errors and behavior metrics tell us about failures, but we also need signal on overall session health.
Metrics to Capture
Explicit signals:
- User feedback (thumbs up/down) rate and sentiment
- User abandonment patterns (session ends mid-task without completion signal)
Implicit signals:
These may require LLM analysis of session transcripts to detect (a classification sketch follows the list):
- Session termination classification (completed, abandoned, errored, timed out)
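A sketch of LLM-based termination classification, assuming a generic `complete(prompt)` text-completion helper is available; the prompt wording, truncation limit, and names are illustrative.

```typescript
type Termination = "completed" | "abandoned" | "errored" | "timed_out";

const LABELS: Termination[] = ["completed", "abandoned", "errored", "timed_out"];

async function classifyTermination(
  transcript: string,
  complete: (prompt: string) => Promise<string>, // hypothetical LLM completion helper
): Promise<Termination | "unknown"> {
  const prompt =
    `Classify how this coding session ended. Answer with exactly one of: ` +
    `${LABELS.join(", ")}.\n\nTranscript (tail):\n${transcript.slice(-8000)}`; // keep only the tail to fit the context window
  const answer = (await complete(prompt)).trim().toLowerCase();
  return (LABELS as string[]).includes(answer) ? (answer as Termination) : "unknown";
}
```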