Benchmarking

ℹ️Status

Partial - inspected repositories show a Harbor-facing smoke-eval workflow and cloud model-eval-ingest promotion sync. Broader Harbor adapters, ATIF traces, Opik workflows, and commands remain unverified roadmap items.

Overview

Benchmarking should answer two questions:

How do models compare when used by same Kilo Code agent?
How do agents or Kilo Code versions compare when used with same model?

This page separates inspected repository evidence from roadmap. It does not guarantee private benchmark tooling, external adapters, or example commands are available to contributors.

ℹ️Info

Benchmarking is separate from production observability. Observability monitors real sessions. Benchmarking runs controlled evaluation tasks.

Current evidence

Capability	Status	Evidence and limits
Harbor-facing smoke eval	Current workflow	`.github/workflows/smoke-test.yml` checks out private `Kilo-Org/kilo-bench`, installs dependencies, and runs two smoke tasks through repository scripts
CLI release smoke coverage	Current workflow	Workflow can test latest npm CLI or requested release asset before validating results
Smoke result artifacts	Current workflow	Workflow uploads result, trajectory, and agent setup files for inspection
Cloud model eval ingest	Current service	Static source inspection found `services/model-eval-ingest/` promotion sync surface
Private `kilo-bench` internals	Not verified here	Private repository scripts, adapter behavior, and supported local commands are outside inspected docs scope
Live production enablement	Not verified here	Static source does not prove deployment, rollout, retention, or vendor configuration

Roadmap

Capability	Status	Intended use
Contributor-facing Harbor adapter	Unverified roadmap	Run Kilo CLI autonomously in controlled evaluation environments
ATIF trajectory adapter	Unverified roadmap	Emit structured step-level traces for comparison
Opik integration	Unverified roadmap	Ingest traces and compare evaluation runs
Standard model comparison workflow	Planned	Compare quality, cost, and wall-clock time across models
Standard agent comparison workflow	Planned	Compare agents or Kilo Code versions on same tasks
Custom task-set template	Planned	Build focused regression or capability suites
CI regression suite beyond smoke eval	Planned	Run stable subset before release

Inspected smoke-eval workflow

Current repository workflow runs small smoke evaluation after checking out private benchmark repository. It uses private repository script ./scripts/run_eval.sh, validates output with scripts/validate_smoke_test.py, and uploads selected artifacts.

Task	Dataset selection	Expected scope recorded in workflow
`hello-world`	`hello-world`	Small smoke task
`log-summary-date-ranges`	`terminal-bench-sample` with included task name	Small terminal benchmark sample

This evidence shows smoke coverage exists. It does not establish public Harbor adapter contract or contributor-ready local CLI.

Cloud model-eval-ingest evidence

Static source inspection found cloud model-eval-ingest service for promotion sync. Treat this as current repository-defined surface only. Validate deployed environment and operational behavior separately before making production claims.

Proposed evaluation design

Broader design can use open-source evaluation components if adapter availability is verified during implementation.

Component	Roadmap role	Verification needed
Harbor	Evaluation harness and datasets	Confirm supported Kilo adapter and invocation contract
ATIF	Structured trajectories	Confirm emitted fields and reasoning-data policy
Opik	Trace ingestion and analysis	Confirm Harbor integration setup and Kilo adapter support
Terminal-Bench or other datasets	Controlled tasks	Confirm versions, licensing, and task selection

Potential architecture:

Evaluation task set
  -> controlled trial environment
  -> verified Kilo adapter
  -> model request
  -> result and optional trajectory artifacts
  -> smoke validation, aggregate analysis, or trace analysis

Proposed comparison dimensions

Comparison	Fixed input	Variable	Measures
Model comparison	Kilo Code agent and task set	Model	Completion, cost, and wall-clock time
Agent comparison	Model and task set	Agent or Kilo Code version	Completion, cost, and wall-clock time
Trace analysis	Evaluation task	Run trajectory	Tool choices, errors, and repeated steps

Command verification requirement

Do not document opik harbor run -a kilo, kilo --auto, or kilo run --auto as ready-to-run interfaces until adapter and autonomous CLI invocation are verified in relevant repository. Private kilo-bench workflow commands are implementation evidence, not public usage guarantees.

Future deliverables

Verify and document supported autonomous CLI invocation
Verify Harbor adapter ownership and availability
Define ATIF export fields and data-handling policy
Validate Opik ingestion path before publishing commands
Publish contributor workflow only after local reproduction succeeds
Expand smoke coverage into stable regression subset where cost and runtime allow