Benchmarking

ℹ️Status

Partial. The inspected repositories show a Harbor-facing smoke-eval workflow and a cloud model-eval-ingest promotion sync. Broader Harbor adapters, ATIF traces, Opik workflows, and example commands remain unverified roadmap items.

Overview

Benchmarking should answer two questions:

  1. How do models compare when used by the same Kilo Code agent?
  2. How do agents or Kilo Code versions compare when used with the same model?

This page separates inspected repository evidence from roadmap items. It does not guarantee that private benchmark tooling, external adapters, or example commands are available to contributors.

ℹ️Info

Benchmarking is separate from production observability. Observability monitors real sessions. Benchmarking runs controlled evaluation tasks.

Current evidence

| Capability | Status | Evidence and limits |
|---|---|---|
| Harbor-facing smoke eval | Current workflow | .github/workflows/smoke-test.yml checks out private Kilo-Org/kilo-bench, installs dependencies, and runs two smoke tasks through repository scripts |
| CLI release smoke coverage | Current workflow | Workflow can test latest npm CLI or requested release asset before validating results |
| Smoke result artifacts | Current workflow | Workflow uploads result, trajectory, and agent setup files for inspection |
| Cloud model eval ingest | Current service | Static source inspection found services/model-eval-ingest/ promotion sync surface |
| Private kilo-bench internals | Not verified here | Private repository scripts, adapter behavior, and supported local commands are outside inspected docs scope |
| Live production enablement | Not verified here | Static source does not prove deployment, rollout, retention, or vendor configuration |

Roadmap

| Capability | Status | Intended use |
|---|---|---|
| Contributor-facing Harbor adapter | Unverified roadmap | Run Kilo CLI autonomously in controlled evaluation environments |
| ATIF trajectory adapter | Unverified roadmap | Emit structured step-level traces for comparison |
| Opik integration | Unverified roadmap | Ingest traces and compare evaluation runs |
| Standard model comparison workflow | Planned | Compare quality, cost, and wall-clock time across models |
| Standard agent comparison workflow | Planned | Compare agents or Kilo Code versions on same tasks |
| Custom task-set template | Planned | Build focused regression or capability suites |
| CI regression suite beyond smoke eval | Planned | Run stable subset before release |

Inspected smoke-eval workflow

The current repository workflow runs a small smoke evaluation after checking out the private benchmark repository. It uses the private repository script ./scripts/run_eval.sh, validates output with scripts/validate_smoke_test.py, and uploads selected artifacts.

| Task | Dataset selection | Expected scope recorded in workflow |
|---|---|---|
| hello-world | hello-world | Small smoke task |
| log-summary-date-ranges | terminal-bench-sample with included task name | Small terminal benchmark sample |

This evidence shows that smoke coverage exists. It does not establish a public Harbor adapter contract or a contributor-ready local CLI.

Cloud model-eval-ingest evidence

Static source inspection found a cloud model-eval-ingest service for promotion sync. Treat this as a current repository-defined surface only. Validate the deployed environment and operational behavior separately before making production claims.

Proposed evaluation design

A broader design can use open-source evaluation components, provided adapter availability is verified during implementation.

| Component | Roadmap role | Verification needed |
|---|---|---|
| Harbor | Evaluation harness and datasets | Confirm supported Kilo adapter and invocation contract |
| ATIF | Structured trajectories | Confirm emitted fields and reasoning-data policy |
| Opik | Trace ingestion and analysis | Confirm Harbor integration setup and Kilo adapter support |
| Terminal-Bench or other datasets | Controlled tasks | Confirm versions, licensing, and task selection |

Potential architecture:

Evaluation task set
  -> controlled trial environment
  -> verified Kilo adapter
  -> model request
  -> result and optional trajectory artifacts
  -> smoke validation, aggregate analysis, or trace analysis
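The flow above can be read as plain data passing between stages. The following is a minimal sketch under stated assumptions, not the kilo-bench or Harbor implementation: `Task`, `TrialResult`, `fake_adapter`, `run_trials`, and `validate_smoke` are hypothetical names, and the adapter is a stub that stands in for a verified Kilo adapter. It makes no model request and calls no real CLI.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Task:
    """One evaluation task (hypothetical shape)."""
    name: str
    prompt: str


@dataclass
class TrialResult:
    """Outcome of one trial (hypothetical shape)."""
    task: str
    model: str
    passed: bool
    # Optional step log; real trajectory formats are not verified here.
    trajectory: list[str] = field(default_factory=list)


# An adapter takes a task and a model name and returns a result.
Adapter = Callable[[Task, str], TrialResult]


def fake_adapter(task: Task, model: str) -> TrialResult:
    """Stub standing in for a verified adapter; performs no model request."""
    steps = [f"read:{task.name}", f"answer:{task.name}"]
    return TrialResult(task=task.name, model=model, passed=True, trajectory=steps)


def run_trials(tasks: list[Task], model: str, adapter: Adapter) -> list[TrialResult]:
    """Run each task once through the adapter and collect the results."""
    return [adapter(task, model) for task in tasks]


def validate_smoke(results: list[TrialResult], expected: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the run looks sane."""
    problems = []
    seen = {r.task for r in results}
    for name in sorted(expected - seen):
        problems.append(f"missing result for task {name}")
    for r in results:
        if not r.passed:
            problems.append(f"task {r.task} did not pass")
    return problems
```

Keeping the adapter behind a plain callable means the same task set and validation step could be reused once a real adapter is verified, without changing the surrounding stages.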

Proposed comparison dimensions

| Comparison | Fixed input | Variable | Measures |
|---|---|---|---|
| Model comparison | Kilo Code agent and task set | Model | Completion, cost, and wall-clock time |
| Agent comparison | Model and task set | Agent or Kilo Code version | Completion, cost, and wall-clock time |
| Trace analysis | Evaluation task | Run trajectory | Tool choices, errors, and repeated steps |
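One way to compute the completion, cost, and wall-clock measures is to group trial records by the variable under comparison while checking that the fixed input really is fixed. The sketch below assumes a hypothetical `Trial` record. Its field names (`model`, `agent_version`, `task`, `passed`, `cost_usd`, `wall_clock_s`) are illustrative and do not reflect the result format of kilo-bench, Harbor, or Opik.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Trial:
    """One evaluation trial (hypothetical shape, not a real result schema)."""
    model: str
    agent_version: str
    task: str
    passed: bool
    cost_usd: float
    wall_clock_s: float


def summarize(trials, variable, fixed):
    """Aggregate trials per value of `variable`, requiring `fixed` to be constant.

    For a model comparison, use variable="model" and fixed="agent_version".
    For an agent comparison, swap the two.
    """
    fixed_values = {getattr(t, fixed) for t in trials}
    if len(fixed_values) > 1:
        raise ValueError(f"{fixed} must be constant, got {sorted(fixed_values)}")

    # Bucket trials by the value of the variable being compared.
    groups = defaultdict(list)
    for t in trials:
        groups[getattr(t, variable)].append(t)

    return {
        key: {
            "completion_rate": sum(t.passed for t in ts) / len(ts),
            "mean_cost_usd": mean(t.cost_usd for t in ts),
            "mean_wall_clock_s": mean(t.wall_clock_s for t in ts),
        }
        for key, ts in groups.items()
    }
```

Calling `summarize(trials, variable="model", fixed="agent_version")` returns one summary per model. A fuller check would also confirm that every group ran the same task set, since the comparison is only fair when tasks are held constant too.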

Command verification requirement

Do not document opik harbor run -a kilo, kilo --auto, or kilo run --auto as ready-to-run interfaces until the adapter and autonomous CLI invocation are verified in the relevant repository. The private kilo-bench workflow commands are implementation evidence, not public usage guarantees.

Future deliverables

  • Verify and document supported autonomous CLI invocation
  • Verify Harbor adapter ownership and availability
  • Define ATIF export fields and data-handling policy
  • Validate Opik ingestion path before publishing commands
  • Publish a contributor workflow only after local reproduction succeeds
  • Expand smoke coverage into a stable regression subset where cost and runtime allow

References