Evals

Quality as a gate, not a feeling — suites, rubrics, and promotion that requires proof.

The evals pillar is how changes earn their way to production. An eval suite is a versioned collection of cases — real inputs, expected properties of the output, and a rubric — that you pin to anything that can regress: a prompt, a workflow, a fine-tuned adapter, a model version. When any of those changes, the suite runs, y0-judge scores every case against the rubric, and the result is a diff: which cases improved, which regressed, and by how much, with each judgment citing the rubric line and evidence span it applied. Promotion is mechanical — if the candidate clears the thresholds you set, it ships; if not, it does not, and no amount of 'it feels better' overrides the gate. The same machinery runs continuously against production: a configurable sample of live runs is scored within minutes, charted per workflow, and alerting fires on sustained degradation rather than single bad outputs. Building suites is deliberately cheap, because eval coverage behaves like test coverage: any trace can be promoted to a case in one click, corrected outputs become expected properties automatically, and the cookbook ships rubric packs for the common shapes — summarization faithfulness, extraction exactness, tone compliance — so teams start from a standard instead of a blank page.

[ 01 ]Key features

Suites pinned to what can regress

Prompts, workflows, adapters, and model versions each carry a suite — change anything and the suite runs before promotion.

Diffs, not vibes

Every candidate produces a case-by-case comparison against the incumbent, with explained judgments citing rubric and evidence.

Continuous production sampling

Live runs are scored on a sliding sample; sustained quality drift alerts within hours, not at renewal time.

Traces become cases

Promote any trace to an eval case in one click — coverage grows out of real work instead of invented examples.

[ 02 ][ suite result ]

{
  "suite": "weekly-brief-v4",
  "candidate": "prompt@9f31",
  "incumbent": "prompt@8aa2",
  "cases": 64,
  "improved": 11,
  "regressed": 2,
  "score": { "candidate": 0.91, "incumbent": 0.88 },
  "gate": "pass — promoted"
}

Getting started

Keep exploring

[next]Observabilitypillar 08

[prev]Agentspillar 06