Evals
Quality as a gate, not a feeling — suites, rubrics, and promotion that requires proof.
The evals pillar is how changes earn their way to production. An eval suite is a versioned collection of cases — real inputs, expected properties of the output, and a rubric — that you pin to anything that can regress: a prompt, a workflow, a fine-tuned adapter, a model version. When any of those changes, the suite runs, y0-judge scores every case against the rubric, and the result is a diff: which cases improved, which regressed, and by how much, with each judgment citing the rubric line and evidence span it applied. Promotion is mechanical — if the candidate clears the thresholds you set, it ships; if not, it does not, and no amount of 'it feels better' overrides the gate. The same machinery runs continuously against production: a configurable sample of live runs is scored within minutes, charted per workflow, and alerting fires on sustained degradation rather than single bad outputs. Building suites is deliberately cheap, because eval coverage behaves like test coverage: any trace can be promoted to a case in one click, corrected outputs become expected properties automatically, and the cookbook ships rubric packs for the common shapes — summarization faithfulness, extraction exactness, tone compliance — so teams start from a standard instead of a blank page.
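The diff-and-gate step described above can be sketched in a few lines. This is a minimal illustration, not the product's implementation: the names `Thresholds` and `diff_and_gate`, the `epsilon` noise band, and the two threshold fields are all assumptions made for the example, and per-case scores are assumed to have been produced by the judge already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical gate settings; the real threshold fields are not specified here."""
    min_score: float       # candidate mean score must reach at least this
    max_regressed: int     # at most this many cases may score worse than the incumbent
    epsilon: float = 0.01  # score changes smaller than this count as unchanged


def diff_and_gate(incumbent: dict[str, float],
                  candidate: dict[str, float],
                  t: Thresholds) -> dict:
    """Compare per-case scores and decide promotion from thresholds alone."""
    if not candidate or incumbent.keys() != candidate.keys():
        raise ValueError("both runs must cover the same non-empty set of cases")

    case_ids = sorted(candidate)
    improved = [c for c in case_ids if candidate[c] - incumbent[c] > t.epsilon]
    regressed = [c for c in case_ids if incumbent[c] - candidate[c] > t.epsilon]

    cand_mean = sum(candidate.values()) / len(candidate)
    inc_mean = sum(incumbent.values()) / len(incumbent)

    # The decision depends only on the thresholds, never on a manual override.
    promote = cand_mean >= t.min_score and len(regressed) <= t.max_regressed

    return {
        "improved": improved,
        "regressed": regressed,
        "score": {"candidate": round(cand_mean, 2), "incumbent": round(inc_mean, 2)},
        "promote": promote,
    }


if __name__ == "__main__":
    incumbent_scores = {"case-1": 0.80, "case-2": 0.90, "case-3": 0.70}
    candidate_scores = {"case-1": 0.95, "case-2": 0.90, "case-3": 0.60}
    print(diff_and_gate(incumbent_scores, candidate_scores,
                        Thresholds(min_score=0.8, max_regressed=1)))
```

Keeping the gate a pure function of scores and thresholds is what makes promotion mechanical: the same inputs always give the same decision.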
[ 01 ] Key features
Suites pinned to what can regress
Prompts, workflows, adapters, and model versions each carry a suite — change anything and the suite runs before promotion.
Diffs, not vibes
Every candidate produces a case-by-case comparison against the incumbent, with explained judgments citing rubric and evidence.
Continuous production sampling
Live runs are scored on a sliding sample; sustained quality drift raises an alert within hours, not at renewal time.
Traces become cases
Promote any trace to an eval case in one click — coverage grows out of real work instead of invented examples.
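One way to alert on sustained degradation rather than on a single bad output is to watch the share of low-scoring runs in a sliding window. The sketch below is an assumption about how such a check could work, not a description of the product's alerting: `DriftMonitor`, `floor`, `window`, and `max_low_fraction` are hypothetical names, and scores are assumed to arrive one per sampled live run.

```python
from collections import deque


class DriftMonitor:
    """Fires only when low scores dominate the recent window (hypothetical sketch)."""

    def __init__(self, floor: float, window: int = 20, max_low_fraction: float = 0.5):
        self.floor = floor                        # scores below this count as "low"
        self.max_low_fraction = max_low_fraction  # alert when the low share exceeds this
        self.recent = deque(maxlen=window)        # True for each low run, oldest dropped first

    def observe(self, score: float) -> bool:
        """Record one sampled run's score; return True if an alert should fire."""
        self.recent.append(score < self.floor)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough samples yet to call anything sustained
        low_fraction = sum(self.recent) / len(self.recent)
        return low_fraction > self.max_low_fraction


if __name__ == "__main__":
    monitor = DriftMonitor(floor=0.8, window=4, max_low_fraction=0.5)
    for score in [0.9, 0.9, 0.2, 0.9, 0.3, 0.4]:
        print(score, monitor.observe(score))
```

With a window of 20, one bad output accounts for 5% of the window and cannot trip the alert on its own; only a run of poor scores can.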
[ 02 ] [ suite result ]
{
"suite": "weekly-brief-v4",
"candidate": "prompt@9f31",
"incumbent": "prompt@8aa2",
"cases": 64,
"improved": 11,
"regressed": 2,
"score": { "candidate": 0.91, "incumbent": 0.88 },
"gate": "pass — promoted"
}
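A result like the one above is easy to consume from a script, for example a CI step that blocks a release unless the gate passed. The following is a sketch under stated assumptions: the helper `summarize` is hypothetical, cases not counted as improved or regressed are assumed to be unchanged, and a gate string starting with "pass" is assumed to mean the candidate was promoted.

```python
import json

# The suite result shown above, embedded as a JSON string for the example.
# The dash in the gate value is written as the JSON escape \u2014.
RESULT_JSON = r"""
{
  "suite": "weekly-brief-v4",
  "candidate": "prompt@9f31",
  "incumbent": "prompt@8aa2",
  "cases": 64,
  "improved": 11,
  "regressed": 2,
  "score": { "candidate": 0.91, "incumbent": 0.88 },
  "gate": "pass \u2014 promoted"
}
"""


def summarize(result: dict) -> dict:
    """Derive a few figures from a suite result (field meanings are assumed)."""
    # Assumption: every case is exactly one of improved, regressed, or unchanged.
    unchanged = result["cases"] - result["improved"] - result["regressed"]
    delta = result["score"]["candidate"] - result["score"]["incumbent"]
    return {
        "unchanged": unchanged,
        "score_delta": round(delta, 2),
        # Assumption: the gate field begins with "pass" when promotion happened.
        "passed": result["gate"].startswith("pass"),
    }


if __name__ == "__main__":
    summary = summarize(json.loads(RESULT_JSON))
    print(summary)
    # A CI step could exit non-zero here when the gate did not pass.
    raise SystemExit(0 if summary["passed"] else 1)
```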