Evaluations

if it isn't measured, it isn't a capability

Every routing decision in the runtime — which model handles which task — is backed by an evaluation, because 'the bigger model is probably better' is a cost decision pretending to be a quality decision. Our evals are narrow on purpose: they test the tasks Y0 actually performs, against rubrics written from real user workflows, not academic benchmarks. The honest caveat is that evals are a floor, not a ceiling. A model that passes our suite can still fail a user in ways the suite never imagined, and public benchmark scores tell you almost nothing about whether a model can prepare an invoice without inventing a line item. We publish our methodology rather than our scores, because scores without the rubric are advertising.

[ what we actually run ]

[01]

Task-level eval suites, not benchmark scores

Each task family in the runtime — summarize, draft, extract, schedule, reconcile — has its own suite built from anonymized, consented real workflows. A model earns routing eligibility per task family, never globally.

[02]

Regression gates on every routing change

Changing which model serves a task requires the candidate to beat the incumbent on the task suite and tie-or-beat it on cost and latency. Routing changes ship with the eval diff attached, so the decision is auditable later.

[03]

Hallucination accounting for factual tasks

For tasks that assert facts from user context — amounts, dates, names — the suite measures fabrication explicitly: claims in the output are checked against the source context, and any unsupported claim is a scored failure, weighted heavier than an omission.

[04]

Drift checks on a schedule, not on suspicion

Hosted models change underneath their version strings. The full suite re-runs on every task family weekly and on any provider model update we detect, so quality drift is caught by instrumentation rather than by user complaints.

[ open questions — honestly ]

  • Our rubrics encode our judgment of a good answer, and our users sometimes disagree with our judgment. We have not solved how to fold individual user preference back into evals without overfitting the product to its loudest users.
  • Eval suites built from real workflows go stale as workflows evolve. We do not yet have a reliable signal for when a suite has quietly stopped representing the job it claims to measure.