Y0 Evaluation

available

Graded, not vibes — score runs against rubrics before and after they ship.

[ 01 ]Spec sheet

statusavailable
context window32k tokens per judgment
latency p50650 ms per judgment
price$1.00 in / $4.00 out

Y0 Evaluation is the family that answers the question every team eventually asks: is it actually good? It is a set of judge models trained to score outputs against rubrics — faithfulness to attached context, completeness against the ask, tone fit, policy compliance, format validity — and calibrated against human raters so a 0.85 means the same thing next month as it does today. It runs in two places. Offline, it powers the evals pillar: you assemble a suite of cases, pin it to a model version, and every release candidate gets graded before it can be promoted, turning 'the new model feels worse' into a diff you can read. Online, it samples production runs and scores them continuously, so quality regressions show up on a dashboard within hours instead of in a churned customer's exit interview. Judgments are themselves traced — every score ships with the rubric line it applied and the evidence span it weighed, because an unexplained 0.4 is just another opinion. Teams also use it as a gate inside agent workflows: an output below threshold gets retried or routed to a human instead of sent. Generally available; judge rubric packs for common tasks ship in the cookbook.

[ 02 ]Capabilities

Rubric-based scoring calibrated against human raters

Faithfulness checks that verify claims against attached context

Suite runs that gate model and prompt promotions in CI

Continuous sampling of production traffic with drift alerts

Explained judgments — every score cites rubric line and evidence

[ 03 ]Best for

Regression-gating prompt and model changes before release

Production quality monitoring without manual spot checks

In-workflow gates that catch bad outputs before they send

[ 04 ]Sample request

[ request ]application/json
{
  "model": "y0-judge",
  "run_id": "run_4af2c19e",
  "rubric": "rb_faithfulness_v2",
  "threshold": 0.8,
  "on_fail": "flag"
}