Y0 Evaluation
Available
Graded, not vibes: score runs against rubrics before and after they ship.
[ 01 ] Spec sheet
Y0 Evaluation is the family that answers the question every team eventually asks: is it actually good? It is a set of judge models trained to score outputs against rubrics (faithfulness to attached context, completeness against the ask, tone fit, policy compliance, format validity) and calibrated against human raters so a 0.85 means the same thing next month as it does today.
It runs in two places. Offline, it powers the evals pillar: you assemble a suite of cases, pin it to a model version, and every release candidate gets graded before it can be promoted, turning 'the new model feels worse' into a diff you can read. Online, it samples production runs and scores them continuously, so quality regressions show up on a dashboard within hours instead of in a churned customer's exit interview.
Judgments are themselves traced: every score ships with the rubric line it applied and the evidence span it weighed, because an unexplained 0.4 is just another opinion. Teams also use it as a gate inside agent workflows: an output below threshold gets retried or routed to a human instead of sent.
Generally available; judge rubric packs for common tasks ship in the cookbook.
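The in-workflow gate described above (retry below threshold, then route to a human) can be sketched roughly as follows. This is a minimal illustration, not the product's API: `gate_output`, `GateResult`, and the `generate` and `judge` callables are hypothetical names, and the judge is assumed to return a score between 0 and 1.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GateResult:
    output: Optional[str]  # accepted output, or None when escalated to a human
    score: float           # score of the accepted output, or best score seen
    attempts: int          # number of generations tried
    escalated: bool        # True when no attempt reached the threshold


def gate_output(
    generate: Callable[[], str],     # hypothetical: produces a candidate output
    judge: Callable[[str], float],   # hypothetical: stands in for a judge model call
    threshold: float = 0.8,
    max_attempts: int = 3,
) -> GateResult:
    """Retry generation until the judge score clears the threshold,
    otherwise signal that the case should go to a human reviewer."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    best_score = float("-inf")
    for attempt in range(1, max_attempts + 1):
        candidate = generate()
        score = judge(candidate)
        if score >= threshold:
            return GateResult(candidate, score, attempt, False)
        best_score = max(best_score, score)
    # Nothing passed: withhold the output and escalate instead of sending it.
    return GateResult(None, best_score, max_attempts, True)
```

Returning an explicit `escalated` flag rather than raising keeps the routing decision in the calling workflow, which is one reasonable design among several.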
[ 02 ] Capabilities
Rubric-based scoring calibrated against human raters
Faithfulness checks that verify claims against attached context
Suite runs that gate model and prompt promotions in CI
Continuous sampling of production traffic with drift alerts
Explained judgments — every score cites rubric line and evidence
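To make the drift-alert capability concrete, here is one simple way such an alert could work: compare a rolling mean of sampled scores against a fixed baseline. The `DriftMonitor` class, its window size, and the `max_drop` tolerance are all assumptions chosen for illustration; the source does not say how drift is actually detected.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Hypothetical rolling-mean drift check over sampled judge scores."""

    def __init__(self, baseline: float, window: int = 50, max_drop: float = 0.05):
        self.baseline = baseline            # expected mean score, e.g. from an offline suite
        self.max_drop = max_drop            # tolerated drop before alerting (assumed value)
        self.scores = deque(maxlen=window)  # most recent sampled scores

    def observe(self, score: float) -> bool:
        """Record one sampled score; return True if the rolling mean has drifted."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough samples to judge yet
        return self.baseline - mean(self.scores) > self.max_drop
```

Waiting for a full window before alerting trades a little latency for fewer false alarms on small samples.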
[ 03 ] Best for
Regression-gating prompt and model changes before release
Production quality monitoring without manual spot checks
In-workflow gates that catch bad outputs before they send
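For the regression-gating use case, the "diff you can read" might look something like the sketch below: per-case scores for a baseline and a release candidate are compared, and any case that drops by more than a tolerance blocks promotion. The function names, the dictionary shape, and the 0.02 tolerance are illustrative assumptions, not part of the product.

```python
def suite_diff(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,  # assumed slack for judge noise
) -> dict[str, float]:
    """Return {case_id: score_delta} for cases that regressed beyond tolerance."""
    regressions = {}
    for case_id, base_score in baseline.items():
        if case_id not in candidate:
            raise KeyError(f"candidate run is missing case {case_id!r}")
        delta = candidate[case_id] - base_score
        if delta < -tolerance:
            regressions[case_id] = round(delta, 4)
    return regressions


def can_promote(baseline: dict[str, float], candidate: dict[str, float]) -> bool:
    """A release candidate is promotable only if no case regressed."""
    return not suite_diff(baseline, candidate)
```

In a CI job, a non-empty result from `suite_diff` would typically be printed and followed by a non-zero exit code.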
[ 04 ] Sample request
{
  "model": "y0-judge",
  "run_id": "run_4af2c19e",
  "rubric": "rb_faithfulness_v2",
  "threshold": 0.8,
  "on_fail": "flag"
}
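A client might sanity-check a request like the one above before sending it. The sketch below is an assumption-laden illustration: the set of required fields is inferred from the sample alone, the 0 to 1 threshold range is inferred from the scores quoted in the spec sheet, and `validate_request` is a hypothetical helper rather than part of any SDK. The accepted values of `on_fail` are not documented here, so only its type is checked.

```python
import json

# Field names and types inferred from the sample request (assumption).
REQUIRED_FIELDS = {
    "model": str,
    "run_id": str,
    "rubric": str,
    "threshold": (int, float),
    "on_fail": str,
}


def validate_request(raw: str) -> dict:
    """Parse a JSON request body and check the fields seen in the sample."""
    payload = json.loads(raw)
    for name, expected in REQUIRED_FIELDS.items():
        if name not in payload:
            raise ValueError(f"missing field: {name}")
        value = payload[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for field: {name}")
    if not 0.0 <= payload["threshold"] <= 1.0:
        raise ValueError("threshold must be between 0 and 1")  # assumed score scale
    return payload
```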