Y0 Benchmarks

The numbers we hold ourselves to, published as measured.

[ 01 ] Spec sheet

status: available

This page is not a model — it is the scoreboard. Mynd maintains an internal benchmark suite that every Y0 release must clear before promotion, and we publish the current results here rather than quoting the public leaderboards everyone has learned to discount. The suite is built from the work the platform actually does: long-document faithfulness, schema-exact extraction, plan quality under enforced step ceilings, retrieval precision over realistic private corpora, and end-to-end agent task completion with scope checks on. Two columns matter: y0-fast, the interactive profile most requests use, and y0-deep, the deliberate profile behind reasoning and agent planning. Numbers are re-measured on every release by the evaluation family, judged against frozen rubrics, and the history is kept — when a number moves, the changelog says why. Read the notes column; a benchmark without its caveat is an advertisement.
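The rule that a moved number must be explained in the changelog can be sketched as a small check between two releases. This is an illustration only, not Mynd's tooling: the function name, the dictionary layout, and the "previous" scores are all hypothetical, and only the "current" values are taken from the table on this page.

```python
def unexplained_moves(previous, current, changelog):
    """Return (benchmark, profile) keys whose score changed between two
    releases without a matching changelog entry. Hypothetical helper."""
    moved = []
    for key, new_score in current.items():
        old_score = previous.get(key)
        if old_score is None:
            continue  # new benchmark, nothing to compare against
        if old_score != new_score and key not in changelog:
            moved.append(key)
    return sorted(moved)


# "previous" values are invented for illustration; "current" values
# are the published y0-fast numbers.
previous = {
    ("LongDoc-Faithful", "y0-fast"): 90.8,
    ("SchemaExact", "y0-fast"): 96.8,
}
current = {
    ("LongDoc-Faithful", "y0-fast"): 91.2,
    ("SchemaExact", "y0-fast"): 96.8,
}

missing = unexplained_moves(previous, current, changelog={})
explained = unexplained_moves(
    previous,
    current,
    changelog={("LongDoc-Faithful", "y0-fast"): "hypothetical note"},
)
print(missing)
print(explained)
```

With an empty changelog the LongDoc-Faithful move is flagged; once an entry exists for that key, nothing is reported.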

[ 02 ] Current numbers

benchmark	metric	y0-fast	y0-deep	notes
LongDoc-Faithful	claim accuracy vs. source, 80–120 page docs	91.2%	97.4%	Judged by y0-judge against frozen rubric rb_faith_v2; human-calibrated quarterly.
SchemaExact	valid-JSON extraction, strict schema match	96.8%	98.9%	Invoices, contracts, and forms; a single wrong field fails the whole case.
PlanBench-Y0	plan executes within declared max_steps	78.5%	93.1%	Multi-step operational tasks; failure includes both overrun and stall.
GraphRecall@5	retrieval precision over 10k-item private corpora	88.3%	88.3%	Retrieval layer is shared; measured with y0-embed-l and hybrid filters.
AgentComplete	end-to-end task completion, scopes enforced	71.9%	86.2%	Includes correctly halting at approval gates; partial credit not awarded.
TraceReplay	byte-identical trace replay across releases	100%	100%	A regression here blocks release unconditionally; it has fired twice.
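The notes column describes all-or-nothing scoring, which the following minimal sketch tries to make concrete. It is not the actual suite: every function name and every sample case is hypothetical, and the equality check is a simplification of whatever "strict schema match" means in the real harness.

```python
import json


def schema_exact_pass(expected: dict, actual_json: str) -> bool:
    """All-or-nothing: invalid JSON or a single differing field fails the case.
    Simplification: uses Python equality, so key order is ignored."""
    try:
        actual = json.loads(actual_json)
    except json.JSONDecodeError:
        return False
    return actual == expected


def plan_outcome(steps_used: int, max_steps: int, finished: bool) -> str:
    """Classify a plan run; overrun and stall both count as failure."""
    if steps_used > max_steps:
        return "overrun"
    if not finished:
        return "stall"
    return "pass"


def replay_identical(trace_a: bytes, trace_b: bytes) -> bool:
    """Byte-identical means exact equality of the raw bytes."""
    return trace_a == trace_b


def pass_rate(results: list) -> float:
    """Percentage of passing cases, rounded to one decimal place."""
    if not results:
        raise ValueError("no cases to score")
    return round(100.0 * sum(results) / len(results), 1)


# Hypothetical cases: one exact match, one wrong field, one truncated output.
expected = {"invoice_id": "A-17", "total": 120.5}
outputs = [
    '{"invoice_id": "A-17", "total": 120.5}',
    '{"invoice_id": "A-17", "total": 120.0}',
    '{"invoice_id": "A-17", ',
]
results = [schema_exact_pass(expected, out) for out in outputs]
print(results, pass_rate(results))
```

Because no partial credit is given, the second case scores the same as the third even though only one field is wrong, which is why strict metrics of this kind read lower than field-level accuracy would.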
