Y0 Benchmarks

The numbers we hold ourselves to, published as measured.

[ 01 ] Spec sheet

status: available

This page is not a model — it is the scoreboard. Mynd maintains an internal benchmark suite that every Y0 release must clear before promotion, and we publish the current results here rather than quoting the public leaderboards everyone has learned to discount. The suite is built from the work the platform actually does: long-document faithfulness, schema-exact extraction, plan quality under enforced step ceilings, retrieval precision over realistic private corpora, and end-to-end agent task completion with scope checks on. Two columns matter: y0-fast, the interactive profile most requests use, and y0-deep, the deliberate profile behind reasoning and agent planning. Numbers are re-measured on every release by the evaluation family, judged against frozen rubrics, and the history is kept — when a number moves, the changelog says why. Read the notes column; a benchmark without its caveat is an advertisement.

[ 02 ] Current numbers

| benchmark | metric | y0-fast | y0-deep | notes |
| --- | --- | --- | --- | --- |
| LongDoc-Faithful | claim accuracy vs. source, 80–120 page docs | 91.2% | 97.4% | Judged by y0-judge against frozen rubric rb_faith_v2; human-calibrated quarterly. |
| SchemaExact | valid-JSON extraction, strict schema match | 96.8% | 98.9% | Invoices, contracts, and forms; a single wrong field fails the whole case. |
| PlanBench-Y0 | plan executes within declared max_steps | 78.5% | 93.1% | Multi-step operational tasks; failure includes both overrun and stall. |
| GraphRecall@5 | retrieval precision over 10k-item private corpora | 88.3% | 88.3% | Retrieval layer is shared; measured with y0-embed-l and hybrid filters. |
| AgentComplete | end-to-end task completion, scopes enforced | 71.9% | 86.2% | Includes correctly halting at approval gates; partial credit not awarded. |
| TraceReplay | byte-identical trace replay across releases | 100% | 100% | A regression here blocks release unconditionally; it has fired twice. |
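
Two of the metrics above are simple enough to illustrate. The sketch below is not the suite's scoring code, which this page does not show; it is a minimal Python illustration under stated assumptions. It assumes SchemaExact compares each extracted record to an expected record and passes only on an exact match (so one wrong field fails the case), and that the @5 figure is the share of the top five retrieved items that are relevant. The function names `case_passes`, `schema_exact_score`, and `precision_at_k` are hypothetical.

```python
from __future__ import annotations

from typing import Any, Mapping, Sequence


def case_passes(expected: Mapping[str, Any], extracted: Mapping[str, Any]) -> bool:
    """All-or-nothing check: every field must match, with no missing or extra keys.

    Assumption: plain equality stands in for whatever schema validation
    the real suite performs.
    """
    return dict(expected) == dict(extracted)


def schema_exact_score(
    cases: Sequence[tuple[Mapping[str, Any], Mapping[str, Any]]],
) -> float:
    """Fraction of (expected, extracted) pairs that match exactly. No partial credit."""
    if not cases:
        raise ValueError("no cases to score")
    passed = sum(1 for expected, extracted in cases if case_passes(expected, extracted))
    return passed / len(cases)


def precision_at_k(retrieved: Sequence[str], relevant: set[str], k: int = 5) -> float:
    """Share of the top-k retrieved ids that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item_id in retrieved[:k] if item_id in relevant)
    return hits / k


if __name__ == "__main__":
    invoice = {"total": 10, "currency": "EUR"}
    cases = [
        (invoice, {"total": 10, "currency": "EUR"}),  # exact match: passes
        (invoice, {"total": 10, "currency": "USD"}),  # one wrong field: whole case fails
    ]
    print(schema_exact_score(cases))
    print(precision_at_k(["a", "b", "c", "d", "e"], {"a", "c", "x"}))
```

Scoring at the case level rather than the field level is what makes the SchemaExact note matter: a record with nineteen correct fields and one wrong one counts the same as a record with none correct.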