Model behavior

boundaries enforced by code, not by tone

We do not train frontier models; we operate them inside a runtime. That makes our behavior problem different from a lab's: we cannot change what a model is, but we fully control what it can touch and what its output is allowed to do. Our position is that behavioral boundaries enforced by architecture beat behavioral boundaries requested in a system prompt. A prompt is a wish; a scope check is a wall. Where behavior cannot be walled — tone, judgment, what the assistant chooses to surface — we write it down as a spec, test it like a feature, and accept that residual unpredictability is a property of the medium we work in. The user is told what is guaranteed and what is merely intended, and the two are never blurred.

[ what we actually run ]

[01]

Architecture over instruction

Every behavioral guarantee we make is enforced outside the model: scope checks in the trust kernel, output filters on consequential actions, deterministic execution for anything irreversible. System prompts shape behavior; they are never the load-bearing control.

[02]

A written behavior spec, tested in CI

How Y0 addresses the user, when it hedges, when it refuses, when it escalates to a human decision — all of it is specified in a document and exercised by behavioral test cases that run like any other test suite. When behavior and spec diverge, one of them is changed deliberately.

[03]

Refusal with a reason

When the runtime declines a request — out of scope, out of budget, out of capability — it says which boundary was hit. Unexplained refusals train users to rephrase until something works, which is exactly the behavior an injection attacker exploits.

[04]

No simulated certainty

Outputs derived from retrieval cite their sources; outputs that are generated judgment are labelled as judgment. The product never dresses a guess in the typography of a fact — this is checked in the same rubrics that score hallucination.

[ open questions — honestly ]

  • Upstream model updates can shift tone and judgment in ways our behavioral tests only partially catch. We measure what we thought to encode; the misalignment we did not anticipate is, by construction, not in the suite.
  • An assistant that is too agreeable quietly transfers decisions from the user to the model. We want Y0 to push back when the user is about to do something they will regret — and we do not yet know how to specify 'appropriate disagreement' precisely enough to test it.