Replayable runs: determinism contracts for agent execution graphs
Abstract
Agent systems are notoriously hard to debug because identical requests produce divergent traces. We describe the determinism contract used by the Y0 runtime: every run is compiled to a static execution graph before any model is invoked, all sources of nondeterminism are reified as recorded edge values, and replay is defined as graph re-evaluation against the recorded frontier. The contract turns incident response from archaeology into diffing.
Problem
A typical agent run interleaves model sampling, tool I/O, clock reads, and concurrent sub-agents. Each is a nondeterminism source, so a failing run cannot be reproduced and a fixed run cannot be verified. Logging more does not help; logs describe what happened without constraining what happens next time.
The contract
The runtime compiles every request into an execution graph before invocation. Nodes are pure transforms; anything impure — a model call, a tool invocation, a timestamp — must occur on an edge, and every edge value is recorded at execution time into the run's frontier. The contract is then a single sentence: re-evaluating the graph against a recorded frontier must reproduce the run bit-for-bit, on any machine, at any later date.
graph run-84f2 n0 parse(request) pure e1 model.call(n0) -> rec frontier[1] n1 plan(e1) pure e2 tool.cal_read(n1) -> rec frontier[2] n2 brief(e1, e2) pure replay(graph, frontier) == output ✓ required
Consequences
Three properties fall out. First, debugging becomes diffing: replay a run with one frontier value substituted and the first divergent node is the cause. Second, regression testing becomes free — recorded frontiers are fixtures, and a runtime upgrade is validated by replaying last month's traffic. Third, the recorded frontier is exactly the surface the trust kernel must audit, so determinism and auditability share one mechanism rather than two.
The cost is expressiveness: control flow that depends on unrecorded ambient state is rejected at compile time. In eight months of production we have found this restriction is the feature — every rejected graph so far has been a latent reproducibility bug.
cite as: Mynd Labs Research Note R-001 (2026)