Safety research

publish what would have saved us a month

Our research program is small and applied: we study the failure modes of the system we actually run, and we publish what we learn. The bar for publishing is concrete — would this note have saved us a month if someone else had written it first? We are honest about what we are not: we are not a frontier alignment lab, we do not work on existential-scale questions, and pretending otherwise would be costume. Where our scale gives us a real edge is in the unglamorous middle layer — how injection behaves in real document pipelines, how task-scoped permissions hold up against chained workflows, how retention metrics interact with trust. That is the layer most production assistants live in and almost nobody documents, so that is where we dig.

[ what we actually run ]

[01]

Public research notes from production failures

When a safety mechanism fails or surprises us, the writeup is published as a research note — mechanism, failure, fix, residual risk — with enough detail to be reproduced. The embarrassing parts stay in, because that is where the information is.

[02]

The injection corpus as a research artifact

Our versioned injection corpus is studied, not just executed: we classify payloads by mechanism, track which classes our defenses lose ground against over time, and document the taxonomy publicly so smaller teams do not have to rediscover it.
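
As an illustration only, the per-class tracking described above could be sketched as follows. The record layout, the mechanism class names, and the version labels are all hypothetical, not the actual corpus format. The sketch computes the fraction of payloads blocked per mechanism class and corpus version, then lists the classes whose block rate dropped between two versions.

```python
from collections import defaultdict

# Hypothetical records: (corpus_version, mechanism_class, blocked).
# Class names and versions are illustrative, not the real taxonomy.
RESULTS = [
    ("v1", "instruction-override", True),
    ("v1", "instruction-override", True),
    ("v1", "hidden-markup", True),
    ("v1", "hidden-markup", False),
    ("v2", "instruction-override", True),
    ("v2", "instruction-override", False),
    ("v2", "hidden-markup", True),
    ("v2", "hidden-markup", True),
]


def block_rates(results):
    """Return {mechanism_class: {corpus_version: fraction of payloads blocked}}."""
    # Each cell holds [blocked_count, total_count].
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for version, mechanism, blocked in results:
        cell = counts[mechanism][version]
        cell[0] += int(blocked)
        cell[1] += 1
    return {
        mechanism: {v: b / t for v, (b, t) in versions.items()}
        for mechanism, versions in counts.items()
    }


def aging_classes(rates, old, new):
    """List classes whose block rate fell from version `old` to version `new`."""
    return sorted(
        mechanism
        for mechanism, by_version in rates.items()
        if old in by_version
        and new in by_version
        and by_version[new] < by_version[old]
    )


rates = block_rates(RESULTS)
print(aging_classes(rates, "v1", "v2"))
```

With the sample data, only the class whose block rate fell between versions is reported. A real analysis would need enough payloads per class for the rate to be meaningful.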

[03]

Trust-kernel instrumentation as a dataset

The kernel's scope-check logs — what was requested, what was granted, what was denied — form a longitudinal record of how a permissioned runtime behaves under real workloads. We analyze it for drift and near-misses, and publish the aggregate patterns.
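
To make the idea of analyzing scope-check logs for drift concrete, here is a minimal sketch under stated assumptions. The `ScopeCheck` record, its fields, the weekly bucketing, and the drift threshold are all hypothetical. The source does not describe the kernel's log schema or how drift is defined, so this treats drift as a large week-over-week change in denial rate and uses the most frequently denied scopes as a rough proxy for near-misses.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopeCheck:
    """One hypothetical log entry: a scope was requested and granted or denied."""
    week: int
    task: str
    scope: str
    granted: bool


def denial_rate_by_week(log):
    """Return {week: fraction of scope checks denied in that week}."""
    total, denied = Counter(), Counter()
    for entry in log:
        total[entry.week] += 1
        if not entry.granted:
            denied[entry.week] += 1
    return {week: denied[week] / total[week] for week in sorted(total)}


def drift(rates, threshold=0.2):
    """Return consecutive week pairs whose denial rate differs by more than threshold."""
    weeks = sorted(rates)
    return [
        (a, b) for a, b in zip(weeks, weeks[1:])
        if abs(rates[b] - rates[a]) > threshold
    ]


def most_denied_scopes(log, n=3):
    """Return the n most frequently denied scopes with their counts."""
    return Counter(e.scope for e in log if not e.granted).most_common(n)


# Illustrative data only: weeks 1 and 2 deny 1 of 4 checks, week 3 denies 3 of 4.
LOG = (
    [ScopeCheck(w, "summarize", "read:docs", True) for w in (1, 1, 1, 2, 2, 2, 3)]
    + [ScopeCheck(w, "summarize", "send:email", False) for w in (1, 2, 3, 3, 3)]
)

rates = denial_rate_by_week(LOG)
print(rates)
print(drift(rates))
print(most_denied_scopes(LOG))
```

Publishing only aggregates like these rates and counts, rather than raw entries, fits the stated practice of sharing aggregate patterns.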

[04]

Replication before novelty

Before we claim a finding, we replicate it across model providers and across task families. A result that only holds for one model is a configuration note, not a finding, and we label it accordingly.
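
The labeling rule above can be sketched as a small function. The thresholds, the provider and task-family names, and the shape of the `runs` mapping are assumptions made for illustration. The source says only that a result holding for one model is a configuration note, so requiring at least two providers and two task families is one possible reading, not the documented criterion.

```python
def label_result(runs, min_providers=2, min_families=2):
    """Label a result from replication runs.

    `runs` maps (provider, task_family) -> True if the result reproduced there.
    The default thresholds are illustrative assumptions.
    """
    providers = {provider for (provider, _), ok in runs.items() if ok}
    families = {family for (_, family), ok in runs.items() if ok}
    if len(providers) >= min_providers and len(families) >= min_families:
        return "finding"
    return "configuration note"


# Reproduced on one hypothetical provider only.
single_model = {
    ("provider-a", "summarization"): True,
    ("provider-a", "extraction"): True,
    ("provider-b", "summarization"): False,
    ("provider-b", "extraction"): False,
}

# Reproduced on two hypothetical providers and two task families.
replicated = {
    ("provider-a", "summarization"): True,
    ("provider-a", "extraction"): True,
    ("provider-b", "summarization"): True,
    ("provider-b", "extraction"): False,
}

print(label_result(single_model))
print(label_result(replicated))
```

A stricter variant could require the result to reproduce in every tested combination. Which rule is appropriate depends on how much variance across providers is acceptable.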

[ open questions — honestly ]

  • Publishing detailed injection taxonomies helps defenders and attackers alike. We currently publish mechanisms but hold back working payloads, and we are not certain that line is drawn in the right place.
  • Applied safety research at one company generalizes only as far as that company's architecture. We do not know which of our findings are facts about permissioned runtimes and which are facts about ours.