Trust — Incidents/incidents

The complete history, with what broke, how long it took us to notice, and what changed afterward. Severity machinery lives at /security/incident-response.

[ 01 ]History

[ INC-003 ]2026-05-28Gateway38 min✓ resolved

Elevated 5xx errors at the API edge

[ postmortem ]

A deploy shipped with a connection-pool ceiling sized for staging, not production. Under normal evening traffic the pool exhausted within minutes and the edge returned 502s for roughly a third of requests. We rolled back in 9 minutes; the remaining time was queue drain. The fix was boring and real: pool limits are now part of config review, and a synthetic load check runs against every release candidate before it can promote. No data was lost and no runs were double-executed.

[ INC-002 ]2026-04-22Context Graph1h 12m✓ resolved

Slow retrieval during index rebuild

[ postmortem ]

A scheduled index rebuild took a write lock that our read path was not supposed to wait on — but a regression three weeks earlier had quietly re-coupled them. Retrieval latency climbed from ~80ms to multi-second for just over an hour. We aborted the rebuild, restored read throughput, and re-ran the rebuild against a shadow index overnight. The regression had shipped without a test; that test now exists, and rebuilds run shadow-first as standard practice.

[ INC-001 ]2026-03-15Y0 Runtime26 min✓ resolved

Run queue stall from a bad scheduler canary

[ postmortem ]

A canary build of the run scheduler deadlocked on a rare retry path and the canary slice of the queue stopped draining. Detection took longer than it should have — 11 minutes — because our alert measured throughput, not queue age. We killed the canary, the queue drained, and every stalled run completed without manual intervention. Two changes followed: queue-age alerting with a 90-second threshold, and canaries now auto-revert on stall instead of waiting for a human.

[ 02 ]What changed

Each postmortem produces actions with owners. The standing list from the incidents above:

[01]

Pool limits are part of config review; a synthetic load check runs against every release candidate before it can promote.

[02]

Index rebuilds run shadow-first as standard practice, and the read/write coupling regression now has a test.

[03]

Queue-age alerting with a 90-second threshold — we alert on queue age and error ratios, not just throughput.

[04]

Canaries auto-revert on stall instead of waiting for a human.

EVERY INCIDENT,EVERY POSTMORTEM.

Elevated 5xx errors at the API edge

Slow retrieval during index rebuild

Run queue stall from a bad scheduler canary