Incident response

Severity matrix and communication commitments

Incidents are not hypothetical here — we have had three, and the postmortems are public. This page is the machinery: how severity is assigned, what response each level guarantees, and what we commit to telling you while it is happening.

How an incident runs

Detection comes from automated alerting (we alert on queue age and error ratios, not just throughput, a lesson INC-001 taught us), from external probes, or from reports. The responder declares severity from the matrix below, opens an incident channel, and the clock starts on the communication commitments. Mitigation beats diagnosis: we roll back first and understand second. Every incident at SEV-3 or more severe (SEV-1 through SEV-3) gets a written postmortem; SEV-1 and SEV-2 postmortems are published at /incidents, including detection times that embarrass us.
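To make the alerting point concrete, here is a minimal sketch of a check that looks at queue age and error ratio rather than throughput. The class name, field names, and both thresholds are hypothetical placeholders chosen for illustration; they are not our production values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSample:
    """One monitoring window. Field names are hypothetical."""
    oldest_item_age_s: float  # age of the oldest unprocessed item, in seconds
    requests: int             # requests observed in the window
    errors: int               # failed requests in the window


# Illustrative thresholds only (assumptions, not real settings).
MAX_QUEUE_AGE_S = 300.0
MAX_ERROR_RATIO = 0.05


def alert_reasons(sample: QueueSample) -> list[str]:
    """Return the names of the conditions that should raise an alert."""
    reasons = []
    # A queue can keep healthy throughput while its oldest item goes stale.
    if sample.oldest_item_age_s > MAX_QUEUE_AGE_S:
        reasons.append("queue_age")
    # Guard against division by zero on an idle window.
    if sample.requests > 0 and sample.errors / sample.requests > MAX_ERROR_RATIO:
        reasons.append("error_ratio")
    return reasons
```

A window with normal request volume but a ten-minute-old item at the head of the queue would trip `queue_age` here, which a throughput-only alert would not catch.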

Communication commitments

  • Status acknowledged publicly within 30 minutes of a confirmed SEV-1 or SEV-2.
  • Updates at least every 60 minutes while a SEV-1 or SEV-2 is open — even when the update is 'still investigating'.
  • If customer data was accessed or exposed, affected customers are notified directly within 72 hours, and earlier where contracts or law require it.
  • A public postmortem within 10 working days of resolution for SEV-1 and SEV-2, including what changed afterward.
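The commitments above reduce to a few deadline calculations. The sketch below is one way to compute them; the function names are hypothetical, and it makes two assumptions the text does not state: the 72-hour notification clock starts at confirmation, and "working days" means Monday to Friday with no holiday calendar.

```python
from datetime import datetime, timedelta

ACK_WITHIN = timedelta(minutes=30)        # public acknowledgement, SEV-1/SEV-2
UPDATE_EVERY = timedelta(minutes=60)      # maximum gap between updates
DATA_NOTICE_WITHIN = timedelta(hours=72)  # direct notice if data was exposed
POSTMORTEM_WORKING_DAYS = 10              # public postmortem after resolution


def add_working_days(start: datetime, days: int) -> datetime:
    """Advance by `days` Monday-Friday days. Holidays are ignored (assumption)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def comms_deadlines(confirmed_at: datetime, last_update_at: datetime,
                    data_exposed: bool) -> dict:
    """Deadlines that apply while a SEV-1 or SEV-2 is open."""
    deadlines = {
        "public_ack_by": confirmed_at + ACK_WITHIN,
        "next_update_by": last_update_at + UPDATE_EVERY,
    }
    if data_exposed:
        # Assumption: the 72-hour clock starts at confirmation.
        # Contracts or law may require an earlier notice.
        deadlines["customer_notice_by"] = confirmed_at + DATA_NOTICE_WITHIN
    return deadlines


def postmortem_due(resolved_at: datetime) -> datetime:
    """Latest publication time for a SEV-1 or SEV-2 postmortem."""
    return add_working_days(resolved_at, POSTMORTEM_WORKING_DAYS)
```

For example, an incident resolved on a Friday has its postmortem due two Fridays later, because the intervening weekends do not count.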

After the incident

Each postmortem produces 'what changed' actions with owners. The three incidents to date produced pool-limit config review, synthetic load checks on release candidates, shadow-first index rebuilds, queue-age alerting, and auto-reverting canaries. An action item that is still open at the next incident review is treated as a new incident in slow motion.

Severity matrix

Level | Definition | Response | Public comms
SEV-1 | Full outage of a core service, or any confirmed unauthorized access to customer data | Immediate page, all-hands until mitigated | Within 30 min, updates every 60 min, direct customer notice if data is involved
SEV-2 | Partial outage or major degradation: errors or latency affecting a significant share of requests | Immediate page, dedicated responder until resolved | Within 30 min, updates every 60 min
SEV-3 | Minor degradation with workaround, or a single non-core service impaired | Triaged same working day | Status page note; postmortem internal unless customer-visible
SEV-4 | Cosmetic faults, isolated errors, near-misses worth recording | Tracked and batched | None; recorded in the internal log
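
The matrix above can also be held as data, so that tooling reads the same commitments this page states. This is a sketch, not our internal schema: the class and field names are hypothetical, and the values are transcribed from the matrix and the postmortem rules on this page.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    SEV1 = 1  # lower number means more severe
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class Policy:
    """What one severity level guarantees. Field names are hypothetical."""
    immediate_page: bool
    public_ack_minutes: Optional[int]       # None: no timed public acknowledgement
    update_interval_minutes: Optional[int]  # None: no fixed update cadence
    written_postmortem: bool
    postmortem_public_by_default: bool


MATRIX = {
    Severity.SEV1: Policy(True, 30, 60, True, True),
    Severity.SEV2: Policy(True, 30, 60, True, True),
    # SEV-3: status page note; postmortem stays internal unless customer-visible.
    Severity.SEV3: Policy(False, None, None, True, False),
    # SEV-4: recorded in the internal log only.
    Severity.SEV4: Policy(False, None, None, False, False),
}


def policy_for(severity: Severity) -> Policy:
    """Look up the commitments attached to a declared severity."""
    return MATRIX[severity]
```

Deciding which level applies still takes a responder's judgment; the lookup only records what follows once a level is declared.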