Incident response
severity matrix + comms commitments
Incidents are not hypothetical here — we have had three, and the postmortems are public. This page is the machinery: how severity is assigned, what response each level guarantees, and what we commit to telling you while it is happening.
How an incident runs
Detection comes from automated alerting (we alert on queue age and error ratios, not just throughput, a lesson INC-001 taught us), from external probes, or from reports. The responder declares severity from the matrix below and opens an incident channel, and the clock starts on the communication commitments. Mitigation beats diagnosis: we roll back first and understand second. Every incident at SEV-3 or more severe (SEV-1 through SEV-3) gets a written postmortem; SEV-1 and SEV-2 postmortems are published at /incidents, including detection times that embarrass us.
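As a rough illustration of alerting on queue age and error ratio rather than throughput alone, here is a minimal Python sketch. The thresholds, the Sample fields, and the alert_reasons function are hypothetical names and values chosen for the example; they are not our production rules.

```python
from dataclasses import dataclass

# Hypothetical thresholds, for illustration only.
MAX_QUEUE_AGE_SECONDS = 300.0   # age of the oldest item still waiting in the queue
MAX_ERROR_RATIO = 0.05          # errors divided by total requests in the window


@dataclass
class Sample:
    """One observation window of (assumed) metrics."""
    oldest_queued_age_seconds: float
    requests: int
    errors: int


def alert_reasons(sample: Sample) -> list[str]:
    """Return the names of the signals that breach their threshold."""
    reasons = []
    if sample.oldest_queued_age_seconds > MAX_QUEUE_AGE_SECONDS:
        reasons.append("queue_age")
    # Guard against division by zero when the window saw no traffic.
    if sample.requests > 0 and sample.errors / sample.requests > MAX_ERROR_RATIO:
        reasons.append("error_ratio")
    return reasons
```

In this sketch neither check looks at throughput, so a window with plenty of successful requests can still alert if the oldest queued item has waited too long.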
Communication commitments
- Status acknowledged publicly within 30 minutes of a confirmed SEV-1 or SEV-2.
- Updates at least every 60 minutes while a SEV-1 or SEV-2 is open — even when the update is 'still investigating'.
- If customer data was accessed or exposed, affected customers are notified directly within 72 hours, and earlier where contracts or law require it.
- A public postmortem within 10 working days of resolution for SEV-1 and SEV-2, including what changed afterward.
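The commitments above reduce to a few deadline calculations, sketched below in Python. The function names are hypothetical, and two details are assumptions the list does not state: "working days" is taken to mean Monday to Friday with no holiday calendar, and the 72-hour window is counted from the moment exposure is confirmed.

```python
from datetime import datetime, timedelta

ACK_WINDOW = timedelta(minutes=30)        # public acknowledgement, SEV-1 and SEV-2
UPDATE_INTERVAL = timedelta(minutes=60)   # maximum gap between updates while open
DATA_NOTIFY_WINDOW = timedelta(hours=72)  # direct notice to affected customers
POSTMORTEM_WORKING_DAYS = 10              # public postmortem after resolution
PUBLIC_COMMS_SEVERITIES = {"SEV-1", "SEV-2"}


def has_public_comms(severity: str) -> bool:
    """True when the severity carries the public communication commitments."""
    return severity in PUBLIC_COMMS_SEVERITIES


def add_working_days(start: datetime, days: int) -> datetime:
    """Advance by whole days, counting only Monday to Friday (assumption: no holidays)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            remaining -= 1
    return current


def ack_deadline(confirmed_at: datetime) -> datetime:
    return confirmed_at + ACK_WINDOW


def next_update_due(last_update_at: datetime) -> datetime:
    return last_update_at + UPDATE_INTERVAL


def data_notification_deadline(exposure_confirmed_at: datetime) -> datetime:
    # Assumption: the clock starts when exposure is confirmed.
    # Contracts or law may require an earlier notice.
    return exposure_confirmed_at + DATA_NOTIFY_WINDOW


def postmortem_deadline(resolved_at: datetime) -> datetime:
    return add_working_days(resolved_at, POSTMORTEM_WORKING_DAYS)
```

For example, under these assumptions an incident resolved on Monday 4 March 2024 at 12:00 would have its postmortem due by Monday 18 March 2024 at 12:00.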
After the incident
Each postmortem produces 'what changed' actions, each with an owner. The three incidents to date produced five: a pool-limit config review, synthetic load checks on release candidates, shadow-first index rebuilds, queue-age alerting, and auto-reverting canaries. An action item that is still open at the next incident review is treated as a new incident in slow motion.
[ Severity matrix ]