Fixing Broken Postmortems: Lessons from WePay

#sre #incident-management #postmortems

During my time at WePay, I tackled a problem that plagues many engineering organizations: repeated incidents of the same type. The diagnosis? A broken postmortem process: the way we sit down after an incident and determine how to keep it from happening again.

Here’s the framework I used to fix it.

1. Deal with the incidents first

If you never get a chance to breathe, or if each team is handling multiple customer-facing incidents per week, you’ve got problems that no single meeting can resolve. The same applies if your DevOps/SRE teams are entirely in reactive mode. Stabilize first, then fix the postmortem process.
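As a rough way to check whether a team has crossed that line, here is a minimal sketch that counts customer-facing incidents per team per ISO week. The log format, the sample data, and the threshold of two are assumptions for illustration, not anything we used at WePay.

```python
from collections import Counter
from datetime import date

# Hypothetical incident log: (team, date, customer_facing)
incidents = [
    ("payments", date(2024, 3, 4), True),
    ("payments", date(2024, 3, 6), True),
    ("payments", date(2024, 3, 7), False),
    ("onboarding", date(2024, 3, 5), True),
]

def overloaded_teams(log, threshold=2):
    """Return teams with at least `threshold` customer-facing incidents in any single ISO week."""
    per_week = Counter()
    for team, day, customer_facing in log:
        if customer_facing:
            iso = day.isocalendar()  # (ISO year, ISO week, weekday)
            per_week[(team, iso[0], iso[1])] += 1
    return sorted({team for (team, _, _), count in per_week.items() if count >= threshold})

print(overloaded_teams(incidents))
```

Any team that shows up here probably needs relief before a better postmortem meeting will make a difference.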

2. Move beyond “root cause”

You’d be lucky if there were a single “root cause” for your problem. There’s usually a proximate cause: someone pressed a button, a user exercised a code path, a system underwent routine maintenance. But the search for a root cause is often a way for organizations to assign blame and avoid making changes.

3. Make it blameless

The goal in SRE practice is a blameless postmortem. “Witch hunts” (surprise!) usually focus on the less experienced, less popular, more marginalized members of a technical team. In the same way, you need to short-circuit anyone’s attempt to be a martyr — “oh, I was dumb to press the button — I’m at fault.” No, that person pressed the button because everything was flashing PLEASE PRESS THE BUTTON and the rest of the screens were dead (metaphorically speaking).

4. Focus on the system

What’s missing? What behaved unexpectedly? Which of our assumptions were proven wrong? Why are people tired of responding to pages?

5. Document

If there is no archive of the problems your team has solved and overcome, how can people learn from them? Does your team depend on word-of-mouth in a remote or hybrid workplace?

6. Prioritize, then follow up

Each postmortem will likely produce a list of significant tasks. Some might be large, expensive efforts that could take months to resolve. The duty of SRE is to figure out which work gives the best return, and to clearly document and communicate the resulting tradeoffs. For example: “We’re waiting for solution X to arrive in three months, so during that time, performing Y remains risky and labor-intensive.”
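One simple way to make "best return for the work" concrete is to score each action item and rank by benefit per unit of effort. The sketch below is illustrative only: the `ActionItem` fields, the 1-5 benefit scale, and the ratio itself are assumptions, and a real team would tune or replace them.

```python
from dataclasses import dataclass

@dataclass
class ActionItem:
    title: str
    benefit: int           # assumed 1-5 score: how much risk or toil this removes
    effort_weeks: float    # rough estimate of engineering time
    interim_risk: str = "" # tradeoff to communicate while the item stays open

def prioritize(items):
    """Sort action items by benefit per week of effort, highest first."""
    return sorted(items, key=lambda item: item.benefit / item.effort_weeks, reverse=True)

# Hypothetical action items from a single postmortem
items = [
    ActionItem("Automate failover", 5, 12, "Manual failover remains risky and labor-intensive"),
    ActionItem("Add alert on queue depth", 3, 1),
    ActionItem("Update on-call runbook", 2, 0.5),
]

for item in prioritize(items):
    note = f" (until done: {item.interim_risk})" if item.interim_risk else ""
    print(f"{item.title}{note}")
```

The `interim_risk` field is there so the long-running items carry their tradeoff with them: anyone reading the list sees what stays risky while the work is pending.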

At WePay I ran the first 10–12 postmortems with these guidelines and we revamped the incident command procedures, on-call methods, and the way the engineering teams communicated. If this sounds familiar, I’ve been through it before and can help your team get to the other side.