
The Post-Mortem That Changes Nothing


The incident is over. The service is back up. The customers who noticed have been responded to. Someone has sent the internal message that says things are stable and a post-mortem will follow.

A week later, twelve people sit in a room or a video call. Someone shares a timeline of what happened. There is a discussion about root cause. Someone writes action items. The meeting ends. The document is filed somewhere.

Three months later, the same category of incident happens again.

This is not unusual. It is, in fact, the normal outcome of a post-mortem process in most engineering organisations. Not because the people involved are negligent or incompetent. Because the process is designed, almost universally, to do something other than prevent future incidents.

It’s designed to produce closure.

What closure actually is

Closure, in the incident context, means: we have an account of what happened that everyone agrees on, we have identified the thing that went wrong, and we have assigned someone to fix it. The incident is now resolved not just technically but narratively. We understand it. We have responded to it. We can move on.

This is psychologically necessary. Incidents are stressful. The people who were paged at 2am, who spent three hours debugging under pressure, who sent the message to customers explaining the outage — these people need to be able to put the incident down. Closure lets them do that.

The problem is that closure is not the same as learning. And the mechanisms that produce closure often actively prevent learning.

The blameless post-mortem, which has become the standard format, was a genuine improvement over what came before it. The version before it was a blame assignment session: what went wrong, who caused it, what will be done to that person. This was destructive for obvious reasons — it discouraged honesty, it identified individual error rather than systemic failure, and it produced fear rather than improvement.

Blameless post-mortems fixed the blame problem. They didn’t fix the closure problem. A blameless post-mortem still ends with a timeline, a root cause, and action items. The format optimises for producing a satisfying account of the incident. The satisfying account produces closure. The closure ends the inquiry.

The five whys problem

The five whys technique is the dominant root cause analysis framework in software engineering post-mortems. Ask why something happened, then ask why that thing happened, then ask why that thing happened, five times, until you reach something that feels fundamental.

The technique has a structural flaw: it produces a single causal chain.

Real incidents do not have single causal chains. They have multiple contributing factors — technical failures, process gaps, knowledge gaps, timing coincidences — that intersected to produce the outage. A single causal chain tells a coherent story. Coherent stories produce closure. But the coherent story is a simplification of what actually happened, and the parts that got simplified away are often the parts that will cause the next incident.

Consider: a service went down because a database query started timing out after a deployment. The five whys might produce: the query timed out because a missing index caused a full table scan; the index was missing because the migration script didn’t include it; the migration script didn’t include it because the developer didn’t know the query needed an index; the developer didn’t know because there was no automated query analysis in the deployment pipeline. Root cause: no query analysis tooling. Action item: add query analysis to the pipeline.

That’s a coherent chain. It’s also incomplete.

The same incident probably also involved: a staging environment that didn’t have production-representative data volumes, so the slow query wasn’t caught in testing; an alerting setup that triggered on error rate rather than latency, so the first alert came five minutes after users were already experiencing timeouts; an on-call rotation that had the least experienced engineer on call for that week; a deployment process that didn’t include a query plan review step for database migrations.

Each of these is a contributing factor. None of them appears in the five-why chain. The action item for query analysis is real and valuable. The four other things that contributed to the severity and duration of the incident are not addressed. They will contribute to the next incident.

What the action items actually are

Pull up the action items from the last five post-mortems your team produced. If you don’t have them readily accessible, that’s already informative.

Assuming you have them: how many are closed? Of the closed ones, how many were verified to actually prevent the failure mode they were addressing? Of the open ones, how many have a concrete owner, a concrete definition of done, and a deadline that has not passed?
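The audit questions above can be turned into a small script. This is a minimal sketch, not a prescribed tool: the ActionItem fields and the audit function are hypothetical names invented for illustration, and it assumes you can export your action items from wherever they are tracked into a list like this.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    # Hypothetical shape; adapt the fields to whatever your tracker exports.
    title: str
    closed: bool
    verified: bool = False  # was the fix shown to prevent the failure mode?
    owner: Optional[str] = None
    definition_of_done: Optional[str] = None
    due: Optional[date] = None


def audit(items: list[ActionItem], today: date) -> dict[str, int]:
    """Count how many items are closed, verified, and well-formed."""
    closed = [i for i in items if i.closed]
    open_items = [i for i in items if not i.closed]
    return {
        "total": len(items),
        "closed": len(closed),
        "closed_and_verified": sum(1 for i in closed if i.verified),
        "open": len(open_items),
        # Open items with an owner, a definition of done,
        # and a deadline that has not passed.
        "open_well_formed": sum(
            1
            for i in open_items
            if i.owner
            and i.definition_of_done
            and i.due is not None
            and i.due >= today
        ),
    }
```

Running audit over the items from the last five post-mortems gives the counts directly. The gap between "closed" and "closed_and_verified", and between "open" and "open_well_formed", is the uncomfortable part.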

In most organisations, the honest answers to these questions are uncomfortable. A significant fraction of post-mortem action items are never completed. A significant fraction of completed action items are not verified to prevent the thing they were addressing. A significant fraction of action items are vague enough that “done” is ambiguous.

This is not a failure of individual accountability. It’s a structural problem with how action items are generated.

Post-mortem action items are generated at the end of a meeting, under time pressure, by people who are tired and ready to close the incident. The pressure in the room is toward consensus and completeness, not toward rigor. Action items that feel concrete — “add monitoring for X,” “write a runbook for Y,” “fix the bug in Z” — get added quickly because they feel actionable. Action items that would address the harder systemic issues — “rethink our deployment process,” “change how we communicate schema changes between teams” — are harder to make concrete and often get deferred or softened into something vaguer.

The result is a list of action items that are tilted toward the easy and the specific, and away from the systemic. The easy specific things get done. The systemic issues remain. The next incident involves the same systemic issues with different specific triggers.

The learning that actually sticks

The post-mortems that produce lasting change share a characteristic that’s different from the ones that don’t: they’re designed around questions rather than answers.

A post-mortem designed around answers tries to determine what happened and what will be done about it. The output is a document with a timeline, a root cause, and action items. It is designed to be filed and referenced if the same incident happens again.

A post-mortem designed around questions tries to understand what the incident revealed about the system and the organisation. The output is a set of hypotheses about why the system is fragile in this way, tested against evidence from the incident. It is designed to produce changes to the system, not documentation of the failure.

The difference is subtle but consequential. Here’s what it looks like in practice.

Instead of: “What happened?” Ask: “What did we have to believe about the system for this to be surprising?”

This question is productive because incidents are almost always surprising. If the failure had been anticipated, there would have been a safeguard against it. The fact that the incident happened means someone’s mental model of the system was wrong. Identifying which mental model was wrong, and why it was wrong, is more valuable than describing the timeline of what happened.

Instead of: “What was the root cause?” Ask: “What conditions made this failure possible?”

This question resists the single causal chain problem. Conditions are plural. They include the technical failure, the process failure, the knowledge gap, the timing coincidence. Asking about conditions instead of root cause produces a more complete picture of the incident’s contributing factors.

Instead of: “What action items will prevent this from happening again?” Ask: “What would we have needed to detect this sooner, contain it faster, and resolve it more confidently?”

This question decomposes the prevention problem into three distinct subproblems, each with different solutions. Detection is an observability problem. Containment is an architecture and operations problem. Resolution confidence is a knowledge and tooling problem. Addressing all three produces a more resilient system than addressing only the direct cause.

Post-mortem question framework:

SURPRISE

What happened that we didn’t expect?
What did we believe about the system that turned out to be wrong?
Who knew things that others didn’t? Why?

CONDITIONS

What technical conditions made this failure possible?
What process conditions made it harder to detect or resolve?
What knowledge conditions meant we took longer to diagnose?
What were we not measuring that would have made this visible earlier?

RESPONSE

What slowed down detection?
What slowed down diagnosis?
What slowed down resolution?
At what point did we know enough to resolve it?
What happened between that point and actual resolution?

SYSTEMIC

Is this failure mode possible elsewhere in the system?
What similar incidents have we had? What do they have in common?
What would need to be true for this category of incident to not happen?

The two-week rule

Action items from post-mortems should have two properties that most of them don’t have: a concrete definition of done, and a completion date within two weeks.

Two weeks is not arbitrary. It’s the window within which the incident is recent enough that the people involved are still motivated by it. After two weeks, the incident has faded in urgency. It competes with sprint work. It gets deprioritised. An action item that isn’t completed within two weeks has a low probability of ever being completed, because the social pressure of the recent incident — the primary motivator for the work — has dissipated.

If an action item cannot be completed within two weeks, it’s not a post-mortem action item. It’s a project. It should be tracked as a project, with proper scoping and prioritisation, separate from the post-mortem output. Treating it as a post-mortem action item gives it the false appearance of urgency without the actual prioritisation to back it up.

The concrete definition of done is necessary because vague action items produce checkbox completion rather than actual change. “Add monitoring for database query latency” can be completed by adding a metric that nobody looks at, setting an alert threshold that never fires, or building a dashboard that surfaces the right information and gets integrated into the incident response process. Only the third version actually prevents the failure mode. The definition of done needs to distinguish between them.

Well-formed post-mortem action item:

WHAT: Add p99 latency alerting for all database queries exceeding 500ms.

DONE WHEN:

Prometheus metric db_query_duration_seconds exists with query label
Alert fires in staging when a test query exceeds threshold
Alert is connected to PagerDuty rotation
Runbook linked from alert includes query analysis steps

OWNER: [specific person, not team]
DUE: [specific date within two weeks]
VERIFY: Owner demos the alert firing in staging at next team sync.

The verification step is the one most often skipped. It’s also the most important. An action item that doesn’t include verification has no mechanism to confirm that the thing done actually addresses the failure mode. Verification closes the loop between the action and the outcome.
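One way to keep these properties from eroding is to check each action item against them before the post-mortem is closed. The sketch below is illustrative only: the ActionItem fields mirror the template above, the problems function and the teams set are hypothetical, and it assumes the two-week window is counted from the date of the post-mortem.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_WINDOW = timedelta(days=14)  # the two-week rule


@dataclass
class ActionItem:
    what: str
    done_when: list[str]  # concrete, checkable criteria
    owner: str            # a named person
    due: date
    verify: str           # how completion will be demonstrated


def problems(item: ActionItem, postmortem_date: date, teams: set[str]) -> list[str]:
    """Return the reasons an action item is not well-formed (empty if none)."""
    found = []
    if not item.done_when:
        found.append("no definition of done")
    # Assumption: team names are known up front so they can be rejected as owners.
    if not item.owner.strip() or item.owner in teams:
        found.append("owner must be a specific person, not a team")
    if item.due - postmortem_date > MAX_WINDOW:
        found.append("due more than two weeks out: track it as a project")
    if not item.verify.strip():
        found.append("no verification step")
    return found
```

An item that returns an empty list meets the template. One flagged for its due date is a candidate for the project backlog rather than the post-mortem list.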

The incident that teaches the most

The incidents that produce the most learning are not necessarily the most severe. A five-hour outage affecting all customers produces urgency and a full post-mortem, but it also produces noise — the pressure of a major incident fills the post-mortem with timeline details and mitigation discussion that crowds out the deeper systemic questions.

A thirty-minute partial degradation that affected ten percent of users is often more instructive. Small enough to be discussed without the emotional weight of a major incident, significant enough to reveal real system fragility, recent enough that the details are accurate.

The teams that learn fastest from incidents are the ones that treat small incidents with the same systematic inquiry as large ones. Not the same ceremony: a thirty-minute degradation doesn’t need a twelve-person meeting. But the same questions: what did we believe that turned out to be wrong, what conditions made this possible, what would have made it visible sooner.

This requires a cultural commitment that’s easy to describe and hard to maintain: treating incidents as information rather than as failures. The instinct after an incident, especially a public one, is to close it as quickly as possible — fix the immediate problem, write the post-mortem, file it, move on. The information in the incident is uncomfortable to sit with because it implies the system has more fragility than was visible before the incident.

That discomfort is exactly the information you need. The system was always that fragile. The incident made it visible. The post-mortem’s job is to make sure visibility produces change.

What the document is actually for

Post-mortem documents are treated as historical records. They get filed in a wiki or a doc management system and referenced if the same incident happens again. This is a reasonable use of the document. It’s not the primary use.

The primary use of a post-mortem document is as a forcing function for clear thinking during the post-mortem itself. Writing down “what we believed that turned out to be wrong” forces the team to identify specific mistaken beliefs rather than gesturing at a vague sense that things could have been better. Writing a concrete definition of done for action items forces the team to think through what “done” actually means, which is often where the hardest thinking happens.

The document is not the output. The changed system is the output. The document is the tool that produces the changed system, if it’s designed to do that rather than to produce a satisfying narrative.

A post-mortem process that produces good documents and unchanged systems is a post-mortem process that has optimised for the wrong thing.

The test is not whether the document is thorough. The test is whether the next incident of this category happens, and if it does, whether the team is better equipped to detect it sooner, contain it faster, and resolve it more confidently.

That test takes months to run. Most teams never run it. They measure the quality of post-mortems by the quality of the documents, which is measuring the input, not the outcome.
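Running the test can be as simple as comparing two incidents of the same category phase by phase. The following is a rough sketch under stated assumptions: the Incident record and its timestamps are hypothetical, and time-to-resolve stands in as a crude proxy for "resolving more confidently", which a timestamp cannot fully capture.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    category: str
    started: datetime    # when users were first affected
    detected: datetime   # when the team first knew
    contained: datetime  # when the impact stopped growing
    resolved: datetime   # when service was fully restored

    def phases(self) -> dict[str, timedelta]:
        return {
            "detect": self.detected - self.started,
            "contain": self.contained - self.detected,
            "resolve": self.resolved - self.contained,
        }


def compare(previous: Incident, latest: Incident) -> dict[str, timedelta]:
    """Per-phase change in duration; negative means the latest incident was faster."""
    if previous.category != latest.category:
        raise ValueError("compare incidents of the same category")
    before, after = previous.phases(), latest.phases()
    return {name: after[name] - before[name] for name in before}
```

A negative value for a phase means the team got faster there. A phase that did not improve points at the part of the earlier post-mortem that changed nothing.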

Measure the outcome. Run the test. Be honest about what it shows.

That’s the post-mortem that changes something.