Root Cause Analysis and Postmortem

"Root Cause Analysis" and "Postmortem" are two strategies used in problem-solving, usually in the fields of business, engineering, or software development. Both strategies are centred around identifying the cause of a problem or an issue, but they are used in different contexts and have slight differences in approach.

Root Cause Analysis (RCA)¶

This process is often used to identify the fundamental cause of a problem or fault. The goal is to find the 'root' cause and then address it, so that the problem does not recur. RCA typically involves a systematic approach where you look beyond the immediate causes and dig deeper into any underlying issues that led to the problem. Although it usually starts from a problem that has already occurred, it can also be applied proactively to improve business processes and reduce errors. The RCA process can utilise several techniques, such as the "5 Whys" method, "Fishbone" or Ishikawa diagrams, and fault-tree analysis.

People Involved in RCA¶

The people involved in the RCA process would typically include those who are knowledgeable about the problem or the process in which the problem occurs. This could include team members involved in the day-to-day work, supervisors, process owners, or quality assurance team members. It is also helpful to include a diverse group of people who can bring different perspectives to the process.

Methods for Conducting RCA¶

Several methods and tools can be used to conduct a Root Cause Analysis, and the choice often depends on the complexity and nature of the problem. Here are a few examples:

5 Whys¶

This simple technique involves asking "Why?" repeatedly until the root cause is identified. Each "why" typically peels away a layer of the problem.
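As a rough illustration, a "5 Whys" session can be recorded as a chain in which each answer becomes the subject of the next question. The sketch below is a minimal example, not a standard tool: the function name `five_whys` and the incident details are made up for illustration.

```python
def five_whys(problem, answers):
    """Link each effect to the answer given when asking "why?" about it.

    `answers` is a hypothetical, ordered list of answers collected during
    the session; the last answer is treated as the candidate root cause.
    """
    chain = []
    current = problem
    for answer in answers:
        chain.append((current, answer))  # (effect, cause)
        current = answer
    return chain, current


# Illustrative, made-up incident
chain, root_cause = five_whys(
    "the website was down",
    [
        "the web server ran out of disk space",
        "log files were never rotated",
        "log rotation was not part of the server setup checklist",
    ],
)

for effect, cause in chain:
    print(f"Why was it that {effect}? Because {cause}.")
print("Candidate root cause:", root_cause)
```

The number of questions does not have to be exactly five; the chain stops when the answer points at something the team can actually change.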

Fishbone Diagram (also known as Ishikawa or Cause-and-Effect Diagram)¶

This tool helps to visually display the potential causes of a problem. The problem is identified on the right, and the main causes are identified as branches from the "spine" of the fish, with potential sub-causes branching off the main causes.
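A fishbone diagram is essentially a problem statement with causes grouped under a few main categories, so it can be captured as a simple mapping before anyone draws it. The sketch below is only an illustration: the category names, the causes, and the helper `render_fishbone` are assumptions, not part of any standard.

```python
def render_fishbone(problem, categories):
    """Return a text outline: the problem, then each main cause and its sub-causes."""
    lines = [f"Problem: {problem}"]
    for category, causes in categories.items():
        lines.append(f"  {category}")
        for cause in causes:
            lines.append(f"    - {cause}")
    return "\n".join(lines)


# Illustrative, made-up example
outline = render_fishbone(
    "Checkout requests time out",
    {
        "People": ["on-call engineer unfamiliar with the service"],
        "Process": ["no load test before release"],
        "Technology": ["connection pool too small", "missing database index"],
    },
)
print(outline)
```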

Fault Tree Analysis (FTA)¶

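This is a top-down, deductive failure analysis in which an undesired state of a system (the top event) is analyzed by breaking it down into the lower-level events that can cause it, combined using Boolean logic.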
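Fault trees commonly combine lower-level events through AND and OR gates, which makes them easy to model in code. The following is a minimal sketch under that assumption; the class names `BasicEvent` and `Gate` and the example events are hypothetical, and real FTA tools also handle probabilities and other gate types.

```python
from dataclasses import dataclass, field


@dataclass
class BasicEvent:
    """A leaf event that either occurred or did not."""
    name: str
    occurred: bool = False

    def evaluate(self) -> bool:
        return self.occurred


@dataclass
class Gate:
    """An intermediate or top event that combines its children with AND or OR."""
    name: str
    kind: str  # "AND" or "OR"
    children: list = field(default_factory=list)

    def evaluate(self) -> bool:
        results = [child.evaluate() for child in self.children]
        if self.kind == "AND":
            return all(results)
        if self.kind == "OR":
            return any(results)
        raise ValueError(f"unknown gate kind: {self.kind}")


# Illustrative, made-up tree: an outage happens if power is lost
# (mains AND backup generator both fail) OR a faulty deployment goes out.
mains = BasicEvent("mains power failure", occurred=True)
generator = BasicEvent("backup generator failure", occurred=False)
bad_deploy = BasicEvent("faulty deployment", occurred=False)

power_loss = Gate("power loss", "AND", [mains, generator])
outage = Gate("service outage", "OR", [power_loss, bad_deploy])

print(outage.evaluate())   # mains failed but the generator held
generator.occurred = True
print(outage.evaluate())   # both power sources failed
```

Working from the top event downwards like this shows which combinations of basic events are sufficient to cause the undesired state.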

Pareto Analysis¶

This technique is used to identify the causes of problems that need to be addressed first, based on the principle that a small number of causes often account for a large majority of problems.
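One way to apply this is to rank causes by how many problems each one accounts for and keep adding causes until a chosen share of all problems is covered. The sketch below assumes an 80% cut-off, which is a common convention rather than a rule, and uses made-up incident counts; the function name `pareto` is hypothetical.

```python
def pareto(cause_counts, threshold=0.8):
    """Return the top causes that together reach `threshold` of all problems.

    Each result is (cause, count, cumulative_share).
    """
    total = sum(cause_counts.values())
    if total == 0:
        return []
    ranked = sorted(cause_counts.items(), key=lambda item: item[1], reverse=True)
    vital_few = []
    cumulative = 0
    for cause, count in ranked:
        cumulative += count
        vital_few.append((cause, count, cumulative / total))
        if cumulative / total >= threshold:
            break
    return vital_few


# Illustrative, made-up incident counts per cause
incidents = {
    "configuration error": 45,
    "expired certificate": 30,
    "disk full": 15,
    "network partition": 6,
    "unknown": 4,
}

for cause, count, share in pareto(incidents):
    print(f"{cause:22s} {count:3d}  cumulative {share:.0%}")
```

With these numbers, three of the five causes account for 90% of incidents, so they would be addressed first.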

How to start a Root Cause Analysis¶

The process typically begins when a problem is identified. From there, a systematic set of steps is followed to uncover the root cause of that problem.

Define the Problem¶

Clearly define the problem in measurable terms. Understand the who, what, where, when, and how of the problem. Document all details related to the problem.

Gather Data¶

Collect all relevant data about the problem, including when it occurs, how often, its impact, and all circumstances surrounding its occurrence.

Identify Possible Causal Factors¶

Look at all the potential factors that might have contributed to the problem. Here, brainstorming with the team involved in the problem or process can help to identify all possible factors.

Identify the Root Cause(s)¶

Analyze the causal factors identified in the previous step to pinpoint the underlying root cause of the problem. Often a "why" analysis or cause-and-effect analysis can be helpful.

Implement Corrective Actions¶

Develop a plan to address and eliminate the root cause, then implement the plan.

Verify the Effectiveness of the Corrective Actions¶

Monitor the situation to ensure the plan is working and that the problem has been effectively resolved.

The goal of RCA is not to assign blame but to understand what happened, why it happened, and how it can be prevented from happening again. RCA is also rarely a one-off process: it is iterative and may need to be repeated as more information is discovered or if initial fixes don't fully resolve the problem. It's a key component of continuous improvement programs, and one best practice is to keep a lessons-learned list.

Postmortem¶

A postmortem is typically conducted after a project or event has concluded, or after a major problem has occurred, to analyze what happened, why it happened, and how to prevent similar issues in the future. It's often used in software development following a major issue like a service outage or a project that did not meet its goals. In this process, all aspects of the situation are examined in order to learn lessons and improve future performance. While a postmortem looks for root causes, it also includes a wider evaluation of what was done well, what could have been better, and how to improve for next time.

Postmortem Culture¶

Google's SREs view incidents and outages as inevitable given the scale and velocity of change in their systems. When an incident occurs, they fix the underlying issue, and services return to their normal operating conditions. However, unless there's a formalized process of learning from these incidents, they may recur indefinitely. Therefore, postmortems, which are written records of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring, are an essential tool for SRE. Postmortems are expected after any significant undesirable event, and writing a postmortem is not punishment—it's a learning opportunity for the entire company. Furthermore, postmortems are blameless, focusing on identifying the contributing causes of the incident without blaming any individual or team for bad or inappropriate behavior.

What is a Postmortem procedure?¶

A postmortem is a process for understanding and documenting what caused a failure in a system. The term is sometimes used interchangeably with Root Cause Analysis (RCA), although a postmortem usually covers more than the root cause alone. The goal of the postmortem process is to learn from the failure and prevent it from happening again. It should not be a process for assigning blame. Instead, it should focus on what went wrong, why it went wrong, how to fix it, and how to prevent the same problem from recurring.

People Involved in a Postmortem¶

Several roles are typically involved in a postmortem.

Author: The person responsible for writing the postmortem. This is often the person who was on-call and responded to the incident.

Reviewers: These are subject matter experts who review the postmortem for accuracy and completeness. They help ensure the postmortem has all the necessary information and the proposed remediation steps are appropriate.

Approver: This is often a team lead or manager who approves the postmortem once it's complete.

Stakeholders: These are people who were affected by the incident or who have an interest in the outcome of the postmortem. They can include people from other teams, management, and potentially even customers.

Methods for Conducting a Postmortem¶

Google's SRE book suggests a structured method for conducting a postmortem. Here are the main steps:

Collect data: This includes any logs, monitoring data, error messages, user reports, or any other data related to the incident.

Create a timeline: This is a chronological list of events leading up to, during, and after the incident. This helps everyone understand the sequence of events and can help identify where things went wrong.

Determine the root cause: This is a careful analysis of the data and timeline to determine the underlying cause of the incident. The goal is not to assign blame, but to understand what went wrong.

Propose remediation items: These are specific actions to prevent the same incident from happening again in the future. They can include things like changes to code, changes to infrastructure, updates to documentation, or improvements to monitoring and alerting.

Write the postmortem: This is a document that describes the incident, the root cause, and the proposed remediation items. It should be written in a blameless manner and focus on learning from the incident.
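The steps above map naturally onto a small data structure: collected events go into a timeline, which is sorted chronologically, and the written document combines the timeline with the root cause and remediation items. The sketch below is one possible shape, not a prescribed template; the class names, fields, and the incident itself are made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    timestamp: datetime
    description: str


@dataclass
class Postmortem:
    title: str
    impact: str
    root_cause: str
    timeline: list = field(default_factory=list)
    remediation_items: list = field(default_factory=list)

    def add_event(self, timestamp, description):
        self.timeline.append(TimelineEntry(timestamp, description))

    def render(self):
        """Return the postmortem as plain text with a chronological timeline."""
        lines = [f"Postmortem: {self.title}", f"Impact: {self.impact}", "Timeline:"]
        for entry in sorted(self.timeline, key=lambda e: e.timestamp):
            lines.append(f"  {entry.timestamp:%Y-%m-%d %H:%M} {entry.description}")
        lines.append(f"Root cause: {self.root_cause}")
        lines.append("Remediation items:")
        lines.extend(f"  - {item}" for item in self.remediation_items)
        return "\n".join(lines)


# Illustrative, made-up incident; events are added in the order they were found
pm = Postmortem(
    title="Checkout outage",
    impact="Checkout unavailable for 40 minutes",
    root_cause="Configuration change removed a required database index",
    remediation_items=["Add index check to deploy pipeline", "Alert on query latency"],
)
pm.add_event(datetime(2024, 1, 1, 10, 40), "Change rolled back, service recovered")
pm.add_event(datetime(2024, 1, 1, 10, 0), "Configuration change deployed")
pm.add_event(datetime(2024, 1, 1, 10, 5), "Latency alert fired")

print(pm.render())
```

Note that the rendered text describes systems and changes rather than naming individuals, in keeping with the blameless approach.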

How to Start a Postmortem¶

Starting a postmortem usually involves the following steps:

Identify the need for a postmortem: Not all incidents require a postmortem. Generally, if an incident has caused a significant disruption to users or has revealed a flaw in the system that could lead to future disruptions, a postmortem is warranted.

Assign roles: Decide who will be the author, reviewers, and approver for the postmortem.

Collect data: Start gathering all the relevant data related to the incident.

Create the timeline: Start putting together the timeline of events.

Begin the analysis: Start analyzing the data and timeline to understand what went wrong.

Remember, the goal of a postmortem is not to blame individuals, but to learn from mistakes and improve systems and processes. It's about fostering a culture of learning and continuous improvement.
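Teams often find it useful to write down the criteria for the first step, deciding whether a postmortem is needed, so that the decision is consistent. The sketch below encodes the two conditions mentioned above. The 30-minute threshold for "significant" disruption and the names `Incident` and `needs_postmortem` are assumptions for illustration; each team would choose its own criteria.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    user_facing_downtime_minutes: float
    revealed_systemic_flaw: bool


def needs_postmortem(incident, downtime_threshold_minutes=30):
    """Return True if the incident meets either (assumed) postmortem criterion."""
    significant_disruption = (
        incident.user_facing_downtime_minutes >= downtime_threshold_minutes
    )
    return significant_disruption or incident.revealed_systemic_flaw


# Illustrative, made-up incidents
print(needs_postmortem(Incident(45, False)))  # long user-facing outage
print(needs_postmortem(Incident(2, True)))    # brief, but exposed a systemic flaw
print(needs_postmortem(Incident(2, False)))   # brief and isolated
```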

Embracing Risk¶

Google doesn't try to build 100% reliable services, because extreme reliability comes with costs. Maximizing stability can limit the speed at which new features can be developed and delivered to users and can dramatically increase their cost. Users typically don't notice the difference between high reliability and extreme reliability in a service because the user experience is dominated by less reliable components like the cellular network or the device they are using [1].

Managing Risk¶

In Site Reliability Engineering (SRE), risk is managed by balancing the risk of unavailability with the goals of rapid innovation and efficient service operations. The cost of increasing reliability doesn't increase linearly; an incremental improvement in reliability may cost significantly more than the previous increment. This costliness has two dimensions: the cost of redundant machine/compute resources, and the opportunity cost, which is the cost borne by an organisation when it allocates engineering resources to build systems or features that diminish risk instead of features that are directly visible to or usable by end users [1].

Measuring Service Risk¶

At Google, the most straightforward way of representing risk tolerance is in terms of the acceptable level of unplanned downtime, which is expressed in terms of the number of "nines" they would like to provide: 99.9%, 99.99%, or 99.999% availability. Each additional nine corresponds to an order of magnitude improvement toward 100% availability. However, for globally distributed services, instead of using metrics around uptime, Google defines availability in terms of the request success rate.
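To make the "nines" concrete, the sketch below converts an availability target into the unplanned downtime it allows per year, and computes availability as a request success rate. It assumes a 365-day year, and the function names and request counts are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted by a time-based availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def request_availability(successful_requests, total_requests):
    """Availability measured as the fraction of requests that succeeded."""
    if total_requests == 0:
        raise ValueError("total_requests must be greater than zero")
    return successful_requests / total_requests


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")

# Illustrative, made-up request counts for one day
print(f"Request success rate: {request_availability(999_500, 1_000_000):.4%}")
```

Under these assumptions, 99.9% allows roughly 8.8 hours of downtime per year, 99.99% about 53 minutes, and 99.999% just over 5 minutes, which illustrates the order-of-magnitude step for each additional nine.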

In summary, while both methods aim to uncover the causes of problems, Root Cause Analysis is usually more narrowly focused on identifying and addressing the core cause of an individual problem, while a Postmortem is a broader review of performance, problems, and success factors after a project or major incident.