How to Build an Effective Runbook for On-Call

The dos and don'ts of creating runbooks for your on-call

Wilson Spearman

Co-founder of Parity

Aug 23, 2024

Automated runbook execution gif

First, when and why should you create a runbook?

Runbooks can be a core piece of incident management for software development teams. They help prevent human error, define standard operating procedures, and allow teams to quickly resolve common incidents. They are not a substitute for giving your team the resources to fix underlying issues, but they are a great way to handle recurring problems and share knowledge between team members.

Example runbook steps

The Best Practice for Writing a Runbook

Runbooks should provide the on-call engineer with an overview of how to handle a given alert or type of problem. Each one should be focused on solving a single problem or alert. It lays out the steps needed to validate the issue, determine its severity, and respond to it. Since someone could be following these steps after an alert at 2am or during a stressful incident, it's essential to make sure they're clear, easy to understand, and, most of all, accurate.

When first writing runbooks, it can be overwhelming to document every single step in full detail. A better starting point is to focus on the big-picture overview of how to solve a given problem. What dashboards does someone need to look at? How can they validate the problem? Are there commands they should run? As a runbook matures, flesh out more details.
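To make the "big picture first" idea concrete, here is a minimal sketch of a runbook outline as a small data structure. The `Runbook` class, its field names, and the example alert are all hypothetical, chosen only to mirror the three questions above; sections that haven't been written yet render as `TODO` so the gaps stay visible as the runbook matures.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    """Big-picture outline of a runbook; all field names are illustrative."""

    alert: str  # the single alert or problem this runbook covers
    dashboards: list[str] = field(default_factory=list)  # where to look first
    validation: list[str] = field(default_factory=list)  # how to confirm the problem
    commands: list[str] = field(default_factory=list)    # commands worth running

    def render(self) -> str:
        """Return the outline as plain text an on-call engineer can skim."""
        lines = [f"Runbook: {self.alert}"]
        sections = [
            ("Dashboards", self.dashboards),
            ("Validate the problem", self.validation),
            ("Commands", self.commands),
        ]
        for title, items in sections:
            lines.append(f"{title}:")
            if items:
                lines.extend(f"  {i}. {item}" for i, item in enumerate(items, 1))
            else:
                # Unwritten sections are shown as TODO rather than hidden.
                lines.append("  TODO")
        return "\n".join(lines)


# Hypothetical first draft: only the parts the author already knows.
draft = Runbook(
    alert="High API error rate",
    dashboards=["API latency and error dashboard"],
    validation=["Confirm the error rate has been elevated for several minutes"],
)
print(draft.render())
```

Starting from an outline like this keeps the first version cheap to write while making it obvious which sections still need detail.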

One of the best ways to build up runbooks is to make writing them a part of your on-call retrospective. The outgoing engineer should be responsible for tracking which issues came up throughout the week. It's then easy to determine which issues are covered by existing runbooks, which need a fix for the root cause, and which need a new runbook.
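The retrospective sorting described above can be sketched as a small helper. This is an illustration under stated assumptions, not a prescribed process: the `triage` function, the bucket labels, and the `root_cause_fixable` flag (a judgment call recorded by the outgoing engineer) are all hypothetical.

```python
def triage(issues, existing_runbooks):
    """Sort a week's issues into the three retrospective buckets.

    issues: list of (alert_name, root_cause_fixable) tuples logged by the
        outgoing engineer; the boolean is their own judgment call.
    existing_runbooks: set of alert names that already have a runbook.
    """
    buckets = {
        "covered by existing runbook": [],
        "needs root-cause fix": [],
        "needs new runbook": [],
    }
    for alert, root_cause_fixable in issues:
        if alert in existing_runbooks:
            buckets["covered by existing runbook"].append(alert)
        elif root_cause_fixable:
            buckets["needs root-cause fix"].append(alert)
        else:
            buckets["needs new runbook"].append(alert)
    return buckets


# Hypothetical week of on-call issues.
week = [
    ("Disk usage above 90%", False),
    ("Pod in CrashLoopBackOff", False),
    ("Nightly job timeout", True),
]
for bucket, alerts in triage(week, existing_runbooks={"Disk usage above 90%"}).items():
    print(f"{bucket}: {alerts}")
```

In practice a team may decide that an issue already covered by a runbook also deserves a root-cause fix; the sketch keeps each issue in a single bucket only for simplicity.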

Consider the End User

If you're on a small team of three very senior engineers, your runbooks will look different from those for a team of 10 engineers of varying levels. Consider the end user of the runbook when building it. Will they need extremely detailed steps? Or can they handle an open-ended investigation to find the root cause on their own? If the runbook is for less senior engineers, make sure each step is prescriptive and leaves no room for human error. The worst outcome is someone following a runbook and turning what would have been a routine, no-brainer problem into a massive incident.

Runbook for pod in crashloop backoff

Using Parity for Runbook Automation!

Just because your team has built up a library of runbooks, that doesn't mean on-call disappears. There remains a manual process required to solve even the common problems documented in runbooks. The on-call engineer still has to follow the runbook each time the corresponding alert triggers.

Runbook automation with Parity

Parity provides runbook automation that leverages AI. When your monitoring system triggers an alert, Parity autonomously follows a runbook entry to respond to the alert as the first step in your incident response process. Parity also provides a detailed record of the execution to the on-call engineer so they can check each runbook step and use it as a starting point if they need to investigate the issue further. Since Parity takes care of runbook execution on your behalf, engineering teams have more time to fix underlying issues, and on-call is less time-consuming and frustrating for engineers.

How Runbooks Help Team Members

Runbooks are an essential part of knowledge sharing and are especially useful for onboarding new team members. Well-documented runbooks make it much easier for engineers to take on their first on-call rotation. It's important for teams to create runbooks as new workflows and problems are established, so a runbook template is a great way to help your team establish best practices.
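One lightweight way to enforce a template is to check each new runbook for the sections the team has agreed on. The sketch below assumes a plain-text format where each section starts with a `Heading:` line; the heading names are hypothetical, loosely based on the validate, severity, and respond steps described earlier, and the `missing_sections` helper is illustrative only.

```python
# Assumed template headings; a real team would choose its own.
REQUIRED_SECTIONS = ("Alert", "Validate", "Severity", "Respond")


def missing_sections(runbook_text, required=REQUIRED_SECTIONS):
    """Return the required headings that have no 'Heading:' line in the text."""
    present = {
        line.split(":", 1)[0].strip()
        for line in runbook_text.splitlines()
        if ":" in line
    }
    return [name for name in required if name not in present]


# Hypothetical draft that forgot to say how to judge severity.
draft = """\
Alert: Pod in CrashLoopBackOff
Validate: Check the pod status and recent restarts
Respond: Inspect the logs from the previous container run
"""
print(missing_sections(draft))
```

A check like this could run in code review or CI so that incomplete runbooks are caught before someone relies on them during an incident.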

If you put in the effort to have well-documented runbooks that are genuinely useful for solving on-call problems and determining the root cause, new team members won't need to wait nearly as long to hop on their first on-call rotation. It definitely takes establishing good habits as a team, but it can greatly reduce human error and improve incident management.