Runbook: Reduce business continuity risks and bus factors

Amir So
4 min read · Jan 16, 2024


In operations, we handle both planned and unplanned tasks with confidence. When we encounter incidents or tickets with unknown solutions, we take charge and strive to find the best possible resolution. To resolve an issue, we have several options available to us. We could perform a quick Google search, consult the company wiki or documentation, check for shared scripts in appropriate locations, ask a coworker for help, or escalate the issue to a different department. However, attempting to solve a problem without following the company’s best practices could be time-consuming and may not yield the desired outcome.

It’s all about simplifying complexity!

Any team in any profession (Sales, Marketing, HR, etc.) can have runbooks; my main focus is on the tech side. Some time ago, I dealt with a recurring incident that required the same manual fix at least once a month (please don't ask why we didn't fix the underlying issue; that's a separate story :P). I was the only one who knew how to deal with it, and no one else could handle it. Yeah, bus factor: the number of people who would have to be unavailable before the work stalls, which in this case was one. I prepared and shared a runbook with the team (GitHub Wiki), but they still needed to go through all the steps by hand because the document was not executable (I will get into that a bit later). Each step had to be followed by a person, one by one, which took time and sometimes went wrong.

What is a runbook?

A runbook is a set of instructions written for a person who is not familiar with a particular system or workflow. That may be because they are new to it, because they were paged in the middle of the night (everyone wants to get back to bed ASAP 😴), because they wrote the instructions but haven't worked on the system for a few months or years, or because they don't really want to burn calories remembering those steps!

Runbook 📝 vs 📖 Playbook

If a Runbook is a recipe, the Playbook would be the guidebook for hosting a social event. The recipe is needed to cook the meals effectively, but the food is just one aspect of the entire event.

The playbook takes into account the big picture, while a runbook documents the instructions to complete a specific task. If you need to document a large process, hand off an entire project, or are spending too much time on a project that could be delegated, go for a playbook. However, go for a runbook if you need to document a specific task or are spending too much time explaining step-by-step instructions to your teammates or direct reports.

If you’re an engineering manager, you can have a playbook for onboarding and training a new engineer during their first 90 days. The playbook can include information on the welcome emails, scheduling 1:1s, granting permissions, and a team presentation to go out on a new hire’s first day. Another playbook could cover off-boarding an individual contributor (IC). Alongside these, you might keep a list of runbooks for different scenarios and requests (not only incidents).

Many engineers keep their own runbooks saved on their computers to simplify their work processes. At some point, they may share these runbooks with their team, and benefit from auditing and updating them regularly.

Reduce Risks and Bus Factors

At the beginning of this blog post, I shared an example from my experience that illustrates the bus factor well, along with the cost of periodically spending too much time on the same task. Creating and using runbooks is crucial to mitigating the bus factor and reducing risk. Think of runbooks as your organization’s proactive guidebook, detailing step-by-step procedures for common operational tasks and issue resolution. By documenting and regularly updating these runbooks, you empower your team, both seasoned professionals and newcomers, with the knowledge and steps needed to swiftly and accurately address various scenarios.

Create a runbook

Modern implementations have introduced the concept of executable runbooks where, along with a well-defined process, operators can execute pre-written code blocks or database queries against a given environment.
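To make the idea of an executable runbook concrete, here is a minimal sketch in Python. It is not how any particular tool implements runbooks; the `Step` class, the `run_runbook` function, and the two example steps are hypothetical names made up for illustration. The point is only that each documented step is paired with code an operator can run, and that execution stops at the first failure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One runbook step: a human-readable description plus the code that performs it."""
    description: str
    action: Callable[[], str]  # returns a short result message


def run_runbook(steps: List[Step]) -> List[str]:
    """Run the steps in order, stop at the first failure, and return a log."""
    log = []
    for number, step in enumerate(steps, start=1):
        try:
            result = step.action()
        except Exception as error:
            log.append(f"{number}. {step.description}: FAILED ({error})")
            break  # do not continue past a failed step
        log.append(f"{number}. {step.description}: {result}")
    return log


# Hypothetical steps; real actions would call your own scripts or queries.
steps = [
    Step("Check disk usage", lambda: "72% used"),
    Step("Restart worker", lambda: "restarted"),
]

for line in run_runbook(steps):
    print(line)
```

Keeping the description next to the action means the same file still reads as documentation for someone who prefers to follow the steps by hand.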

The best time to create a runbook is before it is needed as part of an incident response. Take advantage of new people joining your team and have them follow experienced team members and document your procedures. This is both training and preservation of institutional knowledge by capturing it in runbooks.

I consider a runbook good when it is simple, understandable, dedicated to specific procedures, appropriately version-controlled, can only be executed by authenticated and authorized people and resources, easily executable, and, most importantly, well-maintained and updated!
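Two of these properties, "only executed by authorized people" and "well-maintained and updated", are easy to check mechanically. The sketch below assumes a runbook carries a small metadata record; `RunbookMeta`, `can_execute`, `is_stale`, the role names, and the 180-day review window are all assumptions chosen for illustration, not a standard.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class RunbookMeta:
    """Hypothetical metadata stored alongside a runbook."""
    name: str
    version: str
    last_reviewed: date
    allowed_roles: FrozenSet[str]


def can_execute(meta: RunbookMeta, user_roles: Iterable[str]) -> bool:
    """True if the user holds at least one role allowed to run this runbook."""
    return bool(meta.allowed_roles & set(user_roles))


def is_stale(meta: RunbookMeta, today: date, max_age_days: int = 180) -> bool:
    """True if the runbook has not been reviewed within the allowed window."""
    return today - meta.last_reviewed > timedelta(days=max_age_days)


# Example values are made up.
meta = RunbookMeta(
    name="recover-media-server",
    version="1.2.0",
    last_reviewed=date(2023, 6, 1),
    allowed_roles=frozenset({"sre", "on-call"}),
)

print(can_execute(meta, ["developer", "on-call"]))  # True
print(is_stale(meta, today=date(2024, 1, 16)))      # True
```

A check like `is_stale` could run on a schedule and open a ticket for the owning team, so that the review actually happens.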

In this post, I’m not going to dive into the technical aspect, but I suggest taking a look at GitLab's Runbooks document for inspiration on developing and preparing an environment to manage Executable Runbooks.

Understand the main idea so that you can express it in different ways.

Evolve a runbook

The first iteration, or the first few iterations, of a runbook will likely be triggered and carried out manually. For example, recovering a media server that has crashed, or fixing the state of a process that got lost due to a bug, should be performed by a human before being automated!

The runbook can be activated through an API or a ticketing system, which allows for improvement as the process evolves. As the runbook evolves, the automated response can take on more of the responsibilities of the engineer in charge of executing the runbook and may eventually automate the entire process.
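One way to picture this evolution is a runbook where each step is either still manual or already automated, and the runner handles both. This is a sketch under my own assumptions: `Step`, `execute`, `automation_ratio`, and the media-server steps are hypothetical, and the `confirm` callback stands in for however a human acknowledges a manual step (a prompt, a ticket comment, a chat button).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """A step starts as a manual instruction; automation is added in later iterations."""
    instruction: str
    automation: Optional[Callable[[], None]] = None


def execute(steps: List[Step], confirm: Callable[[str], bool]) -> List[str]:
    """Run automated steps directly; ask a human to confirm manual ones."""
    outcome = []
    for step in steps:
        if step.automation is not None:
            step.automation()
            outcome.append(f"auto: {step.instruction}")
        elif confirm(step.instruction):
            outcome.append(f"manual: {step.instruction}")
        else:
            outcome.append(f"stopped: {step.instruction}")
            break  # the operator could not complete this step
    return outcome


def automation_ratio(steps: List[Step]) -> float:
    """Share of steps that no longer need a human."""
    return sum(step.automation is not None for step in steps) / len(steps)


restarted = []  # stands in for a real side effect
steps = [
    Step("Drain traffic from the media server"),
    Step("Restart the media server process",
         automation=lambda: restarted.append("media-server")),
    Step("Verify streams are healthy"),
]

print(execute(steps, confirm=lambda instruction: True))
print(automation_ratio(steps))
```

In an interactive session, `confirm` could be something like `lambda text: input(f"{text} - done? [y/n] ") == "y"`. As more steps gain an `automation`, the ratio approaches 1.0 and the same `execute` function can be called from an API handler or a ticketing webhook without a human in the loop.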

Don’t forget manual steps are error-prone, and humans make mistakes. So, I’d recommend having executable runbooks after a few iterations.

Thanks for reading!❤️
