Navigating the Chaos: A Holistic Approach to Incident Management


Incident management can be challenging and throw curveballs at you: unexpected issues that result in data loss, downtime, and money and hours of sleep going to waste. BUT! There are practical things you can do to make the process smoother and handle it better.
Remember when we were at school, and people said - "Actively listening in class guarantees 50% prep for the upcoming test"?
The same goes for being proactive at work in ways that will instantly prepare you to manage incidents better (at night or in general).
In this talk, I'll cover the proactive steps you can take and incorporate into your day-to-day routine to prepare for a smoother and more efficient incident management process.
I will also show the best practices I've refined over the years that helped me get a clear vision of how to manage production incidents in the quickest and most efficient way possible.
Embracing the tips I'll give you will guarantee you'll not only talk the talk but also walk the walk when it comes to incident management.

This talk has been presented at DevOps.js Conf 2024, check out the latest edition of this Tech Conference.

FAQ

Incident management is a set of procedures and actions taken to resolve a critical incident. It involves detecting and communicating incidents, assigning responsibility for handling them, utilizing tools for investigation and response, and executing steps towards resolution.

A proactive approach helps in managing production incidents by preparing for incidents before they occur, improving the mean time to resolution, reducing costs by minimizing downtime, and preserving the business's customers and reputation.

Having a business mindset in incident management means understanding the broader impact of incidents on the business, which includes the potential loss in revenues, customers, data, and reputation. This mindset helps in prioritizing actions and decision-making processes that align with business goals.

The structured process in incident management includes five pillars: identify and categorize the incident, notify and escalate to the relevant parties, investigate and diagnose the issue, determine and apply remediation steps, and review and update the incident runbooks and alerts post-incident.

A postmortem after a critical incident should include a detailed discussion about what went wrong, what was done to resolve the issue, lessons learned, and actions that can be taken to prevent future incidents. It's important to conduct this in a blame-free manner to focus on improvement.

The philosophy 'Everything fails all the time' reminds us that failures are inevitable in any system. Adopting this mindset in incident management prepares teams to handle failures proactively rather than reactively, minimizing downtime and reducing the impact on business operations.

Best practices for conducting a war room during a critical incident include maintaining focus on relevant information, assigning clear roles and responsibilities, reducing noise by limiting unnecessary participation, and ensuring documentation like runbooks are available and up to date for efficient handling of the incident.

On-call shift handoffs improve incident management by providing subsequent teams with a summary of incidents handled, actions taken, and any unresolved issues. This ensures continuity and preparedness, helping teams respond more effectively to incidents.
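
As a minimal sketch of what such a handoff summary could look like in code: the class names, field names, and output layout below are assumptions made for illustration, not a format prescribed in the talk. The sketch captures the three things named above (incidents handled, actions taken, and unresolved issues) and renders them as a short note for the incoming engineer.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HandledIncident:
    """One incident the outgoing on-call engineer dealt with (hypothetical shape)."""
    title: str
    actions_taken: list[str]
    resolved: bool


@dataclass
class ShiftHandoff:
    """Hypothetical handoff note; all field names are illustrative."""
    outgoing: str
    incoming: str
    incidents: list[HandledIncident] = field(default_factory=list)

    def unresolved(self) -> list[HandledIncident]:
        """Incidents the incoming engineer still has to watch or finish."""
        return [i for i in self.incidents if not i.resolved]

    def render(self) -> str:
        """Build a plain-text summary: one line per incident, then its actions."""
        lines = [f"On-call handoff: {self.outgoing} -> {self.incoming}"]
        for incident in self.incidents:
            status = "resolved" if incident.resolved else "OPEN"
            lines.append(f"- [{status}] {incident.title}")
            for action in incident.actions_taken:
                lines.append(f"    * {action}")
        lines.append(f"Unresolved issues: {len(self.unresolved())}")
        return "\n".join(lines)
```

Putting the unresolved count on the last line is one possible design choice: it lets the incoming engineer see at a glance whether anything needs attention before reading the details.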

Hila Fish
26 min
15 Feb, 2024

Video Summary and Transcription

This talk covers the importance of a structured process for incident management and the need for a business mindset. It outlines a five-pillar structured process and emphasizes the importance of staying calm and asking the right questions during incidents. The talk also highlights the importance of effectively identifying, categorizing, and investigating incidents, as well as prioritizing root causes and communicating incident resolutions. Additionally, it discusses the role of incident managers, proactive measures for continuous improvement, and the importance of preparation and a proactive mindset.

1. Introduction to Incident Management

Short description:

In this part of the talk, I will cover the importance of a structured process for incident management and the need for a business mindset. I will also emphasize the rule of knowing that everything fails all the time.

Hi everyone. Thank you for joining my talk about navigating the chaos, aka production incidents. When I was in high school, the common belief was that if you actively listened in class, you would have 50% of the exam prep already in your pocket. I want to show you how I adapted this belief into an actual proactive approach that you can take, one that will help you manage incidents more efficiently, in a more structured way, and eventually preserve much-needed hours of sleep.

But first of all, hi, my name is Hila Fish. I'm a senior DevOps engineer. I have a lot of things to say about myself, but the most important thing you need to know about me in terms of this presentation is that I have handled a lot of production incidents, both when I was on call and when I wasn't, at big corporates and at startups. I've seen a lot of things. So this is why I'm able to bring the things that I learned along the way to this presentation.

So let's cover the agenda for today. We will first of all cover the mindset that you should have in order to really practice and manage incidents efficiently. Then we will cover the incident flow, aka a structured process that you can follow in order to manage incidents efficiently. And finally, being proactive: things that you can do in your day-to-day and after an incident has taken place that will help you come prepared for the next incident that will happen.

So first of all, let's set a baseline here. Incident management is a set of procedures and actions taken to resolve a critical incident. And it basically means that it is an end-to-end process that defines how incidents are detected and communicated, who is responsible for handling them, what tools are used to investigate and respond to them, and what steps are taken towards a solution. And a thing that we really need to think about when we come to deal with incidents in production is, first of all, that not all pages that you get through Opsgenie or PagerDuty or any other tool that you're using become an incident. When is it an incident? When you have a loss or potential loss in revenues, customers, data, and reputation.
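
The distinction between a page and an incident can be sketched as a small triage check. This is only an illustration under stated assumptions: the `Page` type, its fields, and the `is_incident` function are hypothetical names, and the only rule encoded is the one stated above, namely that a page becomes an incident when there is a loss or potential loss in revenues, customers, data, or reputation.

```python
from dataclasses import dataclass

# The four impact areas named in the talk.
IMPACT_AREAS = ("revenue", "customers", "data", "reputation")


@dataclass(frozen=True)
class Page:
    """A page from an alerting tool (hypothetical shape, for illustration only)."""
    summary: str
    # Impact areas with an actual or potential loss, as judged by the responder.
    impacted: frozenset = frozenset()


def is_incident(page: Page) -> bool:
    """A page is treated as an incident only if it touches at least one impact area."""
    return any(area in page.impacted for area in IMPACT_AREAS)


# Usage: a noisy disk warning on a test box vs. a failing checkout flow.
noise = Page("Disk at 81% on a test box")
outage = Page("Checkout returns 500s", frozenset({"revenue", "customers"}))
print(is_incident(noise), is_incident(outage))
```

In practice the judgment about which areas are impacted is the hard part and stays with the person on call; the sketch only makes the decision rule explicit.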

And if we don't have an incident management process, if we are just in an ad hoc, putting-out-fires kind of approach and mindset, it means that we will potentially lose valuable data, downtime could lead to reduced productivity and revenues, and the business could be held in breach of its service level agreements. Every company has its own SLAs, and we want to avoid breaching those SLAs. So it means that we need to avoid being in an ad hoc manner of, okay, something happened, now we need to put out this fire. We need to have a structured process for resolving incidents.
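
To make the SLA point concrete, here is a small worked calculation of how much downtime an availability target leaves room for. The function names, the 30-day period, and the 99.9% figure are assumptions chosen for illustration; real SLAs define their own targets and measurement windows.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Downtime allowed in the period before the availability target is breached."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def remaining_budget_minutes(
    availability_pct: float,
    downtime_so_far_minutes: float,
    period_days: int = 30,
) -> float:
    """Minutes of downtime still available; a negative value means the SLA is breached."""
    budget = allowed_downtime_minutes(availability_pct, period_days)
    return budget - downtime_so_far_minutes


# Usage: a hypothetical 99.9% target over 30 days allows roughly 43.2 minutes.
budget = allowed_downtime_minutes(99.9)
left = remaining_budget_minutes(99.9, downtime_so_far_minutes=30)
print(f"budget: {budget:.1f} min, remaining after a 30-minute outage: {left:.1f} min")
```

With a budget this small, a single incident handled ad hoc can consume most of a month's allowance, which is one way to see why the time to resolution matters to the business.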

And how do we do that? First of all, by reframing our perspective: we need to have a business mindset. Meaning that whenever you deal with something, whenever you implement something at work, whenever you do anything at work, you need to think not only about the systems that you are incorporating and implementing, but also about the why. Why are we doing things in a certain way? Why do we have this system? How does it help us do things, and how does it help the business succeed? So a business mindset is needed in order to grasp the overall impact of incidents and mitigate damages accordingly. And this is why it has to be a structured process. This is where you will incorporate the business mindset and make sure that things are handled as quickly as possible for the sake of the business.

And in order to really, let's say, manage incidents efficiently, the number one rule of managing incidents and to be a better engineer in general is to know that everything fails all the time. Who said this, by the way? You would ask. First of all, me. I said that a lot throughout my career.

2. Structured Process and Types of People

Short description:

Incidents are mayhem by nature. Having a structured process will help prevent incidents, improve mean time to resolution, reduce costs, and preserve the business and reputation. We will follow a structured process with five pillars. I will share questions to ask and answer in each pillar to progress towards incident resolution. Two types of people in incident management: those who stay calm and those who can't. Asking the right questions will help you stay calm and progress through each phase.

But first of all, I think it is very odd to quote myself. And second of all, I found someone with a little more credit in the industry than me: Amazon CTO Werner Vogels. So take his word for it. Everything fails all the time. Production systems, development environments, pipelines, things that we build, things that we buy, systems that we rely on to know that production is down, aka our monitoring systems. Even us as human beings, we crash and we need to sleep and then restart ourselves. Everything fails all the time.

And that's exactly it. Incidents are mayhem by nature. But if we have this fact, if we know that failures are a given because everything fails all the time, then we can't be in an ad hoc manner of putting out fires. We can say, OK, this happened, but I'm prepared to deal with it. So this is the whole idea of having a structured process. A structured process will help lead to incident prevention, improved mean time to resolution, cost reduction because downtime was reduced or eliminated entirely, and preserving our business's customers and reputation.

And how do we do that? We have a structured process that we can follow here, five pillars, and we'll go over each pillar and I will show you questions that you can ask and answer in each pillar in order to progress to the next one.

But beforehand, I want to show you two types of people that I met throughout my journey of managing production incidents. First of all, the one that says, "Keep calm, I'm an engineer." And the other type that says, "I can't keep calm, I'm an engineer." These are the two types that I met. And I say that you can keep calm if you ask yourselves the questions I'm going to share with you in each pillar. They will help you progress towards the next phase, up until the incident resolution.
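
The idea of progressing through the pillars by answering questions can be sketched as a small gated sequence. The pillar names follow the five pillars listed in the FAQ above; everything else (the `IncidentFlow` class, its methods, and the sample question) is a hypothetical illustration, since the actual questions for each pillar are not part of this section.

```python
from __future__ import annotations

from enum import IntEnum


class Pillar(IntEnum):
    """The five pillars, in order, as listed in the FAQ."""
    IDENTIFY_AND_CATEGORIZE = 1
    NOTIFY_AND_ESCALATE = 2
    INVESTIGATE_AND_DIAGNOSE = 3
    REMEDIATE = 4
    REVIEW_AND_UPDATE = 5


class IncidentFlow:
    """Hypothetical helper: move to the next pillar only once its questions are answered."""

    def __init__(self, questions: dict[Pillar, list[str]]):
        self.current = Pillar.IDENTIFY_AND_CATEGORIZE
        # Questions still open per pillar; pillars without questions start empty.
        self._open = {p: set(questions.get(p, [])) for p in Pillar}

    def answer(self, question: str) -> None:
        """Mark a question of the current pillar as answered."""
        self._open[self.current].discard(question)

    def advance(self) -> Pillar:
        """Progress to the next pillar, refusing while questions remain open."""
        if self._open[self.current]:
            raise RuntimeError(f"open questions in {self.current.name}")
        if self.current is not Pillar.REVIEW_AND_UPDATE:
            self.current = Pillar(self.current + 1)
        return self.current


# Usage with a placeholder question (not taken from the talk).
flow = IncidentFlow({Pillar.IDENTIFY_AND_CATEGORIZE: ["What is the business impact?"]})
flow.answer("What is the business impact?")
print(flow.advance().name)
```

The gate is deliberately strict in this sketch; a real process would likely allow moving on with a question explicitly marked as unknown rather than blocking.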
