Hi everyone. Thank you for joining my talk about navigating the chaos, aka production incidents. When I was in high school, the common belief was that if you were actively listening in class, you would have 50% of the exam prep already in your pocket. I want to show you how I adapted this belief into an actual proactive approach that you can take, one that will help you manage incidents more efficiently, in a more structured way, and eventually preserve much-needed hours of sleep.
But first of all, hi, my name is Hila Fish. I'm a senior DevOps engineer. I have a lot of things to say about myself, but the most important thing that you need to know about me in terms of this presentation is that I have handled a lot of production incidents, both when I was on call and when I wasn't, at big corporates and at startups. I've seen a lot of things. So this is why I'm able to bring the things that I learned along the way to this presentation.
So let's cover the agenda for today. We will first of all cover the mindset that you should have in order to really practice and manage incidents efficiently. Then we will cover the incident flow, aka a structured process that you can follow in order to manage incidents efficiently. And finally, being proactive: things that you can do in your day to day, and after an incident has taken place, that will help you come prepared for the next incident that will happen.
So first of all, let's set a baseline here. Incident management is a set of procedures and actions taken to resolve a critical incident. It is an end-to-end process that defines how incidents are detected and communicated, who is responsible for handling them, what tools are used to investigate and respond to them, and what steps are taken towards a solution. And a thing that we really need to think about when we come to deal with incidents in production is, first of all, that not all pages that you get through Opsgenie or PagerDuty or any other tool that you're using become an incident. When is it an incident? When you have a loss or potential loss in revenue, customers, data, or reputation.
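To make the page-versus-incident distinction concrete, here is a minimal sketch in Python. It is only an illustration of the rule stated above, not how Opsgenie or PagerDuty classify alerts: the `Page` class, its field names, and the `is_incident` helper are all hypothetical, and in practice someone has to judge whether each kind of loss is actually on the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """One alert from a paging tool. All fields are hypothetical, for illustration only."""
    summary: str
    revenue_at_risk: bool = False
    customers_at_risk: bool = False
    data_at_risk: bool = False
    reputation_at_risk: bool = False


def is_incident(page: Page) -> bool:
    """A page is an incident when there is a loss, or potential loss,
    in revenue, customers, data, or reputation."""
    return any((
        page.revenue_at_risk,
        page.customers_at_risk,
        page.data_at_risk,
        page.reputation_at_risk,
    ))


# Two made-up pages: one is just a page, the other is an incident.
disk_warning = Page("Disk at 80% on a staging host")
checkout_down = Page(
    "Checkout API returning 500s",
    revenue_at_risk=True,
    customers_at_risk=True,
)

print(is_incident(disk_warning))   # False
print(is_incident(checkout_down))  # True
```

The point of writing it down this way is that the decision depends on business impact, not on how loud the alert is.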
And if we don't have an incident management process, if we are just in an ad hoc, putting-out-fires kind of approach and mindset, it means that we could lose valuable data, downtime could lead to reduced productivity and revenue, and the business could be held in breach of its service level agreements. Every company has its own SLAs, and we want to avoid breaching those SLAs. So it means that we need to avoid being in an ad hoc mode of, okay, something happened, now we need to put out this fire. We need to have a structured process towards resolving incidents.
And how do we do that? First of all, by reframing our perspective: we need to have a business mindset. Meaning that whenever you deal with something, whenever you implement something at work, whenever you do anything at work, you need to think not only about the systems that you are incorporating and implementing, but also about the why. Why are we doing things in a certain way? Why do we have this system? How does it help us do things, and how does it help the business succeed? So a business mindset is needed in order to grasp the overall impact of incidents and mitigate damages accordingly. And this is why it has to be a structured process. This is where you will incorporate the business mindset and make sure that things are handled as quickly as possible for the sake of the business.
And in order to really, let's say, manage incidents efficiently, the number one rule of managing incidents, and of being a better engineer in general, is to know that everything fails all the time. Who said this, by the way, you would ask? First of all, me. I said that a lot throughout my career.