Video Summary and Transcription
This talk covers the importance of a structured process for incident management and the need for a business mindset. It outlines a five-pillar structured process and emphasizes the importance of staying calm and asking the right questions during incidents. The talk also highlights the importance of effectively identifying, categorizing, and investigating incidents, as well as prioritizing root causes and communicating incident resolutions. Additionally, it discusses the role of incident managers, proactive measures for continuous improvement, and the importance of preparation and a proactive mindset.
1. Introduction to Incident Management
In this part of the talk, I will cover the importance of a structured process for incident management and the need for a business mindset. I will also emphasize the rule of knowing that everything fails all the time.
Hi everyone. Thank you for joining my talk about navigating the chaos, aka production incidents. When I was in high school, the common belief was that if you were actively listening in class, you would have 50% of the exam prep already in your pocket. I want to show you how I adapted this belief into an actual proactive approach that you can take, one that will help you manage incidents more efficiently, in a more structured way, and eventually preserve much-needed hours of sleep.
But first of all, hi, my name is Hila Fish. I'm a senior DevOps engineer. I have a lot of things to say about myself, but the most important thing you need to know about me in terms of this presentation is that I have handled a lot of production incidents, when I was on call and when I wasn't on call, at big corporates and at startups. I've seen a lot of things. So this is why I'm able to bring the things that I learned along the way to this presentation.
So let's cover the agenda for today. We will first of all cover the mindset that you should have in order to really manage incidents efficiently. Then we will cover the incident flow, aka a structured process that you can follow in order to manage incidents efficiently. And then being proactive: things that you can do in your day-to-day and after an incident took place that will help you come prepared for the next incident that will happen.
So first of all, let's set a baseline here. Incident management is a set of procedures and actions taken to resolve a critical incident. And it basically means that it is an end-to-end process that defines how incidents are detected and communicated, who is responsible for handling them, what tools are used to investigate and respond to them, and what steps are taken towards a solution. And a thing that we really need to think about when we come to deal with incidents in production is, first of all, that not all pages that you get through Opsgenie or PagerDuty or any other tool that you're using become an incident. When is it an incident? When you have a loss or potential loss in revenue, customers, data, or reputation.
And if we don't have an incident management process, if we are just in an ad hoc, putting-out-fires kind of approach and mindset, it means that we will potentially lose valuable data, downtime could potentially lead to reduced productivity and revenue, and the business could be in breach of service level agreements. Every company has its own SLAs and we want to avoid breaching those SLAs. So it means that we need to avoid being in an ad hoc manner of, okay, something happened, now we need to put out this fire. We need to have a structured process towards resolving incidents.
And how do we do that? First of all, by reframing our perspective. We need to have a business mindset. Meaning that whenever you deal with something, whenever you implement something at work, whenever you do anything at work, you need to think not only about the systems that you are incorporating and implementing, but also about the why. Why are we doing things in a certain way? Why do we have this system? How does it help us do things and how does it help the business succeed? So a business mindset is needed in order to grasp the overall impact of incidents and mitigate damages accordingly. And this is why it has to be a structured process. This is where you will incorporate the business mindset and make sure that things are handled as quickly as possible for the sake of the business.
And in order to really, let's say, manage incidents efficiently, the number one rule of managing incidents and to be a better engineer in general is to know that everything fails all the time. Who said this, by the way? You would ask. First of all, me. I said that a lot throughout my career.
2. Structured Process and Types of People
Incidents are mayhem by nature. Having a structured process will help prevent incidents, improve mean time to resolution, reduce costs, and preserve the business and reputation. We will follow a structured process with five pillars. I will share questions to ask and answer in each pillar to progress towards incident resolution. Two types of people in incident management: those who stay calm and those who can't. Asking the right questions will help you stay calm and progress through each phase.
But first of all, I think it is very odd to quote myself. And second of all, I found someone with a little bit more credit in the industry than me: AWS CTO Werner Vogels. So take his word for it. Everything fails all the time. Production systems, development environments, pipelines, things that we build, things that we buy, the systems that we rely on to know that production is down, aka our monitoring systems. Even us as human beings, we crash and we need to sleep and then restart ourselves. Everything fails all the time. And that's exactly it. Incidents are mayhem by nature.
But if we have this fact, if we know that failures are a given because everything fails all the time, then we can't stay in an ad hoc manner of putting out fires. We can say, OK, this happened, but I'm prepared to deal with it. So this is the whole idea of having a structured process, and a structured process will help lead to incident prevention, improved mean time to resolution, cost reduction because downtime was reduced or eliminated entirely, and preserving our business, customers, and reputation.
And how do we do that? We have a structured process that we can follow here, five pillars, and we'll go over each pillar and I will show you questions that you can ask and answer in each pillar in order to progress to the next one. But beforehand, I want to show you two types of people that I met throughout my journey managing production incidents. First of all, the one that says, keep calm, I'm an engineer. And the other type that says, I can't keep calm, I'm an engineer. These are the two types that I met. And I say that you can keep calm if you ask yourselves the questions I'm going to share with you in each pillar, and then it will help you progress towards the next phase, up until the incident resolution.
3. Identify and Categorize
To effectively handle incidents, you need to understand the problem's extent and its business impact. Determine if the issue can wait or needs immediate attention. Ensure you receive alerts from proper channels. Escalate if necessary.
So let's see. First pillar, identify and categorize. First question is, do I understand the full extent of the problem and the business impact? If so, great. Let's dive in and go to the next phase. And if not, I need to gather more information, because the fact that the alert is there doesn't mean that you need to handle it at the severity it shows. Maybe the alert has an incorrect severity. So if you're not sure what the impact of the incident or the issue or the page that you got is, make sure you find out, because then you will understand the business impact of that issue. Second question: can this issue, incident, whatever it is, wait and be handled in business hours? If you're not sure, ask. Use the information that you got and escalate if needed, which is our next phase. And check how you got to know about this issue. Was I notified about this issue from the proper or expected channels? If I got it from an alert from PagerDuty or Opsgenie, great. If I got it from a user complaint, this is bad. So if I did get it from the proper channels, great. If not, add a note to self to fix it, aka create a Jira ticket in order to make sure we have an alert for that.
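The questions in this first pillar can be sketched as a small decision function. This is only an illustration of the flow described above, not a tool from the talk: the `Page` fields, the `triage` function, and the action strings are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """A page received while on call (all fields are illustrative assumptions)."""
    summary: str
    impact_understood: bool   # do I understand the extent and business impact?
    revenue_at_risk: bool
    customers_at_risk: bool
    data_at_risk: bool
    reputation_at_risk: bool
    source: str               # "alert" (expected channel) or e.g. "user_complaint"

    def is_incident(self) -> bool:
        # A page is an incident when there is a loss or potential loss
        # in revenue, customers, data, or reputation.
        return any([self.revenue_at_risk, self.customers_at_risk,
                    self.data_at_risk, self.reputation_at_risk])


def triage(page: Page) -> list:
    """Return the next actions suggested by the first-pillar questions."""
    actions = []
    if not page.impact_understood:
        actions.append("gather more information")
    elif page.is_incident():
        actions.append("handle now")
    else:
        actions.append("defer to business hours")
    if page.source != "alert":
        # Learned about it from an unexpected channel: note to self.
        actions.append("ticket: add missing alert")
    return actions


page = Page("checkout errors", True, True, False, False, False, "user_complaint")
print(triage(page))
```

The sketch keeps the "how did I find out" check separate from the severity decision, because a missing alert is worth a ticket whether or not the page turned out to be an incident.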
4. Notify, Escalate, Investigate
During an incident, notify the relevant teams and stakeholders. Determine if escalation is necessary for timely resolution. Investigate and diagnose the incident, focusing on relevant information. Escalate if needed to avoid breaching SLAs.
Next pillar is notify and escalate. Who should be notified about this incident? We have these two paths here, during the incident and in general. During the incident, you need to decide based on incident importance. If, for example, we need to alert the support or customer success teams because they need to communicate the issue to customers, we need to know about it and we need to act upon it. And in general, maybe we have other teams or key focal points that rely on our system. So we have a system that is being compromised in this incident, and other teams' flows don't work because that system doesn't work. If that is the case, we need to alert those people and say, hey, this system doesn't work at the moment. We will notify you once everything goes back to normal.
And the next question is, does this incident need escalation? First of all, to other teams, so they can help me resolve the issue. And, as I said, as an FYI to support or customer-facing teams. So if the incident requires escalation, this is the time to do it in order to not waste more time, because maybe we currently have downtime and we want to make sure the issue is resolved as soon as possible.
Third pillar, investigate and diagnose. What information is relevant towards the incident resolution? You need to focus on what's important and relevant right now, because focusing on the non-relevant will throw you off course and make you lose valuable time while debugging. And also remember, a system flow usually comprises a lot of moving parts, and you need to focus on the relevant phase for debugging and escalation. So if you escalate, you need to tell them, hey, my system tries to get to your system through port X. It doesn't work. Please help me fix it. Don't describe the entire flow of the system. Nobody cares about it. Just tell them what is currently not working and help them get focused on what they need to check. Okay. I had this information. I troubleshot the issue. Great. Now, after I did some debugging, did I find the root cause and do I understand the root cause? If so, great. We can progress to the next phase. If not, investigate more and escalate if it takes long. Why escalate at that point? Because, again, business mindset. We want to avoid breaching SLAs.
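Two ideas from this pillar lend themselves to a short sketch: escalating when the investigation is eating into the SLA, and keeping the escalation message focused on the one thing that is not working. The function names, the 50% threshold, and the service names below are assumptions made for illustration; the talk does not prescribe specific numbers.

```python
def should_escalate(minutes_spent: float, sla_minutes: float,
                    root_cause_found: bool, threshold: float = 0.5) -> bool:
    """Escalate when no root cause is found and a share of the SLA is used up.

    The 0.5 threshold is an arbitrary illustrative default, not a rule
    from the talk.
    """
    if root_cause_found:
        return False
    return minutes_spent >= sla_minutes * threshold


def escalation_message(source: str, target: str, port: int, symptom: str) -> str:
    """Build a focused message: what fails and where, not the whole system flow."""
    return (f"{source} cannot reach {target} on port {port}: {symptom}. "
            f"Please check from your side.")


if should_escalate(minutes_spent=35, sla_minutes=60, root_cause_found=False):
    # "billing-api" and "payments-db" are made-up service names.
    print(escalation_message("billing-api", "payments-db", 5432,
                             "connection times out"))
```

The message builder deliberately takes only the failing hop as input, which mirrors the advice to tell the other team what to check instead of describing the entire flow.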
5. Root Causes, Remediation, Closure
Prioritize root causes over symptoms. Choose the fastest solution to eliminate downtime without compromising system health and stability. Check for action items after resolving the issue. Notify relevant parties upon incident closure.
So that's the reason why to do it. And also we need to prioritize root causes over surface-level symptoms. If you have an alert of service on a server stopped, and, okay, you can go and start the service, or you can understand and investigate why the service got stopped in the first place, because that way you could potentially expose an underlying issue and we all want to have a stable system. So just starting the service wouldn't do it. You need to make sure that you know why it got stopped in the first place. So that's about that.
And remember that if we find the root cause, possible remediation steps can be determined, which leads me to the next phase in a second. So we found the root cause. Now we have possible remediation steps that we can take. Which possible remediation step is the best one to take? We need to choose the fastest solution to eliminate downtime without compromising system health and stability. Why is that? And why should it be fast? First of all, because if it's in the middle of the night, we want to go back to sleep. But also, of course, for the sake of the business, we want to have the service up and running as soon as possible. So that's about that. Once we have decided on the remediation step that we need to take, check if there are any action items that need to be done after resolving the issue. So, for example, if it was in the middle of the night and the remediation step that was taken was to do a patch, because you're not going to do a full-blown solution in the middle of the night. And everyone is aware of that, and that's good. But if you did a patch, permanently fix it and make sure it is permanently fixed during business hours. And why is that? Because we want to prevent recurring issues from happening over and over. For you not to wake up, of course, but also, again, for the system to be stable. We care about the system. We want the system to be stable and healthy. So we want to prevent any recurring issues. So if we did a patch, let's permanently fix the issue. But this is one example. If you have any action items needed after resolving the issue, this is the time to do them.
And upon incident closure, what needs to be done? Do I need to notify anyone about the incident resolution? We need to be end-to-end communicators. So if, in the notify and escalate pillar, we notified some people, we need to notify the same people and say, OK, I think the issue should now be resolved. Please check from your end that everything looks OK. Also, if the incident was critical, notify the customers.
6. Communication, Alerts, Incident Run Book
Ensure incident resolution is communicated and verify if it is fully resolved. Check and tweak alerts as needed. Have an up-to-date incident run book for systems handled by others.
And it would be very good to know that, because we think it's solved, but maybe only people from Germany are not able to use the system. You never know. So you notify the people that, OK, everything should be resolved, but please let us know if anything doesn't work. Then you are an end-to-end communicator, not only when things don't work, but also when things get back to work. And it will also help you understand whether the incident really got fully resolved.
Check the alerts. Were the alerts OK? Or do they need to be tweaked, because maybe you need to change an alert's severity or you need to fix false positives? So tweak them in any way that might be needed. Check the relevant incident run book. Just to have a baseline here: what is an incident run book? Sometimes you have issues or procedures that require some judgment. Like if an incident takes place, I check the logs, of course. But if the logs state X, then I do this. But if they state something else, maybe I need to do something else. Or maybe I need to consult with someone. So every time we have, let's say, something that requires judgment, we should have a relevant incident run book that will help us exercise this judgment.
Especially if not everyone knows everything about all the systems. Sometimes the system was implemented by a team member, for example. So in order for you to handle issues on that system, because you don't deal with and manage the system in the day-to-day like your team member does, you should have an incident run book that will help you resolve any issues on that system. So make sure that a relevant incident run book is in place. And make sure it's not outdated. You want to make sure it is up to date. Because I had times where I followed an incident run book up to about its halfway point, and from there on it wasn't up to date. And I had to go to people and ask them what's next because it wasn't up to date. And, of course, I updated it afterwards. But this was in the middle of the night, with an incident run book that was not up to date.
7. Incident Management Best Practices
Avoid waking people unnecessarily, keep incident resolution fast, and prevent future incidents. Check and update incident run books. Handle preventable issues during business hours. Consider postmortem for learning and improvement.
It's not great to go and wake people up just because of that. And it also makes resolving the incident slower. And we all want to have a faster resolution for the sake of the business. So check the incident run books: that you have run books, and that they are up to date.
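As a rough illustration of what a run book encodes, here is a minimal sketch of the "if the logs state X, do this, otherwise consult someone" judgment, together with a staleness check. The data layout, the log patterns, the actions, and the 90-day review window are all hypothetical; real run books are usually documents, and this only models the decision they capture.

```python
from datetime import date

# A hypothetical run book for one system: ordered (log pattern, action) pairs,
# a fallback for anything unexpected, and the date it was last reviewed.
RUNBOOK = {
    "last_reviewed": date(2024, 1, 15),
    "steps": [
        ("connection refused", "restart the proxy service, then re-check the logs"),
        ("disk full", "clean old files in the application's temp directory"),
    ],
    "fallback": "consult the system owner",
}


def next_action(runbook: dict, log_line: str) -> str:
    """Pick the action whose pattern appears in the log line, else the fallback."""
    lowered = log_line.lower()
    for pattern, action in runbook["steps"]:
        if pattern in lowered:
            return action
    return runbook["fallback"]


def is_stale(runbook: dict, today: date, max_age_days: int = 90) -> bool:
    """Flag a run book that has not been reviewed within max_age_days."""
    return (today - runbook["last_reviewed"]).days > max_age_days


print(next_action(RUNBOOK, "ERROR: Connection refused by upstream"))
print(is_stale(RUNBOOK, date(2024, 6, 1)))
```

Recording a review date next to the steps is one simple way to notice an outdated run book during business hours, before someone depends on it in the middle of the night.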
Think about whether you can help prevent similar incidents, or any incidents, from occurring. For example, during the incident I found out that there is no log rotation on the server. Not great. I need to set it up. So I will create a ticket and deal with it during business hours. So this is one example. But it applies to anything you came across while handling this incident that could help prevent issues from happening. You can open Jira tickets and handle them in business hours, and it will help make the system more stable.
Does this incident require a postmortem? Postmortems are the meetings that we have, usually after critical incidents. They should basically be held in a learning culture and not a blame culture. And it means that, okay, we had this incident. How can we learn from it? What did we do wrong, and how can we learn from it and improve and potentially prevent future similar incidents from happening? Or not only similar. We can learn a lot about how we handle things from this process. And we can apply it to any other incidents to come.
8. Postmortems and War Room Conduct
Consider postmortem or knowledge sharing for incidents. War room conduct is important for critical incidents involving multiple teams. Avoid wasting time and focus on relevant information for resolution.
So does this incident require a postmortem? If so, great. Jot down the notes as soon as possible while it is still fresh in your mind. And that way you will have a more efficient postmortem meeting. Because once we have all the details, we can discuss them more thoroughly. And even if you don't have to do a postmortem meeting, still share the knowledge through a runbook or through a daily brief. And then I'm sure it will help anyone learn more about what happened and learn from your line of thought. It's a win-win situation.
So that was about an incident structure and how you can manage incidents, any incidents, if you follow this structure. And I want to cover some bits here about war room conduct. War room is basically when you have a critical incident that requires more than, I would say, four or five people to handle this incident. So people come from other teams or cross-functional teams. And this is what we call a war room. And there should be a conduct for that as well.
9. War Room Conduct and Incident Management
Joined the war room, observed people sharing non-relevant information, took the role of incident manager. Identified the need for a runbook to start the application properly.
So a war room was created through Zoom. And I was very new at the company, like a month in, something like that. And I joined the war room, the Zoom one. I muted myself, just wanted to be a fly on the wall and learn, because I knew that I would learn from whatever was happening there. So I joined.
And then I hear people discussing, a lot of people. One pulls in this direction and another pulls in that direction. And everyone is sharing non-relevant information. This is why I told you before, focus on the relevant information, because I've seen it happen.
You see people not focusing on the relevant information, and it wastes time. Not only time, it also takes the focus away from what is important. So a lot of people just talked and talked and talked, each pulling in their own direction. And nothing progressed towards resolution. And I checked the clock and it was like 10 or 12 minutes in and nothing was happening. Just people talking and that's it.
So I unmuted myself. I said, hi, I'm Hila, for those of you who don't know me, because I was new at the company. And I said, let me try and make some order here. Okay? And basically I took upon myself the role of being an incident manager. And so one of the things that I did: I heard someone say that once the issue gets resolved, the application needs to get started in a certain way.
Because if not, it will create other issues with the database and stuff. And then I asked this person, do we have a runbook to start the application in that order? And he said, no. And then I'm like, okay, you sit down and write a runbook. And why? Because it was a critical incident. We didn't know when the incident would get resolved. It could be in an hour, two hours, in the middle of the night. And if it happens and this person is not available, we need someone else to start the application properly. And we don't want him to be a bottleneck and a single point of failure, the only one who knows how to start the application properly. So I told him, you create the runbook.
10. Roles and Responsibilities of Incident Managers
Divided the work, told people what to do, reduced involvement if it doesn't serve the purpose. Incident managers should be calm and collected.
And that way, if you're not available, I don't care. Anyone else can start the application properly whenever the incident gets resolved. So this is one example of what I did. But I did other things. Like I told people, you check this, you check that. And basically what I did was to divide the work and tell people what to do. Incident managers should be calm and collected and see things clearly. And most importantly, not be afraid to reduce people's involvement if it doesn't serve the purpose. Because if a war room has too many people, it can get very noisy. And especially during office hours, when you sit at your computer and then there are the people that just stand over you, like, what are you doing? And some people get stressed by that. So if you're not supposed to be there, I will say, you help with ABC. You finished with it? Okay, thank you so much. We will call you if we need anything else. In the meantime, please go away. So that's about it.
11. Proactive Measures and Continuous Improvement
Create on-call shifts handoffs, do a post-mortem and retrospective, create new tasks, modify alerts, update incident runbooks, check candidates for self-remediation.
Okay, so we covered mindset, we covered incident flow. Let's cover very quickly being proactive in the day-to-day and after an incident took place. And why do we need to come prepared? Because it doesn't matter if you're prepared or not, they will find you. And "they" are PagerDuty, or any other app.
So after the fact, what can you do? Create on-call shift handoffs. On-call shift handoffs basically mean that whenever you have a shift at work, you write a handoff of the things that happened. Like: I suppressed this false positive alert. I had a recurring issue. Alert X is waiting for dev to check it out. Write your shift summary, so your team members will benefit from it during their on-call shifts. And also for audit purposes, because it is kept in Slack, and everyone can go back to it afterwards.
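A handoff does not need tooling, but a consistent layout makes it easier to scan. The following is a minimal sketch that formats a shift summary as plain text that could be pasted into a chat channel; the function name, the category labels, and the sample values are assumptions for illustration only.

```python
def format_handoff(engineer: str, shift: str, items: dict) -> str:
    """Format an on-call shift summary as plain text, grouped by category.

    `items` maps a category label to a list of short notes.
    """
    lines = [f"On-call handoff: {engineer} ({shift})"]
    for category, notes in items.items():
        lines.append(f"{category}:")
        for note in notes:
            lines.append(f"  - {note}")
    return "\n".join(lines)


summary = format_handoff(
    "Dana",                      # hypothetical engineer name
    "2024-05-01 night",          # hypothetical shift label
    {
        "Suppressed alerts": ["false positive on alert Y"],
        "Recurring issues": ["service Z restarted twice"],
        "Waiting on others": ["Alert X is waiting for dev to check it out"],
    },
)
print(summary)
```

Grouping notes by category means the next on-call engineer can jump straight to what is still open, such as items waiting on another team.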
Post-mortem, as I mentioned before: even if there's no meeting, do a mental check. Do a retrospective with yourself and see what you could have done better. And if you have a post-mortem, write down the notes as soon as possible for a more efficient meeting. New tasks. We want to prevent the next incident from happening and stabilize the environment. So if you found anything that could help do that, create new tasks for that. Modify alerts. Fix any false positive alerts. Please don't wait for the next on-call to do it, because they will wait for the next on-call to do it. And they will wait, and then it will never happen. So please do it yourself. Incident runbooks, as I mentioned: write runbooks if you don't have them at all. If you do have them, make sure that they are up to date. Check any candidates for self-remediation. Say we have a bunch of disk space alerts, where the disk fills up to 90%. Maybe we can do things automatically to clean up the disk once in a while. So if you find any candidates for self-remediation, this is the time to do it.
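The disk space case mentioned above can be sketched with the Python standard library. This is a hedged illustration, not a production cleanup job: the 90% threshold comes from the example in the talk, while the 7-day age limit, the function names, and the idea of deleting old files from a single directory are assumptions. Anything like this should only ever point at a directory whose contents are safe to delete.

```python
import os
import shutil
import time


def usage_percent(path: str) -> float:
    """Return how full the filesystem containing `path` is, in percent."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def clean_old_files(directory: str, max_age_days: float, now: float = None) -> list:
    """Delete regular files in `directory` older than max_age_days.

    Returns the sorted names of the removed files. Subdirectories and
    symlinks are left untouched.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for entry in list(os.scandir(directory)):
        if not entry.is_file(follow_symlinks=False):
            continue
        if entry.stat(follow_symlinks=False).st_mtime < cutoff:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)


def remediate(mount_path: str, directory: str,
              threshold: float = 90.0, max_age_days: float = 7) -> list:
    """Clean up only when the disk is at or above the threshold."""
    if usage_percent(mount_path) < threshold:
        return []
    return clean_old_files(directory, max_age_days)
```

A scheduled job could call `remediate` with the mount point and a known-disposable directory, so the alert fires only when the automatic cleanup was not enough.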
12. Preparation and Proactive Mindset
Share knowledge, read on-call handoffs, be prepared, know escalation points, understand system architecture, learn application flows, be familiar with team members' tasks, be a go-to person.
And if the issue was handled, great. Share the knowledge more in depth than in the on-call handoff. Because that way everyone can learn from your line of thought.
And what can you do in your day-to-day in order to come prepared for an incident? The on-call shifts handoffs that I mentioned, read them on an ongoing basis. Why? Because production runs 24-7, not only when you are on-call. So if you want to be on top of things and get up to date, read these handoffs and be up to speed with what's going on in production. Plus, maybe you could also pitch in and help make things better by seeing other things from the side. It will help in certain scenarios.
Escalation point of contact. So you should know the needed pieces of information relevant to your realm infrastructure. But you should also know other realms as well and have the full picture. So let's say there's an issue with X. If you know that John is handling the service from the other side, then you know that you can escalate to him. So identifying service escalation points on a day-to-day basis and not only ad hoc when an incident occurs will save time and money on incident management and save someone else's hours of sleep because maybe I need to wake my team leader up to ask who is responsible for service X. So it could really help with debugging and save hours of sleep for anyone else.
Understand system architecture. Check for any weaker areas and vulnerabilities, any sensitive spots, and the blast radius, because that way you will know what is prone to fail and you will know how to go and fix it. So once you know the system architecture, it will help you very much with debugging and with solving the issues.
Learn application flows. So this is about flows between systems as opposed to the previous bullet which was about the flow and architecture of one system to know its ins and outs. So in here, learn application flows. If you know the application flows, it will help with troubleshooting because I know what needs to be checked, in which order, and it will contribute to the methodical debugging. It will also help you incorporate the business mindset because if you understand that escalation is needed, this issue is actually an incident, etc., then it will help with how to handle it.
Team members' tasks. As I mentioned before, production happens all the time and not only through your own tasks. So be familiar with what your other team members are doing and how their changes affect production, if at all. And this bullet is 100% about changes in production. Other tasks might not touch production, but deployments or changes in production definitely do. So ask about the change and its possible impact because, again, Opsgenie or PagerDuty doesn't care if you didn't make the change yourself. It will call you anyway if you are on call. So make sure you know exactly what the change was about and how to handle it.
And last but not least, be a go-to person. If you are a go-to person, you will get push notifications and decrease the need to fetch the updates on your own, because people will come to you to update you on what's going on in production. So in order to really navigate the chaos and handle production incidents more efficiently, incorporate a business mindset, make it a structured process, and be proactive. And that way you will come prepared for any incident that comes your way and hopefully prevent the next incident from happening. And remember, fewer incidents means less downtime, which means business success. And business success is eventually your success. Plus, you get to preserve much-needed hours of sleep. Thank you very much.