
Navigating the Chaos: A Holistic Approach to Incident Management

AWS Community Builder, Hashicorp Ambassador, International Public Speaker

Incident management can be challenging and throw you curveballs with unexpected issues, resulting in data loss, downtimes, and overall money & hours of sleep going to waste, BUT! There are practical things you could do to make it a smoother process and handle it better.
Remember when we were at school, and people said - "Actively listening in class guarantees 50% prep for the upcoming test"?
The same goes for being proactive at work in ways that will instantly prepare you to manage incidents better (at night or in general).
In this talk, I'll cover the proactive ways you could take and incorporate into your day-to-day routine, in order to prepare you for a smoother and more efficient incident management process.
I will also show the best practices I've refined over the years, which helped me get a clear vision of how to manage production incidents in the quickest and most efficient way possible.
Embracing the tips I'll give you will guarantee you'll not only talk the talk but also walk the walk when it comes to incident management.

This talk has been presented at DevOps.js Conf 2024, check out the latest edition of this JavaScript Conference.

FAQ

Incident management is a set of procedures and actions taken to resolve a critical incident. It involves detecting and communicating incidents, assigning responsibility for handling them, utilizing tools for investigation and response, and executing steps towards resolution.

A proactive approach helps in managing production incidents by preparing for incidents before they occur, improving the mean time to resolution, reducing costs by minimizing downtime, and preserving the business's customers and reputation.

Having a business mindset in incident management means understanding the broader impact of incidents on the business, which includes the potential loss in revenues, customers, data, and reputation. This mindset helps in prioritizing actions and decision-making processes that align with business goals.

A postmortem after a critical incident should include a detailed discussion about what went wrong, what was done to resolve the issue, lessons learned, and actions that can be taken to prevent future incidents. It's important to conduct this in a blame-free manner to focus on improvement.

The philosophy 'Everything fails all the time' reminds us that failures are inevitable in any system. Adopting this mindset in incident management prepares teams to handle failures proactively rather than reactively, minimizing downtime and reducing the impact on business operations.

Best practices for conducting a war room during a critical incident include maintaining focus on relevant information, assigning clear roles and responsibilities, reducing noise by limiting unnecessary participation, and ensuring documentation like runbooks are available and up to date for efficient handling of the incident.

On-call shift handoffs improve incident management by providing subsequent teams with a summary of incidents handled, actions taken, and any unresolved issues. This ensures continuity and preparedness, helping teams respond more effectively to incidents.

The structured process in incident management includes five pillars: identify and categorize the incident, notify and escalate to the relevant parties, investigate and diagnose the issue, determine and apply remediation steps, and review and update the incident runbooks and alerts post-incident.


Hila Fish

26 min

15 Feb, 2024


Video Summary and Transcription

This talk covers the importance of a structured process for incident management and the need for a business mindset. It outlines a five-pillar structured process and emphasizes the importance of staying calm and asking the right questions during incidents. The talk also highlights the importance of effectively identifying, categorizing, and investigating incidents, as well as prioritizing root causes and communicating incident resolutions. Additionally, it discusses the role of incident managers, proactive measures for continuous improvement, and the importance of preparation and a proactive mindset.


1. Introduction to Incident Management

Short description:

In this part of the talk, I will cover the importance of a structured process for incident management and the need for a business mindset. I will also emphasize the rule of knowing that everything fails all the time.

Hi everyone. Thank you for joining my talk about navigating the chaos, aka production incidents. When I was in high school, the common belief was that if you actively listened in class, you would have 50% of the exam prep already in your pocket. I want to show you how I adapted this belief into an actual proactive approach that will help you manage incidents more efficiently, in a more structured way, and eventually preserve much-needed hours of sleep.

But first of all, hi, my name is Hila Fish. I'm a senior DevOps engineer. I have a lot of things to say about myself, but the most important thing you need to know about me in terms of this presentation is that I have handled a lot of production incidents, both when I was on call and when I wasn't, at big corporates and at startups. I've seen a lot of things. So this is why I'm able to bring the things that I learned along the way to this presentation.

So let's cover the agenda for today. We will first of all cover the mindset that you should have in order to manage incidents efficiently. Then we will cover the incident flow, aka a structured process that you can follow in order to manage incidents efficiently. And finally, being proactive: things that you can do in your day-to-day and after an incident took place that will help you come prepared for the next incident.

So first of all, let's set a baseline here. Incident management is a set of procedures and actions taken to resolve a critical incident. It is an end-to-end process that defines how incidents are detected and communicated, who is responsible for handling them, what tools are used to investigate and respond to them, and what steps are taken towards a solution. And a thing that we really need to think about when we deal with incidents in production is that not all pages that you get through OpsGenie or PagerDuty or any other tool that you're using become an incident. When is it an incident? When you have a loss or potential loss in revenues, customers, data, and reputation.

And if we don't have an incident management process, if we are just in an ad hoc, putting-out-fires kind of approach and mindset, it means that we will potentially lose valuable data, downtime could lead to reduced productivity and revenues, and the business could be held in breach of service level agreements. Every company has its own SLAs and we want to avoid breaching those SLAs. So it means that we need to avoid the ad hoc manner of, okay, something happened, now we need to put out this fire. We need to have a structured process towards resolving incidents.

And how do we do that? First of all, by reframing our perspective: we need to have a business mindset. Meaning that whenever you deal with something, whenever you implement something at work, whenever you do anything at work, you need to think not only about the systems that you are incorporating and implementing, but also about the why. Why are we doing things in a certain way? Why do we have this system? How does it help us do things and how does it help the business succeed? So a business mindset is needed in order to grasp the overall impact of incidents and mitigate damages accordingly. And this is why it has to be a structured process. This is where you will incorporate the business mindset and make sure that things are handled as quickly as possible for the sake of the business.

And in order to really, let's say, manage incidents efficiently, the number one rule of managing incidents and to be a better engineer in general is to know that everything fails all the time. Who said this, by the way? You would ask. First of all, me. I said that a lot throughout my career.

2. Structured Process and Types of People

Short description:

Incidents are mayhem by nature. Having a structured process will help prevent incidents, improve mean time to resolution, reduce costs, and preserve the business and reputation. We will follow a structured process with five pillars. I will share questions to ask and answer in each pillar to progress towards incident resolution. Two types of people in incident management: those who stay calm and those who can't. Asking the right questions will help you stay calm and progress through each phase.

But first of all, I think it is very odd to quote myself. And second of all, I found someone with a little bit more credit in the industry than me: AWS CTO Werner Vogels. So take his word for it. Everything fails all the time. Production systems, development environments, pipelines, things that we build, things that we buy, systems that we rely on to know that production is down, aka our monitoring systems. Even us as human beings, we crash and we need to sleep and then restart ourselves. Everything fails all the time. And that's exactly it. Incidents are mayhem by nature.

But if we accept this fact, if we know that failures are a given because everything fails all the time, then we don't have to be in an ad hoc manner of putting out fires. We can say, OK, this happened, but I'm prepared to deal with it. So this is the whole idea of having a structured process. A structured process will help lead to incident prevention, improved mean time to resolution, cost reduction because downtime was reduced or eliminated entirely, and preserving our business's customers and reputation.

And how do we do that? We have a structured process that we can follow here, with five pillars. We'll go over each pillar and I will show you questions that you can ask and answer in each pillar in order to progress to the next one.

But beforehand, I want to show you two types of people that I met in my entire journey of managing production incidents. First of all, the one that says, keep calm, I'm an engineer. And the other type that says, I can't keep calm, I'm an engineer. These are the two types that I met. And I say that you can keep calm if you ask yourselves the questions I'm going to share with you in each pillar. It will help you progress towards the next phase, up until the incident resolution.

3. Identify and Categorize

Short description:

To effectively handle incidents, you need to understand the problem's extent and its business impact. Determine if the issue can wait or needs immediate attention. Ensure you receive alerts from proper channels. Escalate if necessary.

So let's see. First pillar, identify and categorize. The first question is, do I understand the full extent of the problem and the business impact? If so, great. Let's dive in and go to the next phase. And if not, I need to gather more information, because the fact that the alert is there doesn't mean that you need to handle it at the severity it shows. Maybe the alert has the wrong severity. So if you're not sure what the impact of the incident or the issue or the page that you got is, make sure you find out, because then you will understand the business impact of that issue.

Second question: can this issue, incident, whatever it is, wait and be handled in business hours? If you're not sure, ask. Use the information that you got and escalate if needed, which is our next phase.

And check how you got to know about this issue. Was I notified about this issue through the proper or expected channels? If I got it from an alert from PagerDuty or OpsGenie, great. If I got it from a user complaint, this is bad. So if I did get it from the proper channels, great. If not, add a note to self to fix it, aka create a Jira ticket in order to make sure we have an alert for that.
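The three questions in this pillar can be written down as a small triage check. This is a minimal sketch and not something shown in the talk: the `Page` fields, the `EXPECTED_SOURCES` set, and the action strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Channels we expect to learn about problems from (an assumption for this sketch).
EXPECTED_SOURCES = {"pagerduty", "opsgenie"}


@dataclass
class Page:
    summary: str
    impact_understood: bool  # do I understand the extent and the business impact?
    can_wait: bool           # can it be handled in business hours?
    source: str              # where the page came from, e.g. "user_complaint"


def triage(page: Page) -> list[str]:
    """Return the next actions for a page, following the three questions."""
    actions = []
    if not page.impact_understood:
        actions.append("gather more information")
    elif page.can_wait:
        actions.append("handle in business hours")
    else:
        actions.append("notify and escalate")
    # Learning about an issue from an unexpected channel means an alert is missing.
    if page.source not in EXPECTED_SOURCES:
        actions.append("create ticket: add an alert for this issue")
    return actions


print(triage(Page("checkout errors", True, False, "user_complaint")))
```

Keeping the channel check separate from the severity decision mirrors the talk: a missing alert is a follow-up ticket, not a reason to delay handling the issue itself.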

4. Notify, Escalate, Investigate

Short description:

During an incident, notify the relevant teams and stakeholders. Determine if escalation is necessary for timely resolution. Investigate and diagnose the incident, focusing on relevant information. Escalate if needed to avoid breaching SLAs.

Next pillar is notify and escalate. Who should be notified about this incident? We have two paths here: during the incident and in general. During the incident, you need to decide based on the incident's importance. If, for example, we need to alert the support or customer success teams so that they can communicate the issue to customers, we need to know about it and act upon it. And in general, maybe we have other teams or key focal points that rely on our system. So we have a system that is being compromised in this incident, and other teams' flows don't work because that system doesn't work. If a flow they depend on doesn't work, we need to alert those people and say, hey, this system doesn't work at the moment. We will notify you once everything goes back to normal.

And next question is, does this incident need escalation? First of all, for other teams to help me resolve the issue. And as I said, FYI, support or customer facing teams. So if the incident requires escalation, this is the time to do it in order to not waste more time because maybe we have currently downtime and we want to make sure the issue is resolved as soon as possible.

Third pillar, investigate and diagnose. What information is relevant towards the incident resolution? You need to focus on what's important and relevant right now, because focusing on the non-relevant will throw you off course and make you lose valuable time debugging. And also remember, a system flow usually comprises a lot of moving parts, and you need to focus on the relevant phase for debugging and escalation. So if you escalate, you need to tell them, hey, my system tries to get to your system through port X. It doesn't work. Please help me fix it. Don't describe the entire flow of the system. Nobody cares about it. Just tell them what is currently not working and help them get focused on what they need to check.

Okay. I had this information. I troubleshot the issue. Great. Now, after I did some debugging, did I find the root cause and do I understand the root cause? If so, great. We can progress to the next phase. If not, investigate more and escalate if it takes long. Why escalate at that point? Because, again, business mindset. We want to avoid breaching SLAs.
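One way to make "escalate if it takes long" concrete is to tie it to the SLA window. The sketch below is an assumption for illustration only: the talk does not give a threshold, and `should_escalate` and its `budget_fraction` default are hypothetical.

```python
def should_escalate(minutes_spent: float, sla_minutes: float,
                    root_cause_found: bool, budget_fraction: float = 0.5) -> bool:
    """Escalate when no root cause is known and a set share of the SLA window is used.

    budget_fraction is a made-up default; each team would pick its own.
    """
    if root_cause_found:
        return False
    return minutes_spent >= sla_minutes * budget_fraction


# 30 minutes into a 60-minute SLA window with no root cause yet.
print(should_escalate(minutes_spent=30, sla_minutes=60, root_cause_found=False))
```

Deciding the threshold ahead of time means that in the middle of the night the question is a simple check, not a judgment call.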

5. Root Causes, Remediation, Closure

Short description:

Prioritize root causes over symptoms. Choose the fastest solution to eliminate downtime without compromising system health and stability. Check for action items after resolving the issue. Notify relevant parties upon incident closure.

So that's the reason to do it. And also, we need to prioritize root causes over surface-level symptoms. Say you have an alert that a service on a server stopped. Okay, you can go and start the service, or you can investigate and understand why the service stopped in the first place, because that way you could potentially expose an underlying issue, and we all want to have a stable system. So just starting the service wouldn't do it. You need to make sure that you know why it stopped in the first place. So that's about that.

And remember that if we find the root cause, possible remediation steps can be determined, which leads me to the next phase. So we found the root cause. Now we have possible remediation steps that we can take. Which possible remediation step is the best one to take? We need to choose the fastest solution to eliminate downtime without compromising system health and stability. Why is that? And why should it be fast? First of all, because if it's in the middle of the night, we want to go back to sleep. But also, of course, for the sake of the business, we want to have the service up and running as soon as possible. So that's about that.

Once we have decided on the remediation step that we need to take, check if there are any action items that need to be done after resolving the issue. So, for example, if it was in the middle of the night and the remediation step that was taken was a patch, because you're not going to do a full-blown solution in the middle of the night, everyone is aware of that, and that's good. But if you did a patch, permanently fix it during business hours and make sure it stays fixed. And why is that? Because we want to prevent recurring issues from happening over and over. For you not to wake up, of course, but also, again, for the system to be stable. We care about the system. We want the system to be stable and healthy. So if we did a patch, let's permanently fix the issue. But this is one example. If you have any action items needed after resolving the issue, this is the time to do them.

And upon incident closure, what needs to be done? Do I need to notify anyone about the incident resolution? We need to be end-to-end communicators. So if, in the notify and escalate pillar, we notified some people, we need to notify the same people and say, OK, I think the issue should now be resolved. Please check from your end that everything looks OK. Also, if the incident was critical, notify the customers.

6. Communication, Alerts, Incident Runbook

Short description:

Ensure incident resolution is communicated and verify if it is fully resolved. Check and tweak alerts as needed. Have an up-to-date incident runbook for systems handled by others.

It's all very good that we think it's solved, but maybe only people from Germany are still not able to use the system. You never know. So once you notify people that everything should be resolved, and ask them to let you know if anything doesn't work, you are an end-to-end communicator, not only when things don't work but also when things are back to working. It will also help you understand whether the incident really got fully resolved.
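The idea of notifying the same people at closure can be sketched as a small record of who was told during the incident. This is an illustrative sketch only: `IncidentComms`, `closure_message`, and the party names are hypothetical, and no real paging or chat API is involved.

```python
class IncidentComms:
    """Remembers who was told about an incident so closure reaches the same people."""

    def __init__(self, critical: bool = False):
        self.critical = critical
        self.notified: list[str] = []

    def notify(self, party: str) -> None:
        # Record each party once, in the order they were first notified.
        if party not in self.notified:
            self.notified.append(party)

    def closure_recipients(self) -> list[str]:
        recipients = list(self.notified)
        # Critical incidents also get a customer-facing update.
        if self.critical and "customers" not in recipients:
            recipients.append("customers")
        return recipients


def closure_message(party: str) -> str:
    return (f"To {party}: the issue should now be resolved. "
            "Please check from your end and let us know if anything doesn't work.")


comms = IncidentComms(critical=True)
comms.notify("support")
comms.notify("payments team")
for party in comms.closure_recipients():
    print(closure_message(party))
```

Asking each recipient to confirm from their end doubles as verification that the incident is fully resolved.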

Check the alerts. Were the alerts OK? Or do they need to be tweaked, because maybe you need to change an alert's severity or fix false positives? Tweak them in any way that might be needed.

Check the relevant incident runbook. Just to have a baseline here, what is an incident runbook? Sometimes you have issues or procedures that require some judgment. Like, if an incident takes place, I check the logs, of course. But if the logs state X, then I do this. And if they state something else, maybe I need to do something else. Or maybe I need to consult with someone. So every time we have something that requires judgment, we should have a relevant incident runbook that will help us exercise this judgment.
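The "if the logs state X, then I do this, otherwise consult someone" shape of a runbook can be captured as data. The following is a minimal sketch under stated assumptions: the log patterns and actions in `RUNBOOK_RULES` are invented examples, not taken from the talk.

```python
import re

# Each rule pairs a log pattern with the action to take (hypothetical content).
RUNBOOK_RULES = [
    (r"connection refused", "restart the upstream proxy"),
    (r"disk full|no space left", "free disk space, then restart the service"),
]
FALLBACK = "consult the service owner"


def next_action(log_text: str, rules=RUNBOOK_RULES, fallback: str = FALLBACK) -> str:
    """Return the action of the first rule whose pattern appears in the logs."""
    for pattern, action in rules:
        if re.search(pattern, log_text, re.IGNORECASE):
            return action
    # No rule matched: this is the "consult with someone" branch.
    return fallback


print(next_action("ERROR: No space left on device"))
```

An explicit fallback keeps the runbook honest about where its coverage ends, which is also where it most likely needs updating after the incident.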

Especially if not everyone knows everything about all the systems. Sometimes the system was implemented by a team member, for example. So in order for you to handle issues on that system, because you don't deal with and manage the system in the day-to-day like your team member does, you should have an incident runbook that will help you resolve any issues on that system. So make sure that a relevant incident runbook is in place. And make sure it's not outdated. You want to make sure it is up to date. Because I had times where I followed an incident runbook up to about halfway, and from there on it wasn't up to date. And I had to go to people and ask them what's next. And, of course, I updated it afterwards. But it's the middle of the night, and you have an incident runbook that is not up to date.

7. Incident Management Best Practices

Short description:

Avoid waking people unnecessarily, keep incident resolution fast, and prevent future incidents. Check and update incident runbooks. Handle preventable issues during business hours. Consider postmortem for learning and improvement.

It's not great to go and wake people up just because of that. And it also makes resolving the incident slower. And we all want a faster resolution for the sake of the business. So check the incident runbooks: check that you have runbooks, and that they are up to date.

Think about whether you can help prevent similar incidents, or any incidents, from occurring. For example, during the incident I found out that there is no log rotation on the server. Not great. I need to set it up. So I will create a ticket and deal with it during business hours. This is one example, but it applies to anything you came across while handling this incident that would help prevent issues from happening. You can open Jira tickets and handle them in business hours, and it will help make the system more stable.

Does this incident require a postmortem? Postmortems are the meetings that we have, usually after critical incidents. They should be held in a learning culture and not a blame culture. And it means that, okay, we had this incident. How can we learn from it? What did we do wrong, and how can we improve and potentially prevent similar incidents from happening in the future? Or not only similar. We can learn a lot about how we handle things from this process. And we can apply it to any other incidents to come.

8. Postmortems and War Room Conduct

Short description:

Consider postmortem or knowledge sharing for incidents. War room conduct is important for critical incidents involving multiple teams. Avoid wasting time and focus on relevant information for resolution.

So does this incident require a postmortem? If so, great. Jot down the notes as soon as possible while it is still fresh in your mind. And that way you will have a more efficient postmortem meeting. Because once we have all the details, we can discuss them more thoroughly. And even if you don't have to do a postmortem meeting, still share the knowledge through a runbook or through a daily brief. And then I'm sure it will help anyone learn more about what happened and learn from your line of thought. It's a win-win situation.

So that was about an incident structure and how you can manage incidents, any incidents, if you follow this structure. And I want to cover some bits here about war room conduct. War room is basically when you have a critical incident that requires more than, I would say, four or five people to handle this incident. So people come from other teams or cross-functional teams. And this is what we call a war room. And there should be a conduct for that as well.


9. War Room Conduct and Incident Management

Short description:

Joined the war room, observed people sharing non-relevant information, took the role of incident manager. Identified the need for a runbook to start the application properly.

So a war room was created through Zoom. And I was very new at the company, like a month in, something like that. And I joined the war room, the Zoom one. I muted myself. I just wanted to be a fly on the wall, because I knew that I would learn from whatever was happening there. So I joined.

And then I hear people discussing, a lot of people. One pulls in this direction and another pulls in that direction. And everyone is sharing non-relevant information. This is why I told you before to focus on the relevant information, because I've seen it happen.

You see people not focusing on the relevant information, and it wastes time. Not only time, it also takes the focus away from what is important. So a lot of people just talked and talked and talked, each pulling in a different direction. And nothing progressed towards resolution. And I checked the clock and it was like 10 or 12 minutes in and nothing was happening. Just people talking and that's it.

So I unmuted myself. I said, hi, I'm Hila, for those of you who don't know me, because I was new at the company. And I said, let me try and make some order here. Okay? And basically I took upon myself the role of incident manager. And one of the things that I did: I heard someone say that once the issue gets resolved, the application needs to get started in a certain way.

Because if not, it will create other issues with the database and stuff. And then I asked this person, do we have a runbook to start the application in that order? And he said, no. And then I'm like, okay, you sit down and write a runbook. And why? Because it was a critical incident. We didn't know when the incident would get resolved. It could be in an hour, in two hours, in the middle of the night. And if that happens and this person is not available, we need someone else to start the application properly. We don't want him to be a bottleneck and a single point of failure, the only one who knows how to start the application properly. So I told him, you create the runbook.

10. Roles and Responsibilities of Incident Managers

Short description:

Divided the work, told people what to do, reduced involvement if it doesn't serve the purpose. Incident managers should be calm and collected.

And that way, if you're not available, I don't care. Anyone else can start the application properly whenever the incident gets resolved. So this is one example of what I did. But I did other things. Like I told people, you check this, you check that. And basically what I did was divide the work and tell people what to do. Incident managers should be calm and collected and see things clearly. And most importantly, they should not be afraid to reduce people's involvement if it doesn't serve the purpose. Because if a war room has too many people, it can get very noisy. Especially during office hours, when you sit at your computer and there are people who just stand over you and ask, what are you doing? And for some people, that gets them stressed. So if you're not supposed to be there, I will say, you helped with A, B, C. You're finished with it? Okay, thank you so much. We will call you if we need anything else. In the meantime, please go away. So that's about it.

11. Proactive Measures and Continuous Improvement

Short description:

Create on-call shift handoffs, do a post-mortem and retrospective, create new tasks, modify alerts, update incident runbooks, check candidates for self-remediation.

Okay, so we covered mindset, we covered incident flow. Let's cover very quickly being proactive in the day-to-day and after an incident took place. And why do we need to come prepared? Because it doesn't matter if you're prepared or not, they will find you. And "they" are PagerDuty, or any other app.

So after the fact, what can you do? Create on-call shift handoffs. Basically, whenever you have an on-call shift at work, write a handoff covering the things that happened. Like: I suppressed this false positive alert. I had a recurring issue. Alert X is waiting for dev to check it out. Write your shift summary so your team members will benefit from it during their on-call shifts. And also for audit purposes, because it is kept in Slack, and everyone can go back to it afterwards.
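A handoff like the one described can follow a fixed shape so nothing gets left out. This is a minimal sketch: the `ShiftHandoff` class and its three section names are hypothetical, based on the examples in the talk, and posting the message to Slack is left out.

```python
from dataclasses import dataclass, field


@dataclass
class ShiftHandoff:
    engineer: str
    suppressed_alerts: list[str] = field(default_factory=list)
    recurring_issues: list[str] = field(default_factory=list)
    waiting_on_others: list[str] = field(default_factory=list)

    def to_message(self) -> str:
        """Render the handoff as plain text, writing 'none' for empty sections."""
        lines = [f"On-call handoff from {self.engineer}"]
        sections = [
            ("Suppressed alerts", self.suppressed_alerts),
            ("Recurring issues", self.recurring_issues),
            ("Waiting on others", self.waiting_on_others),
        ]
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in items or ["none"])
        return "\n".join(lines)


handoff = ShiftHandoff(
    engineer="alex",
    suppressed_alerts=["false positive on alert X"],
    waiting_on_others=["alert Y is waiting for dev to check it out"],
)
print(handoff.to_message())
```

Printing empty sections as "none" tells the next on-call that the section was considered, not forgotten.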

Post-mortem. As I mentioned before, even if there's no meeting, do a mental check. Do a retrospective with yourself and see what you could have done better. And if you have a post-mortem, write down the notes as soon as possible for a more efficient meeting.

New tasks. We want to prevent the next incident from happening and stabilize the environment. So if you found anything that could help do that, create new tasks for it.

Modify alerts. Fix any false positive alerts. Please don't wait for the next on-call to do it, because they will wait for the next on-call to do it. And they will wait, and then it will never happen. So please do it yourself.

Incident runbooks. As I mentioned, write runbooks if you don't have them at all. If you do have them, make sure that they are up to date.

Check any candidates for self-remediation. Say we have a bunch of disk space alerts. The disk fills up to 90%. Maybe we can do things automatically to clean up the disk once in a while. So if you find any candidates for self-remediation, this is the time to do it.
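The disk space example could look something like the sketch below. It is only an illustration under assumptions: the 90% threshold comes from the talk, but the function names, the seven-day age limit, and the choice to delete old files in a single directory are hypothetical. A real cleanup would need to be agreed on per system.

```python
import os
import shutil
import time


def disk_used_percent(path: str) -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def remove_old_files(directory: str, max_age_seconds: float) -> list[str]:
    """Delete regular files in `directory` older than the age limit; return their names."""
    cutoff = time.time() - max_age_seconds
    with os.scandir(directory) as it:
        entries = list(it)  # snapshot first so deletion doesn't disturb iteration
    removed = []
    for entry in entries:
        is_file = entry.is_file(follow_symlinks=False)
        if is_file and entry.stat(follow_symlinks=False).st_mtime < cutoff:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)


def self_remediate(directory: str, threshold: float = 90.0,
                   max_age_seconds: float = 7 * 24 * 3600,
                   used_percent=disk_used_percent) -> list[str]:
    """Clean up only when disk usage is at or above the threshold."""
    if used_percent(directory) < threshold:
        return []
    return remove_old_files(directory, max_age_seconds)
```

Passing `used_percent` as a parameter makes the threshold logic testable without actually filling a disk, and returning the removed file names leaves a trail for the next on-call handoff.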

12. Preparation and Proactive Mindset

Short description:

Share knowledge, read on-call handoffs, be prepared, know escalation points, understand system architecture, learn application flows, be familiar with team member tasks, be a go-to person.

And if the issue was handled, great. Share the knowledge more in depth than in the on-call handoff. Because that way everyone can learn from your line of thought.

And what can you do in your day-to-day in order to come prepared for an incident? The on-call shift handoffs that I mentioned: read them on an ongoing basis. Why? Because production runs 24/7, not only when you are on call. So if you want to be on top of things, read these handoffs and stay up to speed with what's going on in production. Plus, maybe you could also pitch in and help make things better by seeing things from the side. It will help in certain scenarios.

Escalation points of contact. You should know the pieces of information relevant to your own realm of the infrastructure. But you should also know other realms and have the full picture. So let's say there's an issue with X. If you know that John is handling the service from the other side, then you know that you can escalate to him. Identifying service escalation points on a day-to-day basis, and not only ad hoc when an incident occurs, will save time and money on incident management. It will also save someone else's hours of sleep, because otherwise maybe I need to wake my team leader up to ask who is responsible for service X. So it could really help with debugging and save hours of sleep for everyone else.

Understand system architecture. Check for any weaker areas and vulnerabilities, any sensitive areas, and the blast radius scope, because that way you will know what is prone to fail and you will know where to go to fix it. So once you know the system architecture, it will help you very much with debugging and solving issues.

Learn application flows. So this is about flows between systems as opposed to the previous bullet which was about the flow and architecture of one system to know its ins and outs. So in here, learn application flows. If you know the application flows, it will help with troubleshooting because I know what needs to be checked, in which order, and it will contribute to the methodical debugging. It will also help you incorporate the business mindset because if you understand that escalation is needed, this issue is actually an incident, etc., then it will help with how to handle it.

Team member tasks. As I mentioned before, production changes all the time, and not only through your own tasks. So be familiar with what your other team members are doing and how their changes affect production, if at all. This bullet is specifically about changes in production: other tasks might not touch production, but deployments or changes in production definitely do. So ask about the change and its possible impact because, again, OpsGenie or PagerDuty doesn't care if you didn't do the change yourself. It will call you anyway if you are on call. So make sure you know exactly what the change was about and how to handle it.

And last but not least, be a go-to person. If you are a go-to person, you will get push notifications and have less need to fetch updates on your own, because people will come to you to update you on what's going on in production. So in order to really navigate the chaos and handle production incidents more efficiently, incorporate a business mindset, make it a structured process, and be proactive. That way you will come prepared to any incident that crosses your path and hopefully prevent the next one from happening. And remember, fewer incidents means less downtime, which means business success. And business success is eventually your success. Plus, you get to preserve much-needed hours of sleep. Thank you very much.

