Debugging a Non Reproducible Crash

POV: Your app has a crash affecting thousands of users, but for the life of you, you can't reproduce it and have no idea what's causing it. Hear the story of an epic struggle to vanquish a non reproducible bug and learn what to do (and what not to do) when facing such a foe.

This talk was presented at React Advanced 2021.

FAQ

The speaker of the story is Alex, a tech lead at BAM.

The app was preparing for an October 11th live event.

The team used Sentry as their crash reporting tool.

The exception causing the app to crash was a JSApplicationIllegalArgumentException: an error while updating a style property in a shadow node of a React Native component.

The crash rate of the app was about 10%.

The two native dependencies upgraded were React Native SVG and native navigation.

The team attempted to identify the cause of the crash by analyzing the stack trace, using debugger tools, and testing the app on multiple devices.

The ultimate cause of the crash was a race condition in the React Native SVG library, which was using a function from the main thread that was not thread-safe.

The team learned the importance of using crash reporting tools extensively, monitoring release health, rolling out updates to a subset of users, and digging deeper into the source code to understand and fix bugs.

The team decided to roll out updates to 10% of users first to monitor the release health and protect the majority of users from potential crashes.

Alexandre Moureaux

20 min

25 Oct, 2021

Video Summary and Transcription

The Talk discusses a vicious bug that caused 20,000 crashes in a React Native app on Android. The crash reports included an ArrayIndexOutOfBoundsException in the SimplePool class. The team used a debugger to analyze the bug and discovered a race condition caused by an upgrade to React Native SVG. They collaborated with React Native contributors to fix the issue and deployed a patched version. The Talk emphasizes the importance of using a crash reporting tool, monitoring release health, and learning from bugs and source code analysis.

1. The Story of the Vicious Bug

Short description:

Today, I'm going to tell you a story, the story of a bug and our fight against this bug. A bug so vicious and cruel that it actually caused us no less than 20,000 crashes. Our crash rate goes up significantly and our crash reporting tool is reporting an exception every minute. It's a JSApplicationIllegalArgumentException: an error while updating a style property in a shadow node of a React Native component. It happens for every user, and all Android devices are affected.

Today, I'm going to tell you a story, the story of a bug and our fight against this bug. A bug so vicious and cruel that it actually caused us no less than 20,000 crashes. But introductions first.

Hi, everyone. I'm Alex. I'm very excited to be here at React Advanced London. I'm a tech lead at BAM. We're based in Paris and we develop mobile apps in Flutter, Native, and of course React Native. And our story begins in October. And we're a team of nine people, and we're very happy and proud to release version 4.3 of our app. Why are we so happy and proud? Well, because actually we were getting ready for our October 11th live event that the app was covering, and we were adding a lot of essential features to the app. Super. We're super happy.

But then, the unexpected occurs. Suddenly, our crash rate actually goes up significantly. And actually, our crash reporting tool that we're using, Sentry, is under heavy fire. It's reporting an exception every minute, then a lot of exceptions every minute. Then, it's basically an exception every second, and it's getting overwhelming. And all of those exceptions are a bit different, but they all kind of have the same shape. They're like this. Basically, it's a JSApplicationIllegalArgumentException: an error while updating a style property in a shadow node whose type is a React Native component.

And so, well, first thought is like, well, you know, we did QA this release, we did test it out a lot. Why did we not see this happening? And also, if you search a bit more about this error, this tends to happen if you set a wrong value to a style. For example, if I set paddingTop to NaN, not a number, this is what would occur. So it kind of sounds like something quite easy to detect. So, like, well, maybe it happens only in certain extreme cases that we have not tested properly before. But it turns out that Sentry is basically reporting that it happens for every user, on every Android device. So this is an Android issue only, but all Android devices are affected. And also, in our app you can actually favorite a team, for example, to change the experience of the app a bit. But whichever team you're favoriting doesn't impact this. You're getting the crash.

2. Analyzing the Crash and Reproduction Attempts

Short description:

We have a big crash on startup that affects any device and user. We couldn't reproduce it, so we analyzed the stack trace and found an ArrayIndexOutOfBoundsException in the SimplePool class. Rolling back the release was not an option, as it had high user value. With a 10% crash rate, we attempted to reproduce the issue on multiple devices but didn't get any crashes.

All right. Well, we have a big crash, we have a big fire to put out, so let's start by trying to reproduce the crash, right? So fortunately we configured Sentry, our crash reporting tool, to tell us what the user was doing before triggering the crash. So here we see that the user is actually opening the app, starting the first screen of the app, which is called Home. And boom, actually it crashes instantly.

All right, so basically you're telling me that it affects any device, it crashes on startup, it affects any user and we can't reproduce it? We've never seen it before, how is that even possible?

All right, well I guess step two, if you can't really reproduce, is analyzing the stack trace. So let's take a look. Okay, I did say that we have several different errors. I guess, let's take a look at the first one. So this one is an ArrayIndexOutOfBoundsException. It's a Java error. And it's happening in the class called SimplePool, and it's a class from the Android v4 support library. And it's happening in SimplePool.release, line 116 of Pools.java. And well, to be honest, at this point I'm like, I don't even know what SimplePool is. And I don't even know why I'm even in the Android source code. Like there's a big fire to put out and it feels like it's going to take a lot of time to actually figure out what's going on because I don't really understand this. So I guess let's find an easier solution to put out the fire.

So one idea would be, well, could we just roll back our release? Well, if you're a mobile app developer you know that we can't actually really roll back the release. We actually have to deploy a new release with the old code. It's kind of annoying and it means that the users will get an update of the app just reverting everything. And at this point in time, we actually know that our crash rate is about 10%. So it seems that basically a user opening the app has a one in 10 chance of crashing the app. But it seems that whenever they try to restart it, it works. And also, this release has actually great value for users. It turned out to be one of the highest-rated releases despite this outstanding crash. So we thought, well, no, let's not roll back. It's not the end of the world. It's outrageously big to have a 10% crash rate, but let's try to fix it in another way.

All right, we know that the crash rate is 10%, so I'm like, okay, I can devise a battle plan. I'm just going to take six Android devices, I'm going to trigger with a script 10 app launches per device, so statistically I should get like five to 10 crashes, right? And at least that would be some kind of reproduction. I would be able to finally see the issue, and if I get a fix, then I would be able to test it out. The result was that I didn't get any crashes.
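As a rough sanity check on that plan, the arithmetic can be sketched as follows. This assumes every launch crashes independently with the same 10% probability, which is a simplification; the class and method names are illustrative only.

```java
// Back-of-the-envelope check of the reproduction plan.
// Assumption: every launch crashes independently with the same probability.
final class ReproductionOdds {
    // Expected number of crashes over the given number of launches.
    static double expectedCrashes(int launches, double crashRate) {
        return launches * crashRate;
    }

    // Probability of seeing no crash at all over the given number of launches.
    static double probabilityOfNoCrash(int launches, double crashRate) {
        return Math.pow(1.0 - crashRate, launches);
    }

    public static void main(String[] args) {
        int launches = 6 * 10;    // six devices, ten launches each
        double crashRate = 0.10;  // crash rate reported in production
        System.out.printf("expected crashes: %.1f%n",
                expectedCrashes(launches, crashRate));
        System.out.printf("chance of zero crashes: %.4f%n",
                probabilityOfNoCrash(launches, crashRate));
    }
}
```

Under that assumption, 60 launches should produce about six crashes, and a run with no crash at all would happen well under 1% of the time. Seeing zero crashes therefore suggests that the test setup did not match whatever conditions users were hitting in production.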

3. Investigating Native Dependencies and Testing

Short description:

Our previous release was not crashing, this release is crashing. We upgraded two native dependencies: React Native SVG and native navigation. We suspect native navigation as the culprit behind the crashes. We can roll out a new version to 10% of our users to test if the new release fixes the crash. If it doesn't, we can downgrade the SVG library. If neither fixes the crash, the repeated updates could lead to uninstalls.

None whatsoever. Quite unlucky. So okay, I guess we need to find something else. So another idea was what actually changed. Our previous release was not crashing, this release is crashing. So what did we introduce between the two releases that actually crashed the app?

So my thought at this point was maybe we should take a look at the native dependencies we upgraded. Because well, this is a Java exception, so it happens in the native code, so probably the culprit is a native dependency that we upgraded. And it turns out that we upgraded two native dependencies since the last release. First one was React Native SVG, and the second one was native navigation. So you probably don't know native navigation. It's actually a fork that we made from an Airbnb navigation library, which is using, well, native navigation. And it turns out that we ourselves added some features to improve the performance at startup, right. So it sounds like a very nice culprit, you know, we upgraded it to improve the performance at startup, we get crashes at startup. Okay, it sounds like this one should be the culprit behind our crashes.

So as you may know, in the Play Store, you can actually roll out a new version of your app to only a subset of your users. So for example, you can just roll out the new version to like 10% of your users. So that allows us to devise a new battle plan. If native navigation is actually the culprit, we can just test it out. We downgrade native navigation. We release a new version that we roll out only for 10% of our users. We check back, we should be able to see in like an hour or so if the new release is actually successful. And if it is successful, then we roll out the new release to everybody because, well, the crash is fixed, yay.

Yeah, but what if it actually doesn't fix the crash? Okay, I guess in this case, let's just downgrade the other one, the SVG library, and, well, we do the same. We roll out for 10% of our users. We check back. If success, yay, full rollout, okay, cool, we won.

But what if again that didn't fix the crash? This would mean that we updated our app twice and each time, 10% of our users got an update which actually didn't do anything and didn't fix the crash. That's actually a source of potential uninstalls, like when a user gets a lot of updates of an app but they don't do anything for him. It happens sometimes that a user actually uninstalls the app because of this. So to be honest, that plan is, yeah, it's kind of dumb.

4. Analyzing the Bug and Using the Debugger

Short description:

Our bug was an array index out of bounds exception in the SimplePool class. We tried to access an array at index mPoolSize, which was -1. The only place where mPoolSize changes is in the acquire function, and it is protected from being below zero. We decided to use the debugger to investigate further.

So, all right. I guess at this point, yeah, we need to go deeper. We really need to understand the bug and we really need to analyze it. So let's take a look again. Our bug, as you recall, was an ArrayIndexOutOfBoundsException. All right. Let's take a look at where it was happening. It was happening, as you might remember, in a class called SimplePool inside the Android v4 support library code.

And basically the bug was this. We have an array of objects called mPool and we have an index called mPoolSize, and we're trying to access this array at index mPoolSize, which apparently equals minus one. So now you don't need to be an expert Java developer to know that accessing an array at index minus one is really not a good idea. So you can understand where the crash is coming from. So the mPoolSize value is minus one, which is not good.

So the question now is, what actually can modify mPoolSize? So mPoolSize is actually modified only in this place. It's initialized to ten and then it's only being decreased in this function called acquire inside SimplePool. This is the only place where mPoolSize actually changes, in this function, and it gets decreased. But you might notice something, right? There's actually a condition there to protect it from being below zero. Here it is: if mPoolSize is over zero, then decrease mPoolSize. So it kind of sounds impossible that mPoolSize would become minus one, because if mPoolSize is zero you cannot decrease it even further. So that really sounds impossible.
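To make that reasoning concrete, here is a minimal sketch of a pool with the guarded decrement described above. It is an illustration, not the Android support library source: the class name TinyPool, its fields, and the capacity of ten are assumptions, and the real SimplePool differs in detail. In this sketch release also increases the counter, but acquire is the only place that ever decreases it.

```java
// Simplified object pool, loosely modeled on the description in the talk.
// Hypothetical names: this is not the Android support library code.
final class TinyPool<T> {
    private final Object[] pool;
    private int size; // number of instances currently stored

    TinyPool(int capacity) {
        pool = new Object[capacity];
    }

    // Takes an instance out of the pool, or returns null when it is empty.
    @SuppressWarnings("unchecked")
    T acquire() {
        if (size > 0) {              // guard: only decrement when non-empty
            int last = size - 1;
            T instance = (T) pool[last];
            pool[last] = null;
            size--;                  // on a single thread this stops at zero
            return instance;
        }
        return null;
    }

    // Puts an instance back; indexes the array with the current size.
    boolean release(T instance) {
        if (size < pool.length) {
            pool[size] = instance;   // would throw if size were ever -1
            size++;
            return true;
        }
        return false;
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        TinyPool<String> pool = new TinyPool<>(10);
        pool.release("a");
        System.out.println(pool.acquire()); // the pooled instance
        System.out.println(pool.acquire()); // empty pool: null
        System.out.println(pool.size());    // counter never went below zero
    }
}
```

Run on a single thread, calling acquire on an empty pool simply returns null and leaves the counter at zero, which is why a value of minus one looks impossible at first sight.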

5. Analyzing the Bug with the Debugger

Short description:

We used the debugger in Android Studio to analyze the acquire function, which React Native calls from DynamicFromMap.create. We discovered that DynamicFromMap.create could be called from different threads, leading to a race condition. The SVG upgrade caused the bug.

Okay, I guess it's time to bring out our ultimate weapon, and of course, I'm talking about the debugger. So, alright, let's open Android Studio and go to the famous function, acquire. In the stack traces, we noticed that this function was called from the React Native code, in the create function of a class called DynamicFromMap. We put a breakpoint there and observed that the thread reported by Android Studio was mqt_native_modules on every hit except the 34th, which was on the main thread. This gave us a clue that DynamicFromMap.create could be called from different threads. We also discovered that the property being updated was an SVG property, not a React Native style property. It turns out that the SVG upgrade caused the bug.

6. Analyzing the Bug and Fixing the Issue

Short description:

On hit 34, the main thread was used, giving a big clue. The function DynamicFromMap.create can be called from different threads. The bug was caused by the SVG upgrade, where a seemingly impossible condition occurred because the code was not thread-safe. We fixed it by collaborating with React Native contributors and deploying a patched version to 10% of our users. Takeaway: Use your crash reporting tool extensively and configure it to capture user actions.

But on hit 34, the thread that was used is the main thread. So I was actually not triggering the bug in this case, but this gave me a very big clue. This function, DynamicFromMap.create, could be called from different threads. If we take a look at hit 34, we notice that in this case the property being updated was a property called fill, and well, this really doesn't sound like a React Native style property, right? Indeed, it's actually an SVG property. So it was the SVG upgrade all along that actually caused this bug. So let's see what actually can happen.

So React Native SVG, we upgraded it to v7, and they had started using this code, DynamicFromMap.create, to improve the performance of native SVG animations. But they were using it from the main thread while React Native was actually using it from mqt_native_modules. So what can actually happen? This impossible condition, well, when you have something impossible happening in Java it's usually because of thread safety. You know, as JavaScript developers we're not really used to having multiple threads and having to deal with that, but when you do React Native you also get Java in the mix. So you get thread safety in the mix.

So here, this impossible condition, well, it could happen that two threads, thread A and thread B, reach the condition "if mPoolSize is over zero" at pretty much the same time and both think it is over zero, and then they both enter the condition at the same time, so it means that they both decrease it. It's kind of like this. Thread A sees that mPoolSize over zero is true, cool, but it doesn't have time to decrease it yet, it doesn't have time to get out of the function yet, because thread B is actually entering the condition as well and checking that mPoolSize over zero is true. And if at the beginning mPoolSize is one, then it's still one when we check the condition for thread B. And then, what happens is they actually both decrease mPoolSize, so it becomes zero and then minus one. So, wow, we actually know where this is coming from, and this is actually why it was so hard to reproduce, because this is a race condition that was very tricky to actually trigger. So let's fix.
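The interleaving described above can be replayed deterministically. In this sketch the two threads are simulated by calling the two halves of acquire by hand in the unlucky order, so no real threads are needed. The class RacyPoolReplay and its methods are hypothetical and only mirror the shape of the pool described in the talk.

```java
// Deterministic replay of the unlucky interleaving. The two "threads" are
// simulated by calling the two halves of acquire() by hand, one after the other.
// Hypothetical names: this is not the library code.
final class RacyPoolReplay {
    private final Object[] pool = new Object[10];
    private int size;

    // First half of acquire(): the guard.
    boolean check() {
        return size > 0;
    }

    // Second half of acquire(): the decrement, run once the guard has passed.
    void decrement() {
        size--;
    }

    // release() indexes the array with the current size.
    void release(Object instance) {
        if (size < pool.length) {
            pool[size] = instance;
            size++;
        }
    }

    int size() {
        return size;
    }

    // Plays A-check, B-check, A-decrement, B-decrement on a pool holding one item.
    static RacyPoolReplay replay() {
        RacyPoolReplay p = new RacyPoolReplay();
        p.release("only instance"); // size is now 1
        boolean a = p.check();      // thread A sees size > 0
        boolean b = p.check();      // thread B sees size > 0 as well
        if (a) p.decrement();       // size becomes 0
        if (b) p.decrement();       // size becomes -1
        return p;
    }

    public static void main(String[] args) {
        RacyPoolReplay p = replay();
        System.out.println("size after both decrements: " + p.size());
        try {
            p.release("next instance"); // writes to pool[-1]
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("release failed: " + e.getMessage());
        }
    }
}
```

Making the check and the decrement a single atomic step, for example by marking the pool's methods synchronized or by giving each thread its own pool, rules out this ordering.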

So when we investigated, we found that there was a pull request on React Native dealing with this, dealing with thread safety actually on DynamicFromMap.create. And so, in collaboration with the React Native core contributor that submitted the pull request and the React Native SVG maintainers, we devised a final battle plan. We patched React Native locally, we deployed this version to 10% of our users just in case, to check, and then, of course, checked back. Was it successful? Yes. Finally. We fixed it and our crash rate was back to normal. Whoo!

Alright. Well, this was fantastic. But maybe a few takeaways from this. First one is this. You should use your crash reporting tool extensively, and you should configure it to be able to use it, because you're going to get crashes in production, and probably you're going to get crashes that you can't reproduce. So you should know what the user is doing before triggering the crash. Out of the box, you're not necessarily going to have this in your crash reporting tool, so you should set it up so that it's easy to see, for example, the screens that your user is navigating to.
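The idea of knowing what the user was doing before the crash can be sketched as a small trail of recent actions, which tools such as Sentry call breadcrumbs. The helper below is a hypothetical illustration in plain Java, not the API of Sentry or any other crash reporting tool.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Keeps the last few user actions so they can be attached to a crash report.
// Hypothetical helper for illustration; not the API of any crash reporting tool.
final class BreadcrumbTrail {
    private final int capacity;
    private final Deque<String> crumbs = new ArrayDeque<>();

    BreadcrumbTrail(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be > 0");
        }
        this.capacity = capacity;
    }

    // Records one action, dropping the oldest one when the trail is full.
    synchronized void record(String action) {
        if (crumbs.size() == capacity) {
            crumbs.removeFirst();
        }
        crumbs.addLast(action);
    }

    // Snapshot, oldest first, to send along with the exception.
    synchronized List<String> snapshot() {
        return new ArrayList<>(crumbs);
    }

    public static void main(String[] args) {
        BreadcrumbTrail trail = new BreadcrumbTrail(3);
        trail.record("app opened");
        trail.record("screen: Home");
        System.out.println(trail.snapshot());
    }
}
```

In the story above, a trail like this is what showed that users were only opening the app and landing on the Home screen before the crash.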

7. User Details, Release Health, and Learning

Short description:

You should add details about the user, monitor release health, and protect your users. Rolling out releases to 10% of users allows for monitoring and minimizing impact. Digging deeper into bugs and source code provides valuable learning experiences.

You should also add as many details about the user as possible, of course, in a GDPR-friendly manner. For example, in our case, adding what teams the user was actually favoriting to change his experience, because sometimes you trigger bugs only in certain cases in your app, of course.

Then you should, of course, monitor your release health. A 10% crash rate, of course, is outstanding. It's really, really bad. A 0.2% crash rate is a bit better. The market standard is about 0.3% to 0.4% for Android. It's even lower for iOS.

And if you actually do that, it also allows you to do one thing, protect your users. And that's what we did after this. Every time we're deploying a new release, we were actually rolling it out to 10% of our users. Of course, we should never have crashes, outstanding crashes like this in those 10%, but in case it actually does happen, at least we impacted only 10% of our users. So, the rest of the users, they have no impact. And of course, that means you're able to know if the release was successful, so you're able to monitor the health of your release.

And of course, you have time between the initial rollout and, for example, in our case, we had the live event on October 11. We did the release on October 9. Not really a good idea.

The final one is this. You can actually learn a lot by digging deeper. I've never learned as many things as when I was actually going through a bug that I could not reproduce, and I dived in deeper into the source code of libraries I was using, and every time, I learned so much.

And that's it. Thank you for watching, and do hit me up if you have any questions on the Discord channel or on Twitter.
