Video Summary and Transcription
The talk discusses an incident in which a React Native release broke Android builds, and how it was fixed. The npm package had become too big, so the Android artifacts were moved to Maven Central. Dynamic versions, in particular the plus dependency used by React Native apps and libraries, turned that publication into a breakage for every project. Lessons learned include the importance of removing plus dependencies and the need for better recommendations for creating resilient libraries.
1. The Day I Broke React Native
Hi, everyone. Today I want to tell you the story of a rainy November day last year in Seattle. People started reporting broken React Native builds, and we discovered that an upcoming version of React Native was causing the issue. I, Nicola Corti, an Android engineer at Meta, will walk you through the incident and how we fixed it.
Hi, everyone. So today I want to tell you a story. It's a story of a rainy November day from last year. I was in Seattle. If you have ever been to Seattle, please make sure you check out the Starbucks Reserve Roastery. It's a special Starbucks where they do coffee tasting. If you're into coffee, you definitely want to check it out. And I was there. I was checking my email, checking my GitHub notifications. And yeah, everything looked great. But then people started messaging me that for some reason their builds, their React Native builds, were broken. And well, I was on Discord, so I thought, let me check what is actually going on.
On React Native, you run the run-android command to launch the Android app. And people started reporting error messages out of the blue, in a really terrible way. Imagine your build was working one minute ago, then you build again without making any code changes, and your build is broken. This is terrible from the developer experience point of view, and obviously it should not happen. Then we started looking into why those builds were actually breaking. And we realized that the error message mentioned an upcoming version of React Native, 0.71.0-rc.0, even for users that were on previous versions of React Native, like 0.68, 0.69 and so on. At that point, we realized, yeah, I think we have a problem. And I personally had a really big problem because I was supposed to fly back to London in a couple of hours. So okay, how do we fix it?
So my name is Nicola Corti. I work as an Android engineer in the React Native team at Meta. And today I'm excited to tell the story of the day when I broke React Native. To fully understand what this was, I will walk you through what happened, that is, what the real incident was, why it broke and how we actually fixed it. The disclaimer here is that the incident was on Android, but we effectively broke the build for everyone. Not many people here use React Native, but trust me, there are a lot of lessons learned that apply to any technology out there. So let's start from what happened. Inside the React Native team, we have this group of people called the release crew.
2. React Native Release Process
The team responsible for React Native releases launches the bump-oss-version script with the upcoming version. The first release candidate, RC0, is sent out for testing. In 0.71, the npm package became too big, leading to the move of Android artifacts to Maven Central.
They're responsible for crafting new releases of React Native every four to six months. They launch a script called bump-oss-version from the console with the upcoming version they intend to release. When a new branch is cut, they do the RC0, which is the first release candidate. The first release candidate is generally unstable and is sent out to the community for testing. In 0.71, there were a lot of changes, including RFC 508, which provided an alternative distribution channel for React Native artifacts. The npm package of React Native was getting too big, so the Android artifacts were moved to Maven Central, where Android libraries are distributed.
3. Moving Android Artifacts to Maven Central
One change that is really relevant for this incident is RFC 508, the out-of-npm-package solution for React Native artifacts. The npm package of React Native was getting too big, so we decided to move the Android artifacts from the npm package to Maven Central. The binaries had to be removed from the npm package, and that move is what set the stage for the incident.
One change which is really relevant for this incident is this RFC 508, which is the out-of-npm-package solution for React Native artifacts. Practically speaking, the npm package of React Native was getting too big. We could not fit in there anymore, and I will get into it in a minute. So we had to find an alternative place where the Android artifacts could be shipped.
When you do native, you don't just distribute code, you distribute binaries, and those binaries get quite big; they can reach hundreds of megabytes. So we decided to move the Android artifacts from the npm package to Maven Central, which is where Android libraries are generally distributed. This website, which looks quite archaic judging by the amount of CSS they use, is actually Maven Central. It's basically nothing more than an S3 bucket where Android libraries are stored. This is where React Native is actually stored, and people that build React Native get it from there. And if you look at the list of versions, you will see that we used to publish on Maven Central back in the day. The versions between 0.11 and 0.20, from around 2015 and 2016, were published there. But at some point we moved away because publishing on Maven Central was too complicated, and so we said, okay, let's just do a monolithic package and put everything inside the React Native npm package. Well, then we had to revert this decision because the package was getting too big, and we published 0.71.0-rc.0, which you see just below, in November 2022.
So let's try to understand why the incident happened in the first place, because so far it seems reasonable: we had published on Maven Central in the past, and we were going back there because the package was getting too big. So why did things break? Let's go back to this RFC and look closer inside the npm package. As I said, the android folder was the biggest offender here. We're talking about 66 megabytes, and in 0.71 we were adding debug symbols and more things to improve the Android developer experience, which made the npm package even bigger. It was getting beyond 200 megabytes. Fun fact, you can't publish npm packages which are above 220 megabytes or so, because npm will return an HTTP 413 Content Too Large. So npm was just not an option, and we had to remove those binaries from the npm package.
The underlying issue that triggered this incident is what we call a plus dependency, which is really similar to a star dependency on npm. For Android, we use Gradle to build, and inside the Gradle file we have a block where we describe our dependencies, that is, which libraries we want to depend on. One of these is React Native. This string here is called the Maven GAV coordinates. I know this is really specific to Android, but it's also quite easy to understand.
4. Understanding GAV and Dynamic Versions
GAV stands for group, artifact, and version. The default template for React Native apps includes Gradle files with interesting comments. One comment suppresses a warning for using dynamic versions, which are considered an antipattern in the Android space. Dynamic versions can lead to unreproducible builds and leave you at the mercy of library maintainers.
So GAV stands for G for group, which is the organization that is publishing a library, A for artifact, which is the name of the library, and V for version, which in this case was plus, which means: just get the highest version that you find on any repository and use it.
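To make the group, artifact and version split concrete, here is a minimal sketch in Python. It is not Gradle's own parser; the helper names parse_gav and is_dynamic are hypothetical, and the coordinate string is assumed to be the one used by the React Native app template of that time.

```python
from typing import NamedTuple


class Gav(NamedTuple):
    group: str     # organization publishing the library
    artifact: str  # name of the library
    version: str   # exact version, or "+" for "highest available"


def parse_gav(coordinates: str) -> Gav:
    """Split a 'group:artifact:version' string into its three parts."""
    parts = coordinates.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected group:artifact:version, got {coordinates!r}")
    return Gav(*parts)


def is_dynamic(gav: Gav) -> bool:
    """True when the version ends in '+', meaning it is not pinned."""
    return gav.version.endswith("+")


# Assumed coordinate string: a plus dependency on React Native.
dep = parse_gav("com.facebook.react:react-native:+")
print(dep.group, dep.artifact, dep.version, is_dynamic(dep))
```

A pinned coordinate such as com.facebook.react:react-native:0.68.5 would make is_dynamic return False, which is the reproducible case.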
So this is from the default template. When you create a new React Native app, you get a Gradle file, and in it there are comments like these ones over here, which I think are a bit interesting.
For example, there is one comment here just above that says noinspection GradleDynamicVersion. This is a suppression of a warning for the line just below. The line below contains a plus, which in the Android space is an antipattern. Specifically, there is a page in the Gradle documentation called Handling versions which change over time, so dynamic versions. And even in just the first screenful of that page, there are two warnings telling you that dynamic versions are not great because they lead to unreproducible builds. You're basically at the mercy of the library maintainer. If the library maintainer publishes a new version tomorrow, you're just going to get it, and maybe your build breaks overnight. So not great, neither on npm nor on any other platform that has a similar concept.
5. Understanding the Plus Dependency in React Native
To understand why the problem occurred, we need to look at the plus dependency in React Native. Previously, the dependency was obtained from the npm package, but the artifacts were later moved to another repository. The plus dependency retrieves the highest version available, which causes issues as soon as a higher version is published anywhere. This led projects to pick up version 0.71.0-rc.0, which broke their builds.
But why did it work? To fully understand, we need to look at this other comment on the right, which says From node_modules. The plus dependency worked until React Native 0.70 because we were actually getting the dependency from the npm package. If you look at what's inside node_modules, react-native, at the android folder, you will see that there we have a collection of artifacts: a sources.jar, a pom file and so on. This is Java stuff that is used to build the Android apps. I'm faking it a bit here, as the list is actually quite a bit longer. But regardless of that, this is the list of artifacts which is used by React Native to build an Android app, the core of React Native. And if you look on Maven Central, that's actually the same content. So we just moved that folder from node_modules to another repository. So now maybe you start understanding why the problem happened. The problem happened because a plus dependency means: get the highest. As long as you had only one version locally inside your node_modules and one version on Maven Central which was lower, in this case 0.20, things worked fine. But if I publish something with a higher version on Maven Central or any other repository, that version will prevail on every project on this planet. So people started grabbing version 0.71.0-rc.0, which is not great.
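The get-the-highest behaviour can be sketched in a few lines of Python. This is a simplified model, not Gradle's resolution engine: the version ordering only compares the numeric parts and ranks a pre-release below the matching final release, and the repository names and the exact versions (0.68.5, 0.20.1) are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def version_key(version: str) -> Tuple[Tuple[int, ...], bool]:
    """Simplified ordering: numeric parts first, then final above pre-release."""
    base, _, suffix = version.partition("-")
    numbers = tuple(int(n) for n in base.split("."))
    return numbers, suffix == ""


def resolve_plus(repositories: Dict[str, List[str]]) -> Tuple[str, str]:
    """Return (repository, version) for the highest version found anywhere."""
    candidates = [
        (version, repo)
        for repo, versions in repositories.items()
        for version in versions
    ]
    version, repo = max(candidates, key=lambda c: version_key(c[0]))
    return repo, version


# Before the release: the local copy in node_modules wins.
before = {"node_modules": ["0.68.5"], "maven_central": ["0.20.1"]}
print(resolve_plus(before))

# After the release candidate lands on Maven Central, it wins everywhere.
after = {"node_modules": ["0.68.5"], "maven_central": ["0.20.1", "0.71.0-rc.0"]}
print(resolve_plus(after))
```

Nothing changed in the project between the two calls; only the content of a remote repository did, which is why builds broke without any code change.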
6. Fixing the React Native Builds
To fix the issue, we released patches for all versions of React Native down to 0.63. This required a significant effort, as we had to work with branches from releases made years ago. We went the extra mile to provide patch versions of React Native that only included the necessary fixes. Additionally, we reached out to Sonatype, the company running Maven Central, to have the artifacts removed. Although it took some time, this was the definitive solution to the problem. From this experience, we learned the importance of removing plus dependencies.
So let's look into how we actually fixed it, how we can fix the builds for everyone out there. You might think, okay, I go inside my project and I just remove the plus, no? I specify the version that I'm using, like 0.68, and I've patched my local project. Well, that's true, but also a bit naive, because on React Native, like every Node project, we rely a lot on external dependencies. For example, one really popular library for React Native is Reanimated, an animation library. They also have a Gradle file, and inside their Gradle file they also depend on React Native. And Reanimated doesn't want to depend on a specific version of React Native; they just want to get the one that the user is using. So they also have a plus dependency in their Gradle file. That means that even if your project specifies 0.68, Reanimated would say, no, no, not 0.68, give me the highest one. So basically every library was contributing to breaking the project even further.
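Why pinning the version in the app alone was not enough can be illustrated with a small Python sketch. It assumes a highest-version-wins conflict strategy and a simplified version ordering; the requester names and versions are illustrative, and resolve_conflict is a hypothetical helper rather than a Gradle API.

```python
from typing import Dict, List, Tuple


def version_key(version: str) -> Tuple[Tuple[int, ...], bool]:
    """Simplified ordering: numeric parts first, then final above pre-release."""
    base, _, suffix = version.partition("-")
    return tuple(int(n) for n in base.split(".")), suffix == ""


def resolve_conflict(requests: Dict[str, str], available: List[str]) -> str:
    """Turn every '+' request into the highest available version,
    then let the highest requested version win for the whole build."""
    concrete = [
        max(available, key=version_key) if spec == "+" else spec
        for spec in requests.values()
    ]
    return max(concrete, key=version_key)


available = ["0.68.5", "0.71.0-rc.0"]

# The app pins its version and nothing else asks for React Native.
print(resolve_conflict({"app": "0.68.5"}, available))

# A single library with a plus dependency drags the whole build upwards.
print(resolve_conflict({"app": "0.68.5", "reanimated": "+"}, available))
```

The second call resolves to the release candidate even though the app itself asked for 0.68.5, which matches the behaviour described above.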
So how did we fix it? I was basically on the plane, waiting to depart, and I opened a GitHub issue trying to explain what the problem was and how, with a combination of patch-package and so on, you could effectively mitigate the issue in your project. But this was not ideal. The fact that you had to use patch-package or do crazy editing in your node_modules folder is not ideal. You should not be doing that. So together with the rest of the release crew and folks from the community, we released patches for all the versions of React Native down to 0.63. And this was quite a big effort, because imagine that you have a branch for a release that you did three years ago. You don't know if the CI is working, you don't know what the status is there, and you want to attempt to release a new version from there. Not easy. Well, we made it, and we really went the extra mile. People worked overnight to fix it, so that you would just have a patch version of React Native containing only the fixes necessary to resolve the issue. And we went down to 0.63 because we keep an eye on the market share of the various versions of React Native that you folks download from npm, and 0.63 allowed us to cover 99% of the downloads. So basically we were able to fix it for everyone.
The definitive fix, actually, was to reach out to Sonatype, which is the company that runs Maven Central, and ask them to remove the artifacts. This took a little bit longer, about two days, because initially we thought, okay, they're never going to remove that. Maven Central is an immutable data store, you're not supposed to delete libraries. But in this case, this was the only solution to the problem. So they helped us a lot in fixing this as well.
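The decision of how far back to patch can be sketched as a cumulative-share calculation. The numbers below are made up purely for illustration (the talk only states that going down to 0.63 covered 99% of downloads), shares are expressed in per-mille to avoid floating-point rounding, and versions_to_patch is a hypothetical helper.

```python
from typing import List, Tuple


def versions_to_patch(
    shares: List[Tuple[str, int]], target: int
) -> Tuple[List[str], int]:
    """Walk from the newest version backwards until the cumulative
    download share (in per-mille) reaches the target."""
    covered = 0
    selected: List[str] = []
    for version, share in shares:
        if covered >= target:
            break
        selected.append(version)
        covered += share
    return selected, covered


# Made-up download shares per minor version, newest first, in per-mille.
HYPOTHETICAL_SHARES = [
    ("0.70", 300), ("0.69", 200), ("0.68", 150), ("0.67", 100),
    ("0.66", 80), ("0.65", 60), ("0.64", 50), ("0.63", 50),
    ("0.62", 5), ("0.61", 5),
]

selected, covered = versions_to_patch(HYPOTHETICAL_SHARES, target=990)
print(selected[-1], covered)
```

With this invented distribution the walk stops at 0.63; with real download statistics the cut-off would simply fall wherever the target share is reached.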
So now I want to share a couple of lessons learned, things that the rest of the release crew, the React Native team and I took home from this experience. First, a lot of those plus dependencies have been removed.
7. React Native Support and Incident Culture
We have implemented fixes in our Android and iOS infrastructure to prevent similar problems. We now have a release support window that declares which versions of React Native we support. This covers roughly 70-75% of the market share and provides a one-year window for receiving patches and security updates. We acknowledge that our response time was slow in this open source incident. At Meta, we use SEV levels to express the gravity of an incident, with SEV2 being the level for major problems. Libraries also contributed to the problem by copying patterns that may contain anti-patterns.
We have fixes in place inside both our Android and iOS infrastructure to prevent problems similar to this one. And we also looked into implementing what is called a release support window. Historically, you were recommended to be on the latest React Native version, which works, but that's not always possible because your project might lag behind a bit. So now we are more intentional about which versions of React Native are actually supported. If you look into the React Native releases working group, you will see that we declare which versions of React Native we support. And by support, we mean that if you find a bug in one of those versions, we are committed to fixing it and releasing a patch for you. This covers roughly 70 to 75% of the market share so far, and it allows us to cover an entire year of releases. That means that although you should update your React Native version and the version of any dependency that you use, you have a one-year window in which you can always receive patches, security patches and so on.
Another thing that we learned is about incident response time. I personally think that we were quite slow to respond here. The problem is that this was an open source incident. We don't have any telemetry on React or React Native, so we don't know how things are going for you. We don't know if your builds are broken, we don't know which dependencies you use, and so on and so forth. So when someone tells me that their build is broken, I have no way to sense that this means every build on the planet is broken.
So I want to touch a little bit on the incident culture at Meta, to let you understand how we try to integrate the Meta culture with the open source space. At Meta, we use SEV levels, which are like a market standard to express the gravity of an incident. We have SEV4, which is the lowest level and is just a heads-up. We open a SEV4 incident whenever there is something that might break, but maybe not. For example, we did a huge migration of the monorepo structure of React Native, and whenever you move a lot of code, things can break. That's why we opened a SEV4 in that case. We have SEV3, which means significant problem, resolution is moderate or high priority: something broke, someone should look into it. SEV2, which is major problem, resolution is very high priority: a significant group of people is affected. SEV1, red alert, all hands on deck: generally one of the Meta products is down. And then SEV0, which means company-level crisis: multiple products are down or things are really red. In this case, this was a SEV2, because we had the entire React Native open source ecosystem broken, and we had to get people woken up and find a fix as soon as possible.
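The SEV scale described above maps naturally onto an ordered enumeration. The sketch below only encodes the levels and descriptions given in the talk; the Sev class and the at_least_as_severe helper are hypothetical names, not part of any Meta tooling.

```python
from enum import IntEnum


class Sev(IntEnum):
    """Incident severity; a lower number means a more severe incident."""
    SEV0 = 0
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


DESCRIPTION = {
    Sev.SEV0: "company-level crisis, multiple products are down",
    Sev.SEV1: "red alert, all hands on deck",
    Sev.SEV2: "major problem, resolution is very high priority",
    Sev.SEV3: "significant problem, resolution is moderate or high priority",
    Sev.SEV4: "heads-up, something might break",
}


def at_least_as_severe(level: Sev, threshold: Sev) -> bool:
    """True when `level` is as severe as `threshold` or worse."""
    return level <= threshold


incident = Sev.SEV2  # the level assigned to the broken Android builds
print(f"{incident.name}: {DESCRIPTION[incident]}")
print(at_least_as_severe(incident, Sev.SEV3))
```

Because lower numbers are more severe, comparisons read the opposite way from what the numbers suggest, which is why the helper wraps the comparison.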
Another thing that we took home is library best practices. As I said before, libraries were exacerbating the problem here, because every library has its own build logic. Basically, what happens is that there are patterns in the community that get copied over. Maybe you start a new library and you copy another library, and maybe there is an anti-pattern there that gets passed around in a group of libraries.
8. Investing in Create React Native Library
We are investing in create-react-native-library to provide better recommendations for creating resilient libraries. Other lessons learned: avoid shipping on Fridays (although in this case the Friday release was lucky, since it left the weekend to prepare patches), and dealing with incidents from a plane is challenging. Read the postmortem on the React Native blog for more technical details.
That's why we are investing in create-react-native-library, which is our entry point for creating new libraries. There is an RFC open on the website, which is the golden template for create-react-native-library. So we want to offer better recommendations, approved by Meta, on how to create libraries for React Native that are resilient to incidents like this one.
And then, a couple of other lessons learned which are quite personal, but which apply to everyone, I think. The day when this script was invoked was a Friday. So even if you're doing mobile, don't ship on Friday. And in this particular case, actually, if you look at the release branch of React Native 0.71, whenever you see a revert of a bump version number commit, it's an attempt to publish a version which failed and got restarted. So to be fair, this release was supposed to go out on Tuesday, then it slipped to Thursday, and then to Friday, and they assumed that everything would work. Well, that was not the case. I think that in this particular scenario, actually, the fact that we released on Friday was luck. We got really lucky that an issue like this one arose over the weekend, so that we had time over the weekend to prepare patch releases for all the versions of React Native across the market. If this had happened on Monday morning, we would have disrupted the workflow of banks, of people that are using React Native in production and cannot ship because of issues like this one. So I think now we don't do releases on Friday, just to be safe. But in this particular case, we got lucky.
And the other thing is that airplane Wi-Fi is really terrible. Make sure you don't have to deal with incidents on a plane, because especially if you need to transfer big binaries, like artifacts for mobile, it's going to take you ages. If you're interested in learning more about this incident, we actually published a postmortem on the React Native blog. You can read it through; it goes into more detail on the technicalities of this problem. And with this, I want to thank you very much for listening.
9. Republishing to Maven and Deleting Versions
My one question is, how did you republish to Maven, or did you not? We republished to Maven Central with new coordinates, changing the library name to react-android. We implemented aliasing so that requests for react-native resolve to react-android. Sonatype normally refuses to delete packages and asks publishers to release a new version instead, but in this case they agreed to remove the offending version and found the need to delete one interesting.
My one question is, how did you republish to Maven, or did you not? So, yeah, we republished to Maven. Actually, the libraries are now on Maven Central. And basically what we did is we used new coordinates. So the library is not called react-native anymore, it's called react-android. Yeah, that'll fix it. Yeah.
And we also implemented all the aliasing. So if in your Gradle file you request react-native, we will actually pipe that request to resolve react-android. Yeah, because I think that's one of those fixes where it's really not possible to put it in the same place. It's one of those things where you didn't have a quick fix. Absolutely.
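The aliasing idea can be sketched as a lookup that rewrites the old coordinates to the new ones. This is only an illustration in Python, not the Gradle logic that ships with React Native: the ALIASES table and the substitute helper are hypothetical, and pinning the result to the project's own React Native version is an assumption about how such a redirect would avoid the plus problem.

```python
from typing import Dict, Tuple

# Old (group, artifact) pairs mapped to their new coordinates.
ALIASES: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("com.facebook.react", "react-native"): ("com.facebook.react", "react-android"),
}


def substitute(coordinates: str, project_version: str) -> str:
    """Redirect an aliased dependency to its new name, pinned to the
    version of React Native that the project actually uses."""
    group, artifact, _requested = coordinates.split(":")
    if (group, artifact) in ALIASES:
        new_group, new_artifact = ALIASES[(group, artifact)]
        return f"{new_group}:{new_artifact}:{project_version}"
    return coordinates


# A library still asking for the old name with a plus dependency.
print(substitute("com.facebook.react:react-native:+", "0.71.0"))

# Unrelated dependencies pass through untouched.
print(substitute("androidx.core:core-ktx:1.9.0", "0.71.0"))
```

With a redirect like this, libraries that still declare the old plus dependency no longer depend on whatever happens to be the highest version in a remote repository.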
Just staring down the deep hole of Lufthansa Wi-Fi, I was looking through the requests, because Sonatype uses Jira to interact with customers. Yeah. I'm so sorry for that. I'm not surprised. Basically, I was looking through the requests to delete packages, and they were always answering, no, we never delete anything, just publish a new version of the library. If you made a faulty publish, you can bump the version. Yeah, that's also true on npm and also on crates.io for Rust, it's the same thing. Yeah.
And I was like, whatever I publish, I would just make this situation worse. Because, you know, there is that version that is offending because its version string is higher than the others. So just remove it. And they were great in doing that. I think they looked at it and they were like, you know what? I never thought I would find a reason where I needed to delete a version. Yeah, and I found one. Absolutely. Their response was like, oh, that is really interesting.
10. Lessons Learned and Manual Testing
One of the lessons learned is not to reuse things from the past without fully understanding why they were used and why they were abandoned. Mentally, I feel fine now, although there was a moment of panic. We were able to fix the issue with the help of my colleagues. However, we faced challenges in patching React Native 0.63 because the CI provider had removed deprecated software it depended on. It is important to have Docker images of the build environment to avoid build failures caused by changes in tool versions. We also have a manual testing process and rely on CI tests and internal tests on the Meta infrastructure.
I'm really sorry. But yeah, they were really helpful in this. And one of the lessons we got from this is also: don't attempt to reuse things that you used in the past without checking. Basically, in this case, we should have done more Git archaeology to fully understand why Maven Central was used at the time and why it was not used anymore. We just thought, OK, it's there, just keep on reusing it. But yeah, it came back to bite us.
How do you feel mentally now that it's been a while since that happened? Well, it's fine. On the day I was like, oh, my God, this is really bad. You know, I was in business class, so I got lucky there, and I was like, OK, let's try to fix as much as possible. And I asked my colleagues in other time zones to take the handover, so, I mean, we were able to fix it. But over the weekend, we stayed awake and did all the patches. And really, one problem there was this: to do a patch of React Native 0.63, you need Xcode 13, and then CircleCI is like, no, I removed Xcode 13. This is deprecated software. Why do you need it? And I'm like, I need to publish a three-year-old library, and I need Xcode 13 because this stuff builds only with Xcode 13, and so on. So I think another lesson learned here, a more technical one, for both Android and iOS: try to have Docker images of your build environment. Because if you build directly on the CI, as soon as they change the Java version or the Node version or the version of any tool in that environment, you can't build anymore. For Android, we do have Docker images, so I can go down to, I don't know, old versions of React Native and say, OK, rebuild that.
They all appeared now. OK. That's a big one. Sorry about that. So, do you have a manual testing process before releasing new versions? Yes, we do. We do have a series of CI tests, so mostly unit tests. We rely a lot on internal tests that we run on the Meta infrastructure. For example, whenever you send a pull request against React Native, we import it internally and we run the Oculus app against it. We run the Facebook app against it.
11. React Native Release and Optimization
When releasing React Native, we test that you can create a project and start the app immediately. Meta consumes React Native from source, straight from main, while open source users consume published releases, where building takes more time. We aim to make that experience as optimal as possible.
So if you broke something badly, we will discover that. But when we do a new release of React Native, we test that you are able to create a project, that the new app starts and so on and so forth, because the way React Native is used in open source and internally at Meta is slightly different. Meta consumes React Native from source, from main. So your changes go directly into the Facebook app that goes out next week. For open source, on the other hand, building React Native takes time. You take more care. Yes, I mean, obviously, we want to make sure that with one line of code, you're able to create a new app and start it immediately without relying on build caches or so. So, yeah, we just try to make it as optimal as possible.
12. SEV Zero Incident and Reaction
Have you ever received a SEV zero? Thankfully, I have never received one. Around October 2022, as far as I recall, everything was down, and that was a SEV zero incident. I'm glad I work on mobile, because the problems we can have in open source are never as severe as those of services running in production. There was no blame culture around the incident, and the reaction was to fix it as soon as possible for the open source community. We aimed to provide the best solution and went as far as creating patch versions of React Native to reach as many users as possible.
One more question, since the thing wasn't showing up: have you ever received a SEV zero? And if yes, how did you, how did anyone handle it? Did anyone cry? So, I have never received a SEV zero, thankfully. But if you recall, in, I think, October 2022, I believe... Oh, the servers were down or something. Yes, everything was down. I think Meta cut itself off from the internet, and that one was a SEV zero. Multiple people were awake and fully hands-on with that. I'm glad that I work on mobile. I mostly look at Android, and incidents there are quite mad: if the app starts to crash, well, OK, I can do a new release and put it on the store, but it will take hours before it is out, you know? So I'm glad that I'm not a production engineer. I'm not an SRE. I'm on call at times because of open source support and things that need immediate attention, or scenarios like this one. But the gravity of the open source problems that we can have is never as high as for services that are running in production.
OK, so just one more, because I feel like you gained a lot of traction. It was basically asking if anyone got mad at you. What was the reaction from your lead and teammates? Like, I hope, a lot of other tech companies out there, we don't have a blame culture around incidents. Incidents can happen. I actually was not in the release crew. I didn't run that script, so I was not responsible. But actually, I wrote the infrastructure that caused this, so I felt responsible, obviously. The reaction was like, yeah, we broke stuff, let's fix it as soon as possible, because we care so much about our open source community. So let's try to give them the best solution. When we had those snippets and patches out there, it was like, sure, but can we do better? Can we do patch versions for the React Native versions out there, down to 99 percent of the users? Yes, we can. So let's do it.
13. Lessons Learned and Q&A
At the end of the day, a lot of lessons were learned. The lesson learned is to stop suppressing warnings and aim for zero warnings. If you have any more questions, there is a Q&A room available. Please be back in nine minutes for the innovation in the React panel.
So, yeah, I mean, at the end of the day, a lot of lessons learned. And yeah, like I haven't been fired because of this. And I think the last thing that I want to say is, isn't the lesson learned to actually stop suppressing warnings? Yeah, yeah. I'm a big fan of going zero warning. No more TS ignores. That could fix it. Blink twice if you're safe. Yeah, OK, sure, sure. You didn't blink, actually. It's OK, I'm safe. I'm safe. I'm safe. Awesome.
I think we're done. No, we are done. Yeah, but he didn't blink. So I'm scared now. I blinked twice. OK, cool. Trust me. OK, yeah. So if you have any more questions, and again, sorry for the delay that happened here afterwards, there is a little Q&A room over there. And please be back in, oh, that doesn't have the time to do it, nine minutes, because we're going to have an Innovation in React panel. So, yeah.