Video Summary and Transcription
Welcome to the talk, Art & Entropy, Introducing Chaos in Your Front-End. Chaos engineering is a practice invented by Netflix in 2011 to observe how a system reacts to intentional disturbance. Applying chaos engineering to the frontend is experimental but necessary, as a broken frontend can negatively impact the user experience. Intentional perturbations in the frontend can be induced in several areas, starting with HTTP requests degraded by a slow 3G network or unstable Wi-Fi. Tools like the Chaos Frontend Toolkit can be used to experiment with chaos engineering in the frontend and embrace breakage as part of the application's story.
1. Art & Entropy: Introducing Chaos in Your Front-End
Welcome to the talk, Art & Entropy, Introducing Chaos in Your Front-End. Have you heard of Kintsugi? It's the ancient Japanese art of repairing ceramics and porcelain using gold-sprinkled lacquer. In the early days of Kintsugi, collectors would intentionally break their own ceramics and have them repaired. Kintsugi can depart from its original medium to adapt to a new support. Is our web app perfect, bug-free, on all browsers, in all responsive sizes? No. But does it need to be perfect? No, it just needs to be resilient, to work no matter what. We are going to use chaos engineering to make sure that our apps are resilient. Chaos engineering is a practice invented by Netflix in 2011. The aim is to observe how a system reacts to intentional disturbance. The benefits of chaos engineering include discovering and identifying weaknesses in a controlled environment and understanding the interdependencies of the components of our systems.
Hi everyone, and welcome to the talk, Art & Entropy, Introducing Chaos in Your Front-End. My name is Thibaut, and you can find me on Twitter with the hero name Pseudo. Let's get started.
Have you heard of Kintsugi? It's the ancient Japanese art of repairing ceramics and porcelain using gold-sprinkled lacquer. As a philosophy, Kintsugi treats breakage and repair as part of an object's history, rather than something to be disguised. In the early days of Kintsugi, around the 15th century in Japan, collectors were so fond of the result that they would intentionally break their own ceramics and have them repaired. Over the years, Kintsugi has evolved and blended with contemporary art. Examples include the work of Jan Vormann, who uses Legos to repair walls in the urban landscape, but also the work of Rachel Sussman, who uses gold lacquer to repair asphalt. In this way, Kintsugi can depart from its original medium to adapt to a new support. A new support, and why not the web?
Because yes, let's ask ourselves. Is our web app perfect, bug-free, on all browsers, in all responsive sizes? No. But does it need to be perfect? No, it just needs to be resilient, to work no matter what. It can be broken and functional, broken and beautiful, like ceramics repaired with Kintsugi. Sounds good. So, how do we make sure that our apps are resilient? Well, we are going to use chaos engineering to do just that. Maybe you've heard the term before. Invented by Netflix in 2011, this practice draws a parallel with chaos theory in science, which is the study of the evolution of entropy, or disorder, in a system. The aim is to observe how a system reacts to intentional disturbance. For example, intentionally crashing a server and watching how the infrastructure reacts. Will it pass requests on to other servers? Will it restart a server and redirect traffic? How quickly? Basically, we break stuff and then we watch. The good thing is, instead of waiting for things to blow up to check that our infrastructure will hold up, we do it ourselves. And that's how we improve, by learning how our infrastructure reacts to chaos.
And where it reaches the next level is that, at Netflix, they do that in production. They crash their infrastructure regularly, sometimes several times a day, to make sure that everything's working properly. And this has prompted them to set up highly advanced automated restoration processes, to the point where they are able to restart entire regions of the infrastructure in just a few seconds, without human intervention. These are the benefits of chaos engineering. First of all, it will help us discover and identify weaknesses in a controlled environment. We know what we broke, so we can revert or fix it easily if things go sideways during the experiment. It will also give us an increased understanding of the interdependencies of the components of our systems. Interdependencies.
2. Applying Chaos Engineering to the Frontend
In a previous company, we discovered the interdependencies of our system when an unresponsive infrastructure caused user login requests to stack indefinitely. The benefits of chaos engineering include confidence and implementing a disaster recovery protocol. Applying chaos engineering to the frontend is experimental but necessary, as a broken frontend can negatively impact the user experience. Chaos engineering is applied in four steps: defining the nominal state, making a hypothesis, creating perturbations, and comparing states. Creating disruptions on the frontend can be done through various areas, such as HTTP requests with slow 3G network or unstable Wi-Fi.
For example, in a previous company, we had an old legacy server which was generating PDFs, but also, and everyone had forgotten at the time, JYP lookups for security purposes. And one day, the infrastructure on which this server was running became unresponsive. That started a chain reaction where user login requests, which relied on JYP, also became unresponsive. And that's when we realized that there were no defined timeouts on those calls, which, in turn, meant that user login requests were stacking indefinitely, therefore DDoSing our entire infrastructure. And, unfortunately, that's how we rediscovered the interdependencies of our system.
The third benefit of chaos engineering is confidence. Would you rather wait to be called at 7am on a Sunday morning because your application is down, or would you rather break it yourself on a Tuesday in the early afternoon and see that everything is working all right and that you can sleep peacefully on the weekend? And finally, it will force us to implement a disaster recovery protocol. We won't wait for an accident to happen before we think about solutions.
The thing is, in all the resources that exist on the subject of chaos engineering, books, documentation, it's all about infrastructure. So, I said to myself, why not apply it to frontend this time, because no matter how resilient your infrastructure is, how many load balancers you have, how many redundancies you have, if your frontend is broken, the users don't care. Their whole experience on your app will be negative. So, as I said, there are no resources on the subject of chaos engineering applied to frontend, nor are there any all-in-one tools or toolboxes for doing so. So, what comes next in this talk is experimental. Let's see how far we can push this subject. Let's start with the basics of chaos engineering.
It is applied in four steps. First, we'll define the nominal state of our system. For example, the user is able to log in, the user is able to watch the last season of The Witcher. Second, we'll make a hypothesis. We'll assume the continuity of the nominal state during the experiment. The user is still able to log in, the user is still able to watch the last season of The Witcher, and we will use two groups for that, a control group and a test group. Third, we'll intentionally create a perturbation reproducing a real event, for example, a server crash. And finally, we'll compare the states of the two groups and we will try to disprove the hypothesis put forward earlier. Are our users still able to log in and watch the last season of The Witcher? If they aren't, then we just identified a flaw in the resiliency of our system. Our aim is to apply this chaos engineering experiment to the frontend. For steps one, two, and four, it's not too different from classic chaos engineering. So we are going to look into step three and how we can create disruptions on the frontend. There are a few areas of perturbation that we can see. The first one is HTTP requests. It can be in the form of a slow 3G network, unstable Wi-Fi, an unresponsive CDN, etc.
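The four steps above can be sketched as a tiny experiment runner. This is only an illustration under stated assumptions: the names `ChaosExperiment`, `runExperiment`, `LoginPage`, `canLogIn` and `crashAuthServer` are hypothetical and do not come from the talk or from any library, and the "system" is a toy object rather than a real app.

```typescript
// Hypothetical shapes for illustration; none of these names come from the talk.
interface ChaosExperiment<System> {
  // Steps 1 and 2: the nominal state we assume will keep holding.
  isNominal: (system: System) => boolean;
  // Step 3: a perturbation that reproduces a real-world event.
  perturb: (system: System) => System;
}

interface ExperimentResult {
  controlNominal: boolean;
  testNominal: boolean;
  hypothesisHolds: boolean;
}

function runExperiment<System>(
  createSystem: () => System,
  experiment: ChaosExperiment<System>,
): ExperimentResult {
  // The control group is left untouched; the test group is perturbed.
  const controlGroup = createSystem();
  const testGroup = experiment.perturb(createSystem());

  // Step 4: compare both groups against the nominal state.
  const controlNominal = experiment.isNominal(controlGroup);
  const testNominal = experiment.isNominal(testGroup);
  return {
    controlNominal,
    testNominal,
    hypothesisHolds: controlNominal && testNominal,
  };
}

// Toy system: a login page that may or may not survive an auth server crash.
interface LoginPage {
  authServerUp: boolean;
  hasCachedSession: boolean;
}

const canLogIn = (page: LoginPage): boolean =>
  page.authServerUp || page.hasCachedSession;

const crashAuthServer = (page: LoginPage): LoginPage => ({
  ...page,
  authServerUp: false,
});

const fragileResult = runExperiment<LoginPage>(
  () => ({ authServerUp: true, hasCachedSession: false }),
  { isNominal: canLogIn, perturb: crashAuthServer },
);
console.log(fragileResult.hypothesisHolds); // false: a resilience flaw was found
```

In a real frontend experiment, `isNominal` would more likely be an end-to-end check (for example, "the login form submits and the dashboard renders") than a pure function over a plain object.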
3. Intentional Perturbations and Localization
GitHub is an example of an application that handles slow or no response well. Even without CSS and JavaScript, it remains functional. To induce perturbations in our app, we can add random delays and failures to HTTP requests. Localization can be challenging due to language and design differences, leading to broken elements for certain user groups.
The real-world perturbations can be summarized as a really slow HTTP response or just no response at all. One example of an application which handles this very well is GitHub. All of their CSS and JavaScript is hosted on a CDN at github.githubassets.com. We can simulate what happens if the CDN fails in Chrome by blocking all network requests to that domain. And here's what it looks like. So here I'm on the React repository and I'm looking into the different folders and files to find what I am looking for. And yeah, here I am able to find the file I want and all the code associated. And so we can see two things. First of all, well, it's not pretty. But second, it's working. We can actually go through the repository and look at the code. And this is a good example of resiliency. Even with no external CSS and no JavaScript, GitHub is working.
And so here is how we could intentionally induce perturbations in our app. We can proxy XHR and fetch. This can be done in a few lines of code. For example, we add one chance out of two to add a random delay and one chance out of a hundred to completely fail the request. We put that in the app, and with that we'll quickly see if our app doesn't handle delays or errors well.
A second area of perturbation is localization. This can be tricky to handle well because of right-to-left languages, Latin fonts, spacing, etc. Among the real-world perturbations, as well as right-to-left languages, there are also the verbose languages. Let's take an example. I just developed a nice button. It's been approved by the designer. Let's ship it to production. But wait. Here is what Romanian users will actually see. I gave my button a fixed width and now it's broken. But the issue is I don't speak Romanian. I speak French and just enough English to give this talk.
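Coming back to the HTTP perturbation described above (one chance out of two of a random delay, one chance out of a hundred of a failed request), here is a minimal sketch of what such a proxy could look like. It is not the code shown in the talk: `ChaosOptions`, `planPerturbation` and `withChaos` are hypothetical names, the maximum delay is an assumed parameter, and only a fetch-like function is wrapped (XHR is left out).

```typescript
// Hypothetical options; the probabilities mirror the ones quoted in the talk.
interface ChaosOptions {
  delayProbability: number;   // e.g. 0.5 for "one chance out of two"
  failureProbability: number; // e.g. 0.01 for "one chance out of a hundred"
  maxDelayMs: number;         // assumed upper bound for the random delay
  random?: () => number;      // injectable for tests, defaults to Math.random
}

// Decide, for one request, whether to fail it and how long to delay it.
function planPerturbation(options: ChaosOptions): { fail: boolean; delayMs: number } {
  const random = options.random ?? Math.random;
  const fail = random() < options.failureProbability;
  const delayMs =
    random() < options.delayProbability
      ? Math.round(random() * options.maxDelayMs)
      : 0;
  return { fail, delayMs };
}

// Wrap any promise-returning request function (such as fetch) with chaos.
function withChaos<Args extends unknown[], Result>(
  request: (...args: Args) => Promise<Result>,
  options: ChaosOptions,
): (...args: Args) => Promise<Result> {
  return async (...args: Args): Promise<Result> => {
    const { fail, delayMs } = planPerturbation(options);
    if (delayMs > 0) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
    if (fail) {
      // fetch rejects with a TypeError on network errors, so we mimic that.
      throw new TypeError("Chaos: simulated network failure");
    }
    return request(...args);
  };
}

// Possible browser usage (not executed here), in a test or staging build only:
//   const realFetch = window.fetch.bind(window);
//   window.fetch = withChaos(realFetch, {
//     delayProbability: 0.5,
//     failureProbability: 0.01,
//     maxDelayMs: 3000,
//   });
```

Keeping the decision logic in `planPerturbation` separate from the wrapper makes the probabilities easy to unit test with a fixed `random` function.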
4. Intentional App Breakage and Tools
We can intentionally break our app on localization by using pseudo localization, altering the text with more letters and characters. Timers can be perturbed by browsers throttling them, causing delays. Manipulating the navigation history can simulate users' backward and forward movements. Additional perturbations include simulating double clicks, checking for accessibility issues, and testing mobile tablet viewports. There are tools available for chaos engineering in the frontend.
So, how can we intentionally break our app on localization while still having a good developer experience? Well, we can use what is called pseudo-localization. This method replaces the text with an altered version with more letters and characters while still making it readable by a human. And to do that we can use the pseudo-localization npm package by Tryggvi Gylfason, which can perform it automatically on the app.
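To make the idea concrete, here is a hand-rolled sketch of pseudo-localization. It does not use or reproduce the API of the npm package mentioned above: `pseudoLocalize`, `ACCENTED_LETTERS`, the `~` padding, the surrounding brackets and the 40% expansion ratio are all assumptions chosen for illustration.

```typescript
// Hypothetical mapping from plain letters to accented look-alikes.
const ACCENTED_LETTERS: Record<string, string> = {
  a: "á", e: "é", i: "í", o: "ö", u: "ü", c: "ç", n: "ñ",
  A: "Å", E: "É", I: "Î", O: "Ø", U: "Ü",
};

// Replace letters with accented variants and pad the string so that it is
// longer than the original, which mimics verbose languages.
function pseudoLocalize(text: string, expansionRatio = 0.4): string {
  const accented = Array.from(
    text,
    (char) => ACCENTED_LETTERS[char] ?? char,
  ).join("");
  const padding = "~".repeat(Math.ceil(text.length * expansionRatio));
  // The brackets make truncated or clipped strings easy to spot in the UI.
  return `[${accented}${padding}]`;
}

console.log(pseudoLocalize("Save")); // [Sávé~~]
```

The text stays readable for the developer, while a fixed-width button or a truncated label shows up immediately because the closing bracket goes missing.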
So, here is an example of what it would look like on the React website. And as we can see, the text is longer than the original with accents and other glyphs. And yet, the interface handles it pretty well. The buttons don't overflow and the menu takes the needed place to be shown.
Another area of perturbation is timers. We all assume that one second is equal to one second, right? And yet, that's not exactly true in Browserland. In some cases, browsers can choose to throttle timers to reduce CPU and battery usage. And this means that if your app expects a setTimeout callback to be called precisely after a specific amount of time, this timeout can be delayed for up to a minute, which may break your feature. You can intentionally reproduce this perturbation by proxying setTimeout and setInterval to intentionally add delay to the timers. If your app doesn't handle this, you will quickly notice the issue.
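The slide code is not part of this transcript, so here is a minimal sketch of a timer proxy under stated assumptions: `ScheduleFn` and `withTimerJitter` are hypothetical names, the scheduling function is injected rather than patched globally, and the jitter is read as "half of the time 500 ms later, otherwise 500 ms earlier", clamped at zero.

```typescript
// A setTimeout-like function: schedules a callback after delayMs.
type ScheduleFn = (callback: () => void, delayMs: number) => unknown;

// Wrap a scheduling function so every timer fires jitterMs later or earlier.
function withTimerJitter(
  schedule: ScheduleFn,
  jitterMs = 500,
  random: () => number = Math.random,
): ScheduleFn {
  return (callback, delayMs) => {
    // One chance out of two to add the jitter, otherwise remove it.
    const offset = random() < 0.5 ? jitterMs : -jitterMs;
    // A delay can never be negative, so clamp at zero.
    return schedule(callback, Math.max(0, delayMs + offset));
  };
}

// Demo with a fake scheduler that only records the requested delays.
const requestedDelays: number[] = [];
const fakeSchedule: ScheduleFn = (_callback, delayMs) =>
  requestedDelays.push(delayMs);

const lateTimer = withTimerJitter(fakeSchedule, 500, () => 0.1);
lateTimer(() => {}, 1000);
console.log(requestedDelays); // [1500]

// Possible browser usage (not executed here), in a test or staging build only:
//   const realSetTimeout = window.setTimeout.bind(window);
//   window.setTimeout = withTimerJitter(realSetTimeout) as typeof window.setTimeout;
```

Code that relies on a timer firing at an exact moment (animations chained with fixed delays, polling that assumes a strict interval) tends to misbehave visibly under this kind of jitter.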
So, here in this code, there is one chance out of two to add 500 milliseconds or remove 500 milliseconds.
And the fourth area is the history. We often see the user's journey as a linear and unidirectional path through our app. But we often forget that it is quite frequent for users to travel back and forth in the history to perform actions. And it can also happen by accident. And there is nothing quite as frustrating as losing your data in a big form when that happens. Again, we can manually create this perturbation with a couple of lines of code. Here, every minute, there is one chance out of 100 to randomly go back or forth in the navigation history.
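The history perturbation described above can be sketched in the same spirit. Again this is an assumption-laden illustration and not the speaker's code: `HistoryLike` and `perturbHistory` are hypothetical names, and the object passed in only needs `back()` and `forward()`, which the browser's `window.history` provides.

```typescript
// The subset of the browser History API that this sketch relies on.
interface HistoryLike {
  back(): void;
  forward(): void;
}

type HistoryMove = "back" | "forward" | "none";

// With the given probability, navigate one step back or forward at random.
function perturbHistory(
  nav: HistoryLike,
  probability = 0.01,
  random: () => number = Math.random,
): HistoryMove {
  if (random() >= probability) {
    return "none";
  }
  if (random() < 0.5) {
    nav.back();
    return "back";
  }
  nav.forward();
  return "forward";
}

// Possible browser usage (not executed here): roll the dice once per minute.
//   setInterval(() => perturbHistory(window.history), 60000);
```

Running this on a page with a long form is a quick way to find out whether unsaved input survives an accidental back navigation.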
So, those are the four areas of perturbation. But there can be even more. Why not simulate double clicks for every click of our users? Why not turn our app black and white with a single line of CSS to check for accessibility issues? Or let's go crazy. Why not force the mobile and tablet viewports, even when on desktop, to make sure that our apps are working there as well? So, yeah. By now, you must be looking at me like that. What we've seen looks interesting, but you may be lost on how to actually do that in your web apps. The good news is I lied. Earlier in this talk, I said that there were no all-in-one tools for chaos engineering in the frontend. But that's not true.
5. Chaos Frontend Toolkits and Embracing Breakage
I created the Chaos Frontend Toolkit, a browser extension and NPM library, to experiment with chaos engineering. Control and balance are essential in chaos engineering, setting boundaries and ensuring the right level of disruption. Implementing chaos engineering in production may not be feasible for the frontend, but it can be applied in test and staging environments. Let's embrace breakage and repair as part of our application's story.
Because I want you to experiment with chaos engineering, I created the Chaos Frontend Toolkit, which is a browser extension and NPM library that includes all the areas of perturbation I mentioned previously. You will be able to simulate double clicks and many more experiments with a single click.
And now, that's how I see you. You're ready to go back to your companies and break everything. But before I leave you, I want to ask you to listen to the words of Tissaia de Vries: "Magic is organizing chaos. And while oceans of mystery remain, we have deduced that this requires two things. Balance and control. Without them, chaos will kill you." And as Tissaia de Vries says very well, magic is organizing chaos. This requires two things, balance and control.
First of all, control. Don't forget that chaos engineering is a method of experimentation. We need to set up boundaries for this experiment on a given system and for a given time, respecting the four steps, nominal state, hypothesis, perturbation, comparison. And finally, balance. Chaos must be sufficiently present to test the system resilience, but not too present as it may disturb users.
And here, we come to what I think is a limit of chaos engineering applied to the frontend. In the chaos engineering experiments carried out by Netflix in production, disruptions are almost invisible to the users because everything happens in the backend. But on the frontend, the user will inevitably be affected by the disruptions that I mentioned earlier. So, I think it's impossible to implement it in production. In the best case scenario, it could be applied to test and staging environments. But given that chaos engineering applied to the frontend has never been done before, why not be a precursor and see what can be done with it? Let's apply the Kintsugi philosophy. Let's treat breakage and repair as part of the story of our application rather than something to be disguised.
This is the end of the presentation. Thank you for listening.