For example, at a previous company, we had an old legacy server that was generating PDFs but also, and everyone had forgotten this at the time, handling GeoIP lookups for security purposes. One day, the infrastructure hosting this server became unresponsive. That started a chain reaction where user login requests, which relied on those GeoIP lookups, also became unresponsive. And that's when we realized that there were no timeouts defined on those calls, which, in turn, meant that user login requests were stacking up indefinitely, effectively DDoSing our entire infrastructure. And, unfortunately, that's how we rediscovered the interdependencies of our system.
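The missing safeguard in that story was a timeout on the calls the login depended on. Here is a minimal sketch of what such a guard might look like in TypeScript. The helper name `withTimeout`, the `TimeoutError` label, the `/api/lookup` endpoint, and the 2-second budget are all assumptions made for illustration, not what that system actually used.

```typescript
// Race `work` against a timer: whichever settles first decides the outcome.
// The timer is cleared as soon as `work` settles so it does not linger.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      // A named error lets callers tell a timeout apart from other failures.
      const error = new Error(`operation timed out after ${ms} ms`);
      error.name = "TimeoutError";
      reject(error);
    }, ms);

    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// Usage sketch (hypothetical endpoint): give up after 2 s instead of
// letting login requests stack up behind an unresponsive dependency.
// const response = await withTimeout(fetch("/api/lookup"), 2000);
```

Note that this only stops waiting; it does not cancel the underlying request. With `fetch`, an `AbortController` signal could be passed in as well so the request itself is aborted when the timer fires.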
The third benefit of chaos engineering is confidence. Would you rather wait to be called at 7am on a Sunday morning because your application is down, or would you rather break it yourself on a Tuesday in the early afternoon and see that everything is working all right and that you can sleep peacefully on the weekend? And finally, it will force us to implement a disaster recovery protocol. We won't wait for an accident to happen before we think about solutions.
The thing is, in all the resources that exist on the subject of chaos engineering, books, documentation, it's all about infrastructure. So, I said to myself, why not apply it to the frontend this time, because no matter how resilient your infrastructure is, how many load balancers you have, how many redundancies you have, if your frontend is broken, the users don't care. Their whole experience on your app will be negative. So, as I said, there are no resources on the subject of chaos engineering applied to the frontend, nor are there any all-in-one tools or toolboxes for doing so. So, what comes next in this talk is experimental. Let's see how far we can push this subject. Let's start with the basics of chaos engineering.
It is applied in four steps. First, we'll define the nominal state of our system. For example, the user is able to log in, the user is able to watch the last season of The Witcher. Second, we'll make a hypothesis. We'll assume the continuity of the nominal state during the experiment. The user is still able to log in, the user is still able to watch the last season of The Witcher, and we will use two groups for that, a control group and a test group. Third, we'll intentionally create a perturbation reproducing a real event, for example, a server crash. And finally, we'll compare the states of the two groups and we will try to disprove the hypothesis put forward earlier. Are our users still able to log in and watch the last season of The Witcher? If they aren't, then we just identified a flaw in the resiliency of our system.
Our aim is to apply this chaos engineering experiment to the frontend. For steps one, two, and four, it's not too different from classic chaos engineering. So we are going to look into step three and how we can create disruptions on the frontend. There are a few areas of perturbation that we can see. The first one is HTTP requests. It can take the form of a slow 3G network, unstable Wi-Fi, an unresponsive CDN, etc.
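To make the four steps and the HTTP perturbation a bit more concrete, here is a minimal TypeScript sketch. Everything in it is an assumption made for illustration: the `Fetcher` type stands in for a real `fetch` call, and `perturb`, `Perturbation`, `NominalCheck`, and `runExperiment` are hypothetical names, not part of any existing chaos engineering tool.

```typescript
// A fetch-like function type, so the sketch runs without a real network.
type Fetcher = (url: string) => Promise<string>;

// Hypothetical description of an HTTP perturbation.
interface Perturbation {
  delayMs: number; // extra latency, e.g. a slow 3G network
  failureRate: number; // probability in [0, 1] of a simulated network error
}

// Step 3: wrap a fetcher so every call is delayed and may fail.
// `random` is injectable so an experiment can be made deterministic.
function perturb(
  fetcher: Fetcher,
  p: Perturbation,
  random: () => number = Math.random,
): Fetcher {
  return async (url: string) => {
    await new Promise<void>((resolve) => setTimeout(resolve, p.delayMs));
    if (random() < p.failureRate) {
      throw new Error(`chaos: simulated network failure for ${url}`);
    }
    return fetcher(url);
  };
}

// Steps 1 and 2: the nominal state is a check that should be true for both groups.
type NominalCheck = (fetcher: Fetcher) => Promise<boolean>;

// Step 4: run the same check for the control and test groups and compare.
async function runExperiment(check: NominalCheck, control: Fetcher, test: Fetcher) {
  const observe = async (f: Fetcher): Promise<boolean> => {
    try {
      return await check(f);
    } catch {
      return false; // an unhandled error means the nominal state was lost
    }
  };
  const controlOk = await observe(control);
  const testOk = await observe(test);
  return { controlOk, testOk, hypothesisHolds: controlOk && testOk };
}
```

A usage sketch: with a check such as "the login call returns ok", the control group gets the plain fetcher and the test group gets `perturb(fetcher, { delayMs: 3000, failureRate: 0.2 })`. If `hypothesisHolds` comes back false while `controlOk` is true, the perturbation exposed a resiliency flaw.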