Observability for React Developers


Observability is the ability to measure the current state of a system. Backend engineers are becoming more familiar with the 3 pillars of observability, and technologies such as OpenTelemetry that can be used to instrument applications and diagnose issues. Yet in the frontend world, we're behind the curve.
Join me as I dive into the tools and techniques we can use to instrument, monitor and diagnose issues in our production React applications. I'll cover RUM agents and React framework extensions, and the metrics and traces they provide, how to combine them with backend tracing for a holistic picture, and how Synthetic Monitoring and alerting in Observability platforms can help us be alerted to issues impacting users in the UIs we build and maintain.

This talk has been presented at React Advanced 2024, check out the latest edition of this React Conference.

FAQ

Carly Richmond's talk focuses on observability for both front-end and back-end services in web applications, particularly for React applications.

The tools discussed include RUM agents, OpenTelemetry, and synthetic monitoring.

The three types of telemetry information are logs, metrics, and traces.

A RUM (Real User Monitoring) agent is a tool used within JavaScript applications to collect metrics, diagnostic information, and tracing data to identify bottlenecks and performance issues in the front-end.

OpenTelemetry is recommended because it offers an open standard for telemetry data collection that is compatible with various vendor observability platforms, allowing for comprehensive front-to-back tracing.

Synthetic monitoring involves running scripts at fixed intervals to simulate user interactions or ping endpoints, helping catch issues before users encounter them. Tools like Playwright for JavaScript can be used for this purpose.

Synthetic monitoring can be integrated into the CI pipeline by using tools like GitHub Actions to run monitors as code during development and deployment stages, ensuring they are in sync with application changes.

Playwright allows for end-to-end testing by automating user interactions and can be used to simulate user behavior on different devices, providing insights into application performance.

Logs, metrics, and traces help React developers by providing insights into application behavior, identifying bottlenecks, and diagnosing performance issues in both front-end and back-end services.

Code examples for RUM and synthetic monitoring can be found in the resources mentioned in Carly Richmond's talk.

Carly Richmond
20 min
28 Oct, 2024

Video Summary and Transcription
In this talk, observability in React web applications is discussed, covering RUM agents, OpenTelemetry, and synthetic monitoring. The importance of logs, metrics, and traces in tracking application behavior and identifying issues is explained. The concept of a Real User Monitoring (RUM) agent for JavaScript applications is introduced. Instrumenting the RUM agent and React applications is explained, along with a caution to be mindful of the bundle size. Traces, OpenTelemetry, and instrumentation are explored, including how Core Web Vitals, traces, and OpenTelemetry can provide insights and enable auto-instrumentation. Synthetic monitoring using Playwright tests and the process of converting tests into monitors with configuration is covered. Finally, running monitor definitions in production is discussed, highlighting the ability to evaluate user behavior, simulate activities, and address issues proactively.

1. Observability in React Web Applications

Short description:

In this talk, I will discuss observability in React web applications, including both front-end and back-end services. We'll cover RUM agents, OpenTelemetry, and synthetic monitoring. The three pillars of observability are logs, metrics, and traces. I'll explain each of them and their importance in tracking application behavior and identifying issues. Additionally, I'll introduce the concept of a Real User Monitoring (RUM) agent for JavaScript applications, which helps in identifying bottlenecks and optimizing React applications.

Hi, React Advanced London. Welcome to Remote Day. It's great to see you all. My name is Carly Richmond, and I'm here today to talk about observability, but not as you may traditionally think about it. When we think about observability, we often focus too heavily on back-end services and what they're doing, when we need to understand the behavior of our web applications in the front-end as well. So we're going to talk about how we do that with React web applications, in addition to seeing what's happening in the back-end services, too.

The tools we're going to cover are RUM agents, OpenTelemetry, and synthetic monitoring. I'll show you some examples. We'll have snippets you can dive into afterwards if you want to have another look. But if you also want to ask me any questions afterwards, feel free to contact me online in various places on the interwebs. I'm a principal developer advocate and manager at Elastic, so you'll be able to find me easily on LinkedIn, too. Before that, I was actually a front-end engineer for ten years working at a large investment bank, so these are problems I've dealt with firsthand, which means I've got a lot of opinions on how this should be done.

But first, we need to talk about pillars. Quite often when you start diving into observability, people say, oh, well, we need to talk about the pillars. Let's talk about them as signals, because that's effectively what they are. These are three types of telemetry information that we can use in our applications to figure out what's going on, identify behaviors, and remediate issues that we might not necessarily anticipate. The first signal we tend to talk about is logs, and we all know what those are: they're those wacky messages we put into the browser console and look at when we're diagnosing issues in our application. The second is metrics, which are simply values that give us a rough indication of performance that we can track over time. You might think of throughput and latency, but we also have Google Core Web Vitals, which are a good indication of user performance, at least as a rough guide. And then we have traces. Now, you might think, well, what are these? But actually, you've been using them for a while without necessarily thinking about it. These are those bars with underlying spans that we've seen for showing the amount of time it takes to go through different stages in our application, or for seeing how long particular network calls take to come back, which we're all used to when diagnosing issues with React applications trying to connect to back-end services and serverless functions.

But we need to crack out the RUM. And sadly, I'm not talking about Jack's favorite drink here; I'm talking about a Real User Monitoring agent. This is a simple agent that you initialize within your JavaScript application that's going to collect metrics and diagnostic information, along with tracing, in the front-end application. That means we can use it to identify bottlenecks, see if we've got particular scripts that are taking a long time to load, and other pieces of useful information when we're trying to figure out what the bottleneck is within our React application. I'm talking about a RUM agent because at this moment in time, unless someone wants to correct me, we don't have anything that's generic enough and cross-vendor.

2. Instrumenting the RUM Agent and React Applications

Short description:

To instrument the RUM agent, you can use a script tag, though that's generally only suited to older HTML or React applications you're no longer actively developing. It's better to install the APM RUM package using npm and add the necessary options. The agent requires information such as service name, distributed tracing origins, server URL, and optional attributes like service version and environment. For React applications, custom framework integrations may be necessary, such as the APM RUM React extension. However, be cautious about the bundle size, as adding agents can increase it significantly.

You need to use the RUM agent that's associated with the observability platform that you're going to be ingesting data into. However, keep your eyes peeled on the client instrumentation group for OpenTelemetry. If you want to join the SIG, the details are on their site as well. That's really the group that's looking to try and establish more of an open standard that's not so vendor locked-in. But for this talk, we're going to use the Elastic one.

So there's two ways to instrument them, and this tends to be a similar pattern across different agents. You can either use a script tag. Generally, I'd only recommend this in an HTML application, or in a React app that perhaps is old, that you're not touching anymore, and you just want some basic telemetry included. So you just include the script tag. But generally, the pattern I would recommend is that you install the Elastic APM RUM package using npm, and then you call init and add in the appropriate options.

So the options you give the agent are what categorize the signals your application is going to send. The first is service name. If you're working on loads of different applications, which was certainly my experience, you need to know which application the signals correspond to, and also whether it's the React front end or those back-end services when we get to front-to-back tracing. You've then got distributed tracing origins. This is needed because, by default, the RUM agent operates on the same-origin policy, so you need to make sure that it's able to add the trace parent header to those HTTP requests going to back-end services, so that you can actually see the full trace, which we'll see later. Then you've got the server URL, which is the Elastic deployment you're going to be sending to. And then you've got the optional attributes of service version and environment which, while not necessary, you can use if you want to compare errors and see how many service versions a problem goes back, to try and identify its source, or if you're doing environmental comparisons.
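As a rough sketch of what that initialization might look like (the service name, URLs, and origins here are placeholder values, not ones from the talk):

```typescript
import { init as initApm } from '@elastic/apm-rum'

// Placeholder values: substitute your own service details and APM endpoint.
const apm = initApm({
  serviceName: 'menu-ordering-ui',                // categorizes signals per application
  serverUrl: 'https://my-deployment.apm.example.com', // the Elastic deployment to send to
  serviceVersion: '1.2.0',                        // optional: compare errors across versions
  environment: 'production',                      // optional: compare across environments
  // Allow the trace parent header on cross-origin calls to these back ends:
  distributedTracingOrigins: ['https://api.example.com'],
})
```

The returned `apm` handle can also be used later, for example to report custom errors with `apm.captureError(...)`.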

But we also need to think about the React element, because different SPA frameworks behave differently, and sometimes we may need to add particular custom framework integrations alongside to make sure that we get the appropriate telemetry for our application. In this particular example, you use the APM RUM React extension, which is another npm install, and you use that to access the ApmRoutes component, which then wraps your React Router DOM routes. This is as of version six, so if you're using earlier versions, please make sure that you're using the right pattern. It's all covered in the documentation.
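A minimal sketch of that wiring for React Router v6 (the route paths and the LoginPage/OrderPage components are hypothetical):

```typescript
import { BrowserRouter, Route } from 'react-router-dom'
import { ApmRoutes } from '@elastic/apm-rum-react'
// LoginPage and OrderPage are hypothetical components from your own app.
import { LoginPage } from './pages/LoginPage'
import { OrderPage } from './pages/OrderPage'

// ApmRoutes stands in for React Router's Routes component and creates a
// route-change transaction for each matched route.
export function App() {
  return (
    <BrowserRouter>
      <ApmRoutes>
        <Route path="/login" element={<LoginPage />} />
        <Route path="/order" element={<OrderPage />} />
      </ApmRoutes>
    </BrowserRouter>
  )
}
```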

Now, just a warning: we need to think about the types of agents that we're adding into our applications. Because if we're not careful, and we're not using the appropriate optimizations, they are going to bloat the build. There's no way around it; it doesn't matter if we're using a RUM agent, Google Analytics, Hotjar, or something else, these things can really ramp up the bundle size. So make sure you're also following the appropriate instructions for the bundler you're using, so that you optimize the production version as well. But getting back to what these things give us: all sorts of different metrics are captured.

3. Traces, Open Telemetry, and Instrumentation

Short description:

In Elastic, Core Web Vitals provide insights based on live user traffic, capturing actual user actions. Traces allow for tracking the loading of bundles, HTML invocations, and asset loading, helping identify bottlenecks. OpenTelemetry enables connecting the back-end service to the underlying telemetry. Specify attributes, endpoint, bearer token, and exporter protocols for auto-instrumentation. Node resource detectors allow for selective metric sending. Use NODE_OPTIONS for auto-instrumentation, or the manual approach for custom spans and transactions.

So here in Elastic, you can see I've got Core Web Vitals. And the nice thing about this is that, unlike when we use tools such as Google Lighthouse where we're just running a generic report, this is based on what a user is actually doing in your application; it's captured from live user traffic, not as a synthetic event as you would have with a Lighthouse report. You've also got statistics on page loads, and you've got agent information, so you can see that perhaps a user is encountering an error in a specific browser, and use all of that kind of information to really find out what's going on as a user is using your React application.

But furthermore, we've also got the notion of traces. So remember I talked about these spans that we kind of knew about but maybe weren't too sure what they actually were? This is what I'm talking about: being able to see the loading of particular bundles, the invoking of HTML or other calls, and also the loading of other assets such as CSS, so you can identify where the potential bottlenecks are. But we need more than that, because we want the underlying telemetry to connect to the back-end service. And this is important for me personally, because a number of times I had to deal with an issue only to find out that the issue was actually in the back-end service. That happened surprisingly often back when I was an engineer. So it's important to have both sides of the coin.

For that, I'm going to recommend using OpenTelemetry, so you can use that open standard and the OpenTelemetry protocol with any compliant vendor observability platform. So if we take Node.js as an example, keeping with the JavaScript ecosystem, what you can see is that I can actually auto-instrument (I have the manual approach coming up). What you need to do is specify the optional attributes, such as your service name, under the OTEL_RESOURCE_ATTRIBUTES environment variable. You've then got the endpoint: which deployment, or in fact which particular OpenTelemetry-compliant observability platform, you're sending to. I then need to specify my bearer token in the headers for authorization, which is a step I didn't need with the RUM agent. And then I've got to specify the exporter protocols, which is OTLP, and I have to specify that for traces, metrics, and logs; just a warning that logs support is currently under development, so you might want to consider that carefully, but it is coming in the future. We should also talk about node resource detectors, which is the one on line number seven. This is if you want to be very specific about which metrics you're going to send, because it might be that you're thinking about storage costs and you don't want to send everything, which by default it does if you don't specify anything. And then, to auto-instrument, use NODE_OPTIONS to register OpenTelemetry via the require option, which you can obviously add on to the end of the run command if you so choose, or you can use NODE_OPTIONS as I've got here. That will attach OpenTelemetry to the running process and collect all the basic telemetry data for you without you having to change the code. However, you might have situations where you need to add custom spans or transactions, or additional metadata that you want to see coming through.
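A sketch of that environment-variable setup for Node.js auto-instrumentation (the endpoint, token, and entry point are placeholders; exact values depend on your deployment):

```shell
# Placeholder endpoint and token: substitute your own OTLP-compatible deployment.
export OTEL_RESOURCE_ATTRIBUTES="service.name=order-service,deployment.environment=production"
export OTEL_EXPORTER_OTLP_ENDPOINT="https://my-deployment.otlp.example.com"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <your-token>"
export OTEL_TRACES_EXPORTER="otlp"
export OTEL_METRICS_EXPORTER="otlp"
export OTEL_LOGS_EXPORTER="otlp"   # logs support was still maturing at the time of the talk

# Optionally limit which resource detectors run, e.g. to control data volume:
export OTEL_NODE_RESOURCE_DETECTORS="env,host,os"

# Attach auto-instrumentation to the process without code changes:
NODE_OPTIONS="--require @opentelemetry/auto-instrumentations-node/register" node server.js
```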
In that particular instance, you can use the manual approach, which means you need to initialize the tracer provider and those exporter formats manually, and then you can add in additional transactions using the API from OpenTelemetry. That means we'll be able to see not just where our individual back-end service call is, but also which service it's coming from. So if you're like me, with all sorts of microservices all over the place, you'll be able to see exactly which service a call is going to, whether it's erroring or not, and all this other useful metadata as well.

4. Synthetic Monitoring and Playwright Tests

Short description:

Synthetic monitoring involves running a script on a fixed frequency to ping an endpoint or automate user interactions on a React website. Playwright for JavaScript is recommended for end-to-end testing, along with running monitors as code using GitHub Actions. Push monitors at deployment to keep them in sync with the application. Use API key authentication for pushing monitors, and schedule their execution via Kubernetes infrastructure. Playwright tests are wrapped in the synthetics operator and can use the getByTestId shorthand to decouple styling from testing logic. Fill in credentials and check for an enabled submit button in the test.

So that's how we use RUM and OTel together. But I talked about synthetic monitoring as well, and you might be wondering what that is. Synthetic monitoring is when we have a script that runs on a fixed frequency, maybe every 10 minutes or an hour, that is going to either ping a dedicated endpoint in our application ecosystem, or automate user interactions such as clicks, text entry, and other behaviors against our React website, in order to make sure that it's alive and well. The hope is that you can catch issues, and be alerted to them, before a user encounters them.

So I'm going to use Playwright for JavaScript, which is an end-to-end testing framework maintained by Microsoft that is used within the Elastic ecosystem. If you're using a different provider, Selenium is often another alternative, so make sure you look at which automation framework is associated with your provider to allow you to do this. And then you also need to run it as part of your CI. We want to make sure we're running the monitor as code at earlier stages to potentially catch defects, and also use it as a test, which is why we're going to use GitHub Actions as well. But first, let me explain how these pieces fit together in a little bit more detail before we dive into the code.

So the first thing we need to think about is having JavaScript or TypeScript definitions that have our Playwright test wrapped in the synthetics operator. From there, it will be run as an end-to-end test in GitHub Actions under peer review, or during local development, where you've got a little runner you can start up that allows you to test as you go along and undertake test-driven development while you're building out the feature. Then, when it comes to deployment time, you push your monitors at the same point to make sure that your application and your monitors are in sync; the last thing you want is failing monitors because they're not taking into account changes you've made. You would use API key authentication to push them to where they're running. The monitors then run on a fixed schedule, scheduled via the underlying Kubernetes infrastructure, running against the production web app and storing the telemetry results of each execution in Elasticsearch.

Here's a test. This is a Playwright test. For anyone who's not familiar with Playwright, these are the key actions I've got here: as you can see on line 4, I'm saying I'm going to go to the login page. Then, on lines 6 and 7, I'm pulling out my login form and checking it's present. I'm not using CSS locators here on purpose; I'm using the getByTestId shorthand, because this means it will tie to the data-testid attribute on the HTML element on the page. That way, you decouple your styling from your test locators and logic. In my experience, if you use CSS selectors and then someone starts changing styles, perhaps to change the look or appearance, then depending on the selectors you choose, you end up with flaky tests that break just because you moved an element under another child. So try and use test IDs, please. From there, we need to use the submit button. We're checking again, seeing if it's disabled, because I don't want a user to be able to submit empty credentials in a manual flow. From there, I can add in the credentials using fill. Now, these are dummy values on purpose; please don't use legitimate production credentials in this case. I'm going to fill in the appropriate input boxes with the username and password values. Then, going to line 25, I'm checking my submit button is now enabled.
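A sketch of what such a test might look like (the URL and data-testid values are assumptions, not taken from the talk's slides, so its line numbers won't match the ones mentioned above):

```typescript
import { test, expect } from '@playwright/test'

// Hypothetical URL and data-testid values; adjust to your own application.
test('user can log in and reach the order page', async ({ page }) => {
  await page.goto('http://localhost:3000/login')

  // Locate by data-testid so styling changes do not break the test.
  await expect(page.getByTestId('login-form')).toBeVisible()

  // Submit should be disabled while the credentials are empty.
  const submit = page.getByTestId('login-submit')
  await expect(submit).toBeDisabled()

  // Dummy credentials on purpose; never use real production credentials.
  await page.getByTestId('username-input').fill('demo-user')
  await page.getByTestId('password-input').fill('not-a-real-password')

  await expect(submit).toBeEnabled()
  await submit.click()
  await expect(page).toHaveURL(/\/order/)
})
```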

5. Converting Test into Monitor with Configuration

Short description:

To convert the test into a monitor, we use configuration and the init wizard to generate a scaffolding project. The environment configuration allows for specific parameters, including overriding the URL. Playwright options provide additional configuration, such as device emulation. The default monitor configuration sets the frequency and infrastructure. Splitting the test into a journey of logical steps allows for better error identification. The code remains unchanged, and it can be run locally or within a CI pipeline.

And then when I click it, I'm making sure it navigates to the order page so we can start ordering items from my menu. So, to convert this into a monitor, we need configuration, and the easiest way to get that is actually to use the init wizard, which generates a scaffolding project for you. So I've got examples to get started. I've got lightweight monitors, which are heartbeats, and I'm not talking about those today. Then I've also got the package.json, which generates the commands to run the tests and also push the monitors. And I've got a configuration that I can use to specify what's going on with my individual monitors. So this is what the environment configuration looks like. You'll see that it's pushing in the Node environment, and that allows me to specify parameters that are specific to that environment. You'll see on line 6, I'm specifying the URL as localhost, but if we skip all the way down to lines 33 to 35, you'll see that I'm actually able to override that particular value with the production URL, and this means we can use the same journey definition in two places. I've got Playwright options on line number 10. You would use these for adding in Playwright-specific configuration and options; a good example would be if you want to do device emulation. Maybe you want to emulate what's going to happen on an Android versus an iOS device, and you would add the appropriate supported devices as per the Playwright documentation. You'll then see on lines 17 to 21 the default monitor configuration, saying it's going to run every 10 minutes, and I'm using the Elastic UK infrastructure. I'm not running it in my own location, but I can do that by utilizing Fleet and installing it wherever I need. And then the project settings are basically where the Elastic Stack that I want to send this to lives. That configuration then gets passed into a journey alongside the Playwright page object.
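A sketch of what such a synthetics configuration might look like (the URLs, project id, and location are placeholder assumptions, and the line numbers will not match the slide):

```typescript
import type { SyntheticsConfig } from '@elastic/synthetics'

// Placeholder URLs, project id, and location; adjust to your own setup.
export default (env: string) => {
  const config: SyntheticsConfig = {
    params: { url: 'http://localhost:3000' }, // default for local development
    playwrightOptions: {
      // Playwright-specific options go here, e.g. device emulation.
      ignoreHTTPSErrors: false,
    },
    monitor: {
      schedule: 10,                  // run every 10 minutes
      locations: ['united_kingdom'], // Elastic-hosted location
    },
    project: {
      id: 'menu-ordering-monitors',
      url: process.env.KIBANA_URL ?? '', // where the Elastic Stack lives
      space: 'default',
    },
  }
  // Override the URL for production so the same journey runs in both places.
  if (env === 'production') {
    config.params = { url: 'https://menu.example.com' }
  }
  return config
}
```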
So this is what we need to do to change that test into a monitor: we just need to split it up into a journey of more logical steps. Firstly, this works well because it allows you to utilize patterns such as behavior-driven development, where you're specifying the user workflow in English-like language that you can apply to individual steps. It also means that, when we get to the monitor side, you have a better idea of precisely which step has failed in a longer process, because remember, these user journeys can get quite long and complicated. So I'm setting up by going to the URL that I'm passing in via parameters in the before step. From there, I'm just wrapping the login and manual actions that I was doing before in the one test and splitting them out into steps. I've not changed any of the Playwright code that I've got at all. So obviously, I'm going to run it locally, and I'm also going to run it within my CI pipeline. For running it as a test, I'll specify the Node environment as development, and I still need my credentials.
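A sketch of the resulting journey, wrapping the same Playwright calls in steps (the journey name, monitor id, and test IDs are again hypothetical):

```typescript
import { journey, step, monitor, before, expect } from '@elastic/synthetics'

journey('User can log in and reach the order page', ({ page, params }) => {
  // Per-monitor configuration can live here instead of the global config.
  monitor.use({ id: 'login-journey', schedule: 10 })

  before(async () => {
    // params.url comes from the environment-specific configuration.
    await page.goto(`${params.url}/login`)
  })

  step('Check the login form is present', async () => {
    await expect(page.getByTestId('login-form')).toBeVisible()
    await expect(page.getByTestId('login-submit')).toBeDisabled()
  })

  step('Fill in dummy credentials', async () => {
    await page.getByTestId('username-input').fill('demo-user')
    await page.getByTestId('password-input').fill('not-a-real-password')
  })

  step('Submit and land on the order page', async () => {
    await page.getByTestId('login-submit').click()
    await expect(page).toHaveURL(/\/order/)
  })
})
```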

6. Running and Monitoring Definitions in Production

Short description:

I run the same definitions on production, monitoring their performance and identifying any failing steps. By pushing the definitions when deploying the application, I can see the length of each step, identify errors, and even monitor core web vitals. Alerts can be raised for frequent failures, enabling the retrieval of valuable information. With logs, metrics, and traces, React developers can evaluate user behavior, simulate activities, and address issues proactively.

And then I'm going to run the actual command to run the same definitions as I did in local development from within the journeys directory. I'm publishing using a JUnit reporter, but there's a Buildkite one as well, in case anyone's using Buildkite.
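The commands involved might look roughly like this (the report file name and API key variable name are assumptions):

```shell
# Run the journeys as tests locally or in CI, emitting a JUnit report
# that GitHub Actions (or Buildkite) can pick up.
SYNTHETICS_JUNIT_FILE=junit-report.xml npx @elastic/synthetics journeys --reporter junit

# At deployment time, push the monitor definitions so they stay in sync
# with the application, authenticating with an API key.
npx @elastic/synthetics push --auth "$SYNTHETICS_API_KEY"
```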

Then, when it comes to production time, I need to push these. I need to make sure that I'm pushing the definitions when I'm deploying my application. When I do that, I end up with a set of monitors that will run on the schedules I've defined, either within the synthetics settings or, alternatively, on the individual monitor. And I can see which particular steps are passing and failing. I'm able to see how long these are taking, which for me, when I was writing end-to-end tests, was super important, because if they were getting longer and longer, it suggested that maybe the performance of my application needed to be investigated. But then I can also see the underlying steps as well: I can see how long each individual step takes, as you can see on my right, and exactly which error is happening and which step is failing. I've even got those Core Web Vitals as well, which we have to caveat: these are more synthetic than the real user monitoring ones, but they can certainly give you an idea that maybe an individual step is not quite in line with the Core Web Vitals targets you'd like.

And then we can start to do smarter things. We can raise alerts if these things start to fail so many times, and we can start asking an assistant such as ChatGPT what errors mean, and then try and use that to get information. So, I've talked about logs, metrics, and traces, and how they relate to us as React developers, and I've also talked about the tools that we use for that: how we use synthetic monitoring, RUM, and OpenTelemetry together, not just to evaluate user behavior and traffic and gain insights into bottlenecks and performance, but also to simulate some user activities and catch issues before users do. If you want to check out the code examples, you've got the RUM one on the left and synthetic monitoring on the right. If you have any questions, feel free to reach out. React Advanced London, it's been a pleasure. Thank you.

Check out more articles and videos

We constantly publish articles and videos that might spark your interest, skill you up, or help you build a stellar career.

Multithreaded Logging with Pino
JSNation Live 2021
19 min
Multithreaded Logging with Pino
Top Content
Today's Talk is about logging with Pino, one of the fastest loggers for Node.js. Pino's speed and performance are achieved by avoiding expensive logging and optimizing event loop processing. It offers advanced features like async mode and distributed logging. The use of Worker Threads and Threadstream allows for efficient data processing. Pino.Transport enables log processing in a worker thread with various options for log destinations. The Talk concludes with a demonstration of logging output and an invitation to reach out for job opportunities.
Observability with diagnostics_channel and AsyncLocalStorage
Node Congress 2023
21 min
Observability with diagnostics_channel and AsyncLocalStorage
Observability with Diagnostics Channel and async local storage allows for high-performance event tracking and propagation of values through calls, callbacks, and promise continuations. Tracing involves five events and separate channels for each event, capturing errors and return values. The span object in async local storage stores data about the current execution and is reported to the tracer when the end is triggered.
Observability for Microfrontends
DevOps.js Conf 2022
31 min
Observability for Microfrontends
Premium
Microfrontends follow the microservices paradigm and observability is crucial for debugging runtime production issues. Error boundaries and tracking errors help identify and resolve issues. Automation of alerts improves incident response. Observability can help minimize the time it takes to understand and resolve production issues. Catching errors from the client and implementing boundaries can be done with tools like OpenTelemetry.
How Grafana Uses React to Power the World of Observability
React Summit 2023
7 min
How Grafana Uses React to Power the World of Observability
Grafana uses React to power its open source platform, leveraging its vast ecosystem, improved performance, and community contributions. The choice of state management tool depends on the team's problem space. React Hooks have posed challenges but have also been a powerful tool for developers. The new Scenes library simplifies development and reduces the learning curve. Despite challenges, React remains a powerful tool for complex frontends, and Grafana will continue to use it.
GraphQL Observability
GraphQL Galaxy 2020
8 min
GraphQL Observability
This Talk discusses how to tool Apollo server with open tracing for observability. OpenTracing is a vendor-agnostic format that works well with distributed systems in microservices. It allows for converting GraphQL tracing data to a vendor-agnostic format and enriching information from GraphQL servers. If providers support OpenTracing, it can be easily integrated.
Creating an innovation engine with observability
Node Congress 2023
27 min
Creating an innovation engine with observability
Baseline provides observability for serverless architectures and has created an innovation engine within their team. They measure team performance using Dora metrics and the Accelerate book. Baseline emphasizes the importance of foundations, streamlined testing, and fast deployment. They practice observability-driven development and incorporate observability as part of their development lifecycle. Baseline believes in building a culture that fosters ownership and democratizes production.

Workshops on related topic

Scaling Databases For Global Serverless Applications
Node Congress 2022
83 min
Scaling Databases For Global Serverless Applications
Workshop
Ben Hagan
This workshop discusses the challenges Enterprises are facing when scaling the data tier to support multi-region deployments and serverless environments. Serverless edge functions and lightweight container orchestration enables applications and business logic to be easily deployed globally, often leaving the database as the latency and scaling bottleneck.
Join us to understand how PolyScale.ai solves these scaling challenges by intelligently caching database data at the edge, without sacrificing transactionality or consistency. Get hands-on with PolyScale for implementation, query observability, and global latency testing with edge functions.
Table of contents:
- Introduction to PolyScale.ai
- Enterprise Data Gravity
- Why data scaling is hard
- Options for scaling the data tier
- Database observability
- Cache management AI
- Hands on with PolyScale.ai