1. Introduction to Observability and OpenTelemetry
Hi, I'm Jan. Today I'm going to talk about the difference between monitoring and observability, and dive deep into OpenTelemetry and its distributed tracing. Observability is the ability to understand the system's internal state by analyzing generated data. OpenTelemetry focuses on generating and processing data, while storing and analyzing are handled by vendors. OpenTelemetry also focuses on three major telemetry data: logs, metrics, and traces. Logs consist of field names like timestamp, trace ID, span ID, and body. Metrics show CPU usage for a specific time frame.
Hi, I'm Jan. Have you ever been in the situation where you got a bug ticket stating that a customer saw an empty screen, and you had to debug it, but there was no way of finding the root cause? Well, this shouldn't be the case. And this is why I'll show you how to enhance the React ecosystem with observability and how OpenTelemetry will help you on that journey.
So today I'm going to talk about the difference between monitoring and observability. I will also dive deep into OpenTelemetry and its distributed tracing, and then we'll show you a quick demo of how things work together.
So what is the difference between monitoring and observability? Well, monitoring is the process of collecting, analyzing and using information to track progress, to reach goals or to guide management decisions. So it's really stating what is happening. Just imagine Google Analytics. It shows you how many users are on this page or how many users dropped off on another page. And based on this information, you can guide some new features. Right. So you really know what is happening. For example, if there is a bug, if there is an error showing up, it shows you what is happening, but it doesn't really tell you the root cause. So it doesn't really help you a lot.
And this is why observability is very important. Observability is the ability to understand the system's internal state by just analyzing the data it generates, such as logs, metrics and traces. So it shows you why the system behaves the way it does. And this is very important. Traditionally, there are four major problems: generating the data, processing it, storing it and analyzing it. Usually this comes as one end-to-end solution. So if you want to change the storage or analysis mechanism, you have to change the entire process. OpenTelemetry knows that, and it actually only focuses on the first two problems, generating and processing the data. Storing and analyzing rest on the vendors' shoulders. OpenTelemetry also focuses on three major kinds of telemetry data: logs, metrics and traces.
Focusing on logs right now, these were the most important ones and actually the hardest ones to implement, because there were so many different implementations across all languages, and OpenTelemetry is making one consistent semantics for all of the languages. When we talk about logs in OpenTelemetry, we actually mean log records, and one log record consists of fields like the timestamp, a trace ID, a span ID and the body, which is the most important part. I will talk more about trace IDs and span IDs in the trace section. The metrics are a little different. Metrics basically show you how much, for example, CPU usage there was in a specific time frame. Metrics in OpenTelemetry are built from meters and instruments, where one instrument records a data point at a specific point in time.
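As a rough sketch of the log record described above, using only the four fields named in the talk: the type name `LogRecordSketch` and the helper `makeLogRecord` are made up for illustration and are not the OpenTelemetry SDK's API.

```typescript
// Hypothetical shape of one log record, limited to the fields named in the talk.
interface LogRecordSketch {
  timestamp: number; // milliseconds since the Unix epoch
  traceId?: string;  // 32 hex characters, set when the log belongs to a trace
  spanId?: string;   // 16 hex characters, set when the log belongs to a span
  body: string;      // the actual log message
}

// Illustrative helper (not an SDK function): builds a record and ties it
// to a trace and span when a context is passed in.
function makeLogRecord(
  body: string,
  context?: { traceId: string; spanId: string },
  now: number = Date.now(),
): LogRecordSketch {
  return { timestamp: now, body, ...context };
}

const record = makeLogRecord(
  "cart could not be loaded",
  { traceId: "0af7651916cd43dd8448eb211c80319c", spanId: "b7ad6b7169203331" },
  1700000000000,
);
console.log(JSON.stringify(record));
```

Because the trace ID and span ID travel with the record, a backend can later show this log line next to the span it was written in.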
2. Understanding Traces and Span IDs
So how much CPU does it take right now? A trace is the user journey of a specific event, like an API call. Each trace ID consists of multiple span IDs, which represent different function calls. Span events are logs that occur at specific times on the span ID.
So how much CPU does it take right now? A meter is something like a grouping, so it groups multiple instruments, and you can, of course, have multiple meters. The traces are, again, a little different. So what is a trace? A trace is basically the user journey of one specific event. For example, in this case, an API call. If somebody calls an API, you get a specific trace ID, which is only made for this specific event. And each trace consists of multiple spans, each with its own span ID. One span could be a database call or another function call. So this could be anything. Each span or trace can also hold key attributes, which you can define yourself. On top of the spans, there are also span events. A span event is something like a log, which just happens at a specific point in time on the span. I will show you later in the demo what a span event could look like.
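The relationship between traces, spans, attributes and span events can be sketched with a simplified data model. The types and the `groupByTrace` helper below are hypothetical and much smaller than the real OpenTelemetry span model; the span names and IDs are invented sample data.

```typescript
// Hypothetical, simplified model of a trace; not the OpenTelemetry SDK types.
interface SpanEventSketch {
  name: string;
  timeMs: number; // the point in time on the span at which the event happened
}

interface SpanSketch {
  traceId: string;       // shared by every span of the same event
  spanId: string;        // unique per operation (API handler, database call, ...)
  parentSpanId?: string; // missing on the root span
  name: string;
  startMs: number;
  endMs: number;
  attributes: Record<string, string>;
  events: SpanEventSketch[];
}

// Groups spans by trace ID, which is roughly how a backend reassembles one trace.
function groupByTrace(spans: SpanSketch[]): Map<string, SpanSketch[]> {
  const traces = new Map<string, SpanSketch[]>();
  for (const span of spans) {
    const list = traces.get(span.traceId) ?? [];
    list.push(span);
    traces.set(span.traceId, list);
  }
  return traces;
}

const spans: SpanSketch[] = [
  { traceId: "t1", spanId: "a", name: "GET /api/products", startMs: 0, endMs: 120,
    attributes: { "http.method": "GET" }, events: [] },
  { traceId: "t1", spanId: "b", parentSpanId: "a", name: "db.query", startMs: 10, endMs: 90,
    attributes: {}, events: [{ name: "cache miss", timeMs: 12 }] },
  { traceId: "t2", spanId: "c", name: "GET /api/cart", startMs: 0, endMs: 40,
    attributes: {}, events: [] },
];

const traces = groupByTrace(spans);
console.log(traces.get("t1")?.map((s) => s.name));
```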
3. Distributed Tracing in the Browser
A span event can point to an error, aiding in debugging. OpenTelemetry in the browser is experimental, with ongoing discussions on specification. Each user interaction is treated as a separate trace, facilitating debugging. Traces can be combined using key attributes, such as service instance ID.
And now it gets interesting, because a span event can not just be a log, but it can also point to an error. So if something happens and you have a span event on a span saying, this is an error, you immediately see through this trace that the API did something wrong. And based on what has been called, you know what is happening in the system. And this is really helpful in debugging.
So how does OpenTelemetry work in the browser? Well, first off, OpenTelemetry is not really specified yet for the browser. So it's very experimental. And right now there are a lot of open PRs from the OpenTelemetry groups on how to specify that. So this is how frontend tracing works right now. Usually you have a user journey, a user session, which can take three hours or four hours or even longer. And you don't want to have one trace ID for the entire journey. This would be just too hard to debug.
What you do instead is that one user interaction is one trace. So for example, if a person reloads the page, you have one trace, which is a document load loading different files like HTML files, CSS files and so on. Over the journey, you maybe have some background polling, which creates another trace. And of course, some user interaction. So a user click, which then also goes to different API endpoints. Sometimes you still want to combine those traces into one, and this is not done with a trace ID, but with key attributes, as I mentioned before, in this case with the service instance ID. This would be a unique identifier, which you just add to each of the different traces.
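Tying separate traces back to one session through a shared attribute could look like the following sketch. The `TraceSummary` type and `tracesForSession` helper are hypothetical, and the attribute key `service.instance.id` is an assumption chosen to match the "service instance ID" mentioned in the talk.

```typescript
// Hypothetical trace summary; only the fields this sketch needs.
interface TraceSummary {
  traceId: string;
  rootName: string;
  attributes: Record<string, string>;
}

// Attribute key that ties traces together (assumed for this sketch).
const SESSION_KEY = "service.instance.id";

// Returns every trace that carries the given session identifier.
function tracesForSession(all: TraceSummary[], sessionId: string): TraceSummary[] {
  return all.filter((t) => t.attributes[SESSION_KEY] === sessionId);
}

const sessionA = "session-a"; // in practice a UUID generated once per page load
const allTraces: TraceSummary[] = [
  { traceId: "t1", rootName: "documentLoad", attributes: { [SESSION_KEY]: sessionA } },
  { traceId: "t2", rootName: "background poll", attributes: { [SESSION_KEY]: sessionA } },
  { traceId: "t3", rootName: "click", attributes: { [SESSION_KEY]: "session-b" } },
];

console.log(tracesForSession(allTraces, sessionA).map((t) => t.traceId));
```

The traces keep their own trace IDs; only the attribute is shared, so each interaction stays small enough to debug while the whole journey can still be reconstructed.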
4. React Complexity and Server Components
You can add the service instance ID to each trace. React complexity increases with server components, especially during page loads and transitions. Max discovers the complexities of React with server components, involving Redis and Nginx.
So you can say, hey, this is the service instance ID, and this is the service instance ID. And you can add that to each of these traces. I will show you later in the demo what I mean by that.
So let's talk a little bit about the complexity in React and the web. Specifically, the complexity in React got a little bit more interesting with server components, because in the initial page load, a server component doesn't look like a server component to the user, because it's rendered right away into the HTML. So the user would now just change to a different page, for example, the layouts page.
And this layouts page also has server components. This one is not fetched as HTML. It's actually fetched via the fetch API, as a POST request, which makes everything a little bit more interesting to debug. So how does the complexity now look on the web? So there's Max, and Max is quite new to React, right? And soon Max finds out that React is actually a little bit more complicated with server components, since in server components you can directly go to Redis and get some keys, or you can call another endpoint, which is an Nginx, which in turn has a Python service behind it, which also has access to the Redis.
5. Distributed Tracing and Next.js Demo
Max is still happy. Distributed tracing connects traces for different services using context propagation. W3C trace context allows combining trace IDs of services. The demo showcases Next.js app playground and adding OpenTelemetry to services.
Anyways, Max is a very happy engineer and wants to learn things, so no worries, Max is still happy.
So let's talk a little bit about distributed tracing. What is distributed tracing? This is basically connecting a trace for different services. So you have just a React service and an Nginx service and you want to combine them. And this works with context propagation. One context would be one service. So React service would be one context. And so is the Nginx service on the other side. This is also a context.
In between, you need to propagate the trace, which is called context propagation, and that is what makes distributed tracing work. And as you can see, you see one trace ID over the entire span. So how does it work that one trace ID includes both services? Well, this is done with the W3C Trace Context header. In this case, this is the traceparent header, which is defined by W3C Trace Context. There are also different formats like B3. So with W3C Trace Context, we have the traceparent, which consists of four components: the first one, the 00, which is the version; the trace ID, which connects both of the services; the span ID, which is the last span ID of the React service and becomes the parent of the first span of the next service, so this is the connection between A and B; and the last component, which basically says whether it's sampled or not. You can read up on the entire spec at W3C Trace Context.
So let's dive a little into the demo. I prepared the Next.js app playground, which is open source, and you can use it. You can check out the repository at the end of the presentation. I added one commit, and this one commit actually adds OpenTelemetry to our services. You can check out how OpenTelemetry was introduced there. I won't talk about the Nginx, although I have an Nginx in here. I will only talk about Next.js and the browser. So instrumenting OpenTelemetry into JavaScript is a little bit more complicated, and Next.js knows that. So they wrote a little helper for us. They wrote the helper registerOTel, and they give us some options. For example, the service name. The service name is basically the intended name for the service.
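The four components of the traceparent value can be pulled apart with a small parser. This is a minimal sketch of the version 00 format, not the propagator that OpenTelemetry ships; the function name `parseTraceParent` and the `TraceParent` type are made up for illustration, and the sample value is the example from the W3C Trace Context specification.

```typescript
interface TraceParent {
  version: string;      // "00" for the current version of the spec
  traceId: string;      // 32 hex characters, shared by all services in the trace
  parentSpanId: string; // 16 hex characters, the caller's last span ID
  sampled: boolean;     // bit 0 of the trace flags
}

// Parses a traceparent value with the four version-00 fields.
// Returns null when the value is malformed.
function parseTraceParent(header: string): TraceParent | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (match === null) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  // The spec forbids version "ff" and all-zero IDs.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  const sampled = (parseInt(flags, 16) & 1) === 1;
  return { version, traceId, parentSpanId, sampled };
}

const parsed = parseTraceParent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01");
console.log(parsed);
```

The receiving service keeps the trace ID and uses the parent span ID as the parent of its own first span, which is the connection between service A and service B.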
6. Adding OpenTelemetry to Next.js
The introduction or the identifier on how our context will be named later when we analyze the traces. Browser OpenTelemetry is experimental. Next.js does not offer browser OpenTelemetry option. Adding OpenTelemetry requires a provider, trace processing, and an exporter.
It's the identifier for how our context will be named later when we analyze the traces. Also, by the way, it's still experimental. This is why it sits behind an experimental instrumentation hook. Also, since browser OpenTelemetry is very experimental, Next.js is not offering that option. But I added it myself, and here you can see the bare bones of how to add OpenTelemetry. In this case, I'll give you a small introduction to the file itself. So here you have the same thing as we had before with the Next app, just with Next app browser. So in the browser, this would be our own encapsulated service. And we also have the user journey session ID. So in this case, every time you reload, you get a dedicated UUID just for the entire session. And every trace gets exactly this UUID as well. On top of that, just so you know, for OpenTelemetry we need a provider. We need some kind of processing for traces, or spans, actually. And then we need an exporter. So where does it go? There's the OTLP trace exporter, which is the OpenTelemetry Protocol trace exporter. Or, for example, different ones like the console exporter, which just prints the spans out directly in the console.
7. Analyzing Application and Browser Traces
The application has various things going on, such as the layout page, HTML, JavaScript, SVGs, and pre-rendering for server components. Traces are sent to the OpenTelemetry collector, which is then displayed in the Jaeger UI. Context propagation occurs between the backend and frontend through trace parent ingestion in the frontend context. Browser frontend traces show fetched resources, with the longest being the browser instrumentation, which is still experimental.
So let's go into the application. When I reload now, you can see that a lot of things are going on. So there is the layout page, the HTML, which is really taking a long time, 400 milliseconds. And it loads, like every other page, a lot of JavaScript, some SVGs and so on. Next.js is also doing some pre-rendering here for the server components, for some caching mechanisms. And most importantly, we send the traces directly to our OpenTelemetry collector. I will talk a little bit about the OpenTelemetry collector at the end of the presentation.
So these traces now got sent to our Jaeger UI via the OpenTelemetry collector. And Jaeger UI basically just shows you the traces, right? So there's the Next app browser, for example. There's the Next app and the Nginx one. I'm going to remove the text for now. We can ignore the Jaeger all-in-one. So there are three different ones which we want to focus on. And for now, we want to dive deep into the Next app browser traces. So if we now just show all the traces, we can see that a lot of things are going on. So there's the very first one, our main call, and then a couple of other calls. So if we dive deeper into one of these, we can see in the tags that basically just this URL got fetched.
If you want to dive deeper into the main reload, which is this one, we can already see that we hit our Nginx, which then propagated the context correctly to the Next app. And the Next app correctly moved the context further to the browser. So how does the context propagation work between backend and frontend? Well, here you find the propagators in the browser implementation. And we also define the W3C Trace Context propagator, which basically takes the meta tag from the head, which is the traceparent, and ingests it into the frontend context. So this is done in the layout.tsx. In the layout.tsx, we can add headers or heads and everything to the base layout. And the traceparent meta tag is also added there, where we have all these four components. The first one, the version, then the trace ID, the last span ID from the server, because this is rendered on the server, right? And the 01, which basically says whether it's sampled or not. All right, and now let's take a closer look at the browser frontend traces. And here we can already see that there were some resources fetched. The longest one, which is this one, would be the instrumentation for the browser. Currently, as I said, the browser instrumentation is not yet stable. It's still in an experimental phase.
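On the server side, producing that meta tag amounts to joining the four components with dashes. The sketch below builds the tag as a plain string; the function names `formatTraceParent` and `traceParentMetaTag` are hypothetical, and how the demo's layout.tsx obtains the current trace ID and span ID is not shown here. The sample IDs are the ones from the W3C Trace Context specification.

```typescript
// Builds a version-00 traceparent value from its parts.
function formatTraceParent(traceId: string, spanId: string, sampled: boolean): string {
  if (!/^[0-9a-f]{32}$/.test(traceId)) {
    throw new Error("traceId must be 32 lowercase hex characters");
  }
  if (!/^[0-9a-f]{16}$/.test(spanId)) {
    throw new Error("spanId must be 16 lowercase hex characters");
  }
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

// Renders the meta tag as a string. In a layout component the same value
// would go into the content attribute of a <meta name="traceparent"> element.
function traceParentMetaTag(traceId: string, spanId: string, sampled: boolean): string {
  return `<meta name="traceparent" content="${formatTraceParent(traceId, spanId, sampled)}">`;
}

const tag = traceParentMetaTag(
  "0af7651916cd43dd8448eb211c80319c", // trace ID of the server-side trace
  "b7ad6b7169203331",                 // last span ID on the server
  true,
);
console.log(tag);
```

The browser then reads this tag on document load and continues the server's trace instead of starting a new one.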
8. Analyzing Fetch Calls and Errors
OpenTelemetry was designed for Node.js but lacks some tree shaking functionalities. Page load time is affected by fetch calls, and optimizing with OpenTelemetry can help. Clicking on electronics caused a lag due to fetch calls. Multiple users generate multiple traces, and user journeys can be analyzed using session instance IDs. Errors can be triggered in the app playground.
And it was actually designed for Node.js. So all the tree shaking functionalities are not really there yet. And this is an ongoing process to make it even faster and even smaller.
So right away, you can see why this entire page load took almost 400 milliseconds. It's because of one of these fetch calls. So this fetch call actually made the entire thing take longer. So if you want to make your loading times faster, with OpenTelemetry you see where you can optimize. If you had a database call in here, you would also see another span.
So what would happen now if you click on electronics? You can see there was a little lag in there. So it took a little while. If you check the electronics part, you can see it also took 450 milliseconds. So if I now check the Jaeger UI and check all my traces, and I also remove all the noise in here with a minimum duration of 400 milliseconds, you can already see that there's one call with 417 milliseconds. If I check this one out, inside the tags you can see that there's the electronics server component. So why did this now take longer than expected? Well, we already see that there are two fetch calls right after each other. So one optimization could already be that we just parallelize them. Or we rely on Next.js, and after the second load everything would be cached anyway.
So what would happen if there are 100 users? So if I reload now a couple of times, three, four times, then I don't have 10 traces anymore, but many more. So when I go back to the Jaeger UI and find the traces, without the min duration, of course, then I can see that there are 20 traces, which are multiple sessions. So it could be multiple users. For example, if I have one person with a huge page load, I can check them out by just going to the Next App Browser. And in the process, we basically have additional tags on it, which is the session instance ID, which is the shared one for each session. And I can also filter on tags. So I just put it in here, and I only have 12 traces. Now I can see the user journey for this specific person, which is quite neat if you want to analyze it and dig deeper into it. So what about errors? Because before I mentioned span events, I prepared something here. So in the app playground, you can also trigger some errors. And in this case, I modified it so that it returns a 500, because sometimes that can happen. This would now be the white page.
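Parallelizing two fetch calls that run right after each other can be sketched with `Promise.all`. The example uses a stand-in `fakeFetch` with an artificial 50 ms delay instead of real network calls, and the product and review data are invented; it only helps when the second call does not need the result of the first.

```typescript
// Stand-in for a network call: resolves with `value` after `ms` milliseconds.
function fakeFetch<T>(value: T, ms: number): Promise<T> {
  return new Promise((resolve) => setTimeout(() => resolve(value), ms));
}

// Sequential: the second call starts only after the first one has finished.
async function loadSequential(): Promise<[string[], number]> {
  const products = await fakeFetch(["phone", "laptop"], 50);
  const reviewCount = await fakeFetch(12, 50);
  return [products, reviewCount];
}

// Parallel: both calls start immediately, so the total time is roughly
// that of the slower call instead of the sum of both.
async function loadParallel(): Promise<[string[], number]> {
  return Promise.all([fakeFetch(["phone", "laptop"], 50), fakeFetch(12, 50)]);
}

// Measures how long one of the loaders takes and prints the result.
async function measure(label: string, load: () => Promise<unknown>): Promise<number> {
  const start = Date.now();
  await load();
  const elapsed = Date.now() - start;
  console.log(`${label}: ${elapsed} ms`);
  return elapsed;
}

async function main(): Promise<void> {
  await measure("sequential", loadSequential);
  await measure("parallel", loadParallel);
}

main();
```

In a trace, the sequential version shows up as two spans placed one after the other, while the parallel version shows two overlapping spans.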
9. Analyzing Errors and Error Locations
Next.js error boundaries display the error component instead of a blank page. Two errors were detected in the traces, involving components and backend logs. Source maps in Grafana or Dynatrace can provide detailed stack traces and error locations.
Luckily enough, again, Next.js has some error boundaries and shows the error component instead of a blank page. But still we got a 500. In the traces themselves, if we find the traces again, with another session, of course, we can immediately see that there were two errors. So if we analyze it even further, we can see that there were two components which had an error. The Nginx, which technically only returned a 500. So basically, it was not the fault of Nginx, but it went further to the backend, and the backend actually added one log. In this case, it's called logs, but it's technically a span event. And here we already see that there is an exception going on. So an event, which is an exception, a message, "this would not happen usually", and the stack trace. Unfortunately, Jaeger doesn't provide source map uploads, but vendors like Grafana or Dynatrace do. And then you can just upload your source maps, and then you would immediately see all the stack traces and what got caught and everything. So right away, without knowing the code, you already see where the problem is, and you can pinpoint the exact error location.
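The difference between a span that only passes on an error status and a span that actually recorded the exception can be sketched as follows. The types and helpers are hypothetical; the event name `exception` and the attribute keys `exception.type`, `exception.message` and `exception.stacktrace` follow the OpenTelemetry semantic conventions for exceptions.

```typescript
// Hypothetical, simplified span model; not the OpenTelemetry SDK types.
interface SpanEventSketch {
  name: string;
  attributes: Record<string, string>;
}

interface SpanSketch {
  service: string;
  name: string;
  status: "ok" | "error";
  events: SpanEventSketch[];
}

// Converts a caught error into an "exception" span event.
function exceptionEvent(error: Error): SpanEventSketch {
  return {
    name: "exception",
    attributes: {
      "exception.type": error.name,
      "exception.message": error.message,
      "exception.stacktrace": error.stack ?? "",
    },
  };
}

// Returns the spans that recorded an exception themselves, as opposed to
// spans that only carry an error status from a downstream call.
function spansWithExceptions(spans: SpanSketch[]): SpanSketch[] {
  return spans.filter((s) => s.events.some((e) => e.name === "exception"));
}

const errorTrace: SpanSketch[] = [
  { service: "nginx", name: "GET /api", status: "error", events: [] },
  { service: "backend", name: "handler", status: "error",
    events: [exceptionEvent(new Error("this would not happen usually"))] },
];

console.log(spansWithExceptions(errorTrace).map((s) => s.service));
```

Both spans are marked as errors, but only the backend span carries the exception event, which matches the observation that Nginx merely returned the 500.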
10. Traces, Storage, and Open Telemetry Collector
Adding traces everywhere may not be practical due to storage limitations. Grafana offers free storage with monthly limits, but additional gigabytes come with a cost. Mitigate this by using the OpenTelemetry collector, which allows for processing and filtering data before exporting it to a cheap host or Loki. OpenTelemetry automatic instrumentation is sufficient, and context propagation is crucial for understanding service calls and reducing costs.
And this was actually everything for my demo. So let's get back to the presentation. So now you would think, all right, let's add traces everywhere, because then we know what is going on, right? Nope, you shouldn't. I mean, technically you should, but the problem is that the data needs to be stored somewhere.
There is, for example, Grafana, which is technically free forever, but you have to look closer because there are monthly limits. For example, for logs, traces and profiles, you have 50 gigabytes each with a 14-day retention. If you have multiple users and you trace everything, then you might have more than just 50 gigabytes. Well, then I just upgrade to Pro and pay as you go. I still have the 50 gigabytes of traces for free, but every extra gigabyte now costs 50 cents, which is not a lot by the looks of it. But if you have a lot of gigabytes, this could add up a lot on your monthly bill. So here you can see the 50 cents per gigabyte.
So how can you mitigate this? There's one big thing, which is called the OpenTelemetry collector, which is also part of the example I showed you before. So you have a couple of receivers, and one receiver could take data from multiple instances. So my Next.js application, for example, would feed one of these receivers, because we send data directly to the collector. It could also be a file or something different. In the middle, you process data. So you can filter out things, you can move it to a different location and stuff like that. And the exporter on the right side is kind of important, because you can either send the one log line which contains an error to Loki. Loki is basically the part for logs in Grafana. Or you just put it into a file, which lives on a very cheap host where you have a lot of disk space left, right? And this is how you save money. You can also add sampling in OpenTelemetry. This is also added in one of my examples. So the key takeaways are that OpenTelemetry automatic instrumentation is actually doing the heavy lifting. You can still do manual instrumentation, but the automatic instrumentation is already good enough. Also, browser OpenTelemetry is not yet specified and is very experimental. And also, context propagation is really important, because you need to have the bigger picture. You need to know which service is called. And of course, the collector can reduce costs. So please use this one, because it's essential. This was my talk. Thanks a lot. I'm working at DevOpsCycle as a co-founder and I'm also working at Dynatrace as a software engineer. Thank you.
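The pricing arithmetic above can be written down in a few lines. The two constants are the numbers quoted in the talk (50 GB included, 50 cents per extra gigabyte); the function name and the sample volumes are made up for illustration, and real pricing may differ.

```typescript
// Numbers quoted in the talk; actual vendor pricing may differ.
const INCLUDED_GB = 50;
const PRICE_PER_EXTRA_GB = 0.5;

// Monthly cost for trace storage: only gigabytes beyond the included volume are billed.
function monthlyTraceCost(ingestedGb: number): number {
  const extraGb = Math.max(0, ingestedGb - INCLUDED_GB);
  return extraGb * PRICE_PER_EXTRA_GB;
}

for (const gb of [30, 50, 400, 2000]) {
  console.log(`${gb} GB -> ${monthlyTraceCost(gb).toFixed(2)}`);
}
```

At 400 GB a month, 350 GB are billed, which comes to 175.00; at 2000 GB it is 975.00, so tracing everything adds up quickly.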
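The collector pipeline described above (receivers, processing in the middle, exporters on the right) can be modeled in memory to show where the savings come from. This is a conceptual sketch only and not the collector's configuration format; the types, the health check filter and the two stand-in exporters are all assumptions made for illustration.

```typescript
// Hypothetical in-memory model of a collector pipeline: receive -> process -> export.
interface LogLine {
  severity: "info" | "error";
  body: string;
}

type Processor = (batch: LogLine[]) => LogLine[];
type Exporter = (batch: LogLine[]) => void;

// Processor: drops noisy lines (here: health checks) before anything is exported.
const dropHealthChecks: Processor = (batch) =>
  batch.filter((line) => !line.body.includes("/healthz"));

// Stand-ins for real exporters; they only collect what they were given.
const sentToLoki: LogLine[] = [];
const writtenToFile: LogLine[] = [];
const lokiExporter: Exporter = (batch) => { sentToLoki.push(...batch); };
const fileExporter: Exporter = (batch) => { writtenToFile.push(...batch); };

// Runs the processors in order, then routes error lines to Loki
// and everything else to the cheap file storage.
function runPipeline(received: LogLine[], processors: Processor[]): void {
  const processed = processors.reduce((batch, p) => p(batch), received);
  lokiExporter(processed.filter((l) => l.severity === "error"));
  fileExporter(processed.filter((l) => l.severity !== "error"));
}

runPipeline(
  [
    { severity: "info", body: "GET /healthz 200" },
    { severity: "info", body: "GET /electronics 200" },
    { severity: "error", body: "GET /api 500" },
  ],
  [dropHealthChecks],
);

console.log(sentToLoki.length, writtenToFile.length);
```

Only the error line reaches the paid backend, the health check is dropped entirely, and the remaining line goes to cheap storage.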