Deconstructing Distributed Tracing

Distributed tracing is a powerful technique that allows you to track the flow and timing of requests as they navigate through a system. By linking operations and requests between multiple services, distributed tracing provides valuable insights into app performance and helps identify bottlenecks. In this talk Lazar will explain the concept of Distributed Tracing by walking you through how monitoring tools build tracing solutions.

This talk was presented at React Day Berlin 2023.

FAQ

Distributed tracing was developed as a response to the limitations of traditional debugging tools like log files, which became insufficient as software architectures evolved into more complex, asynchronous, and distributed systems.

Distributed tracing works by creating a 'trace' for each request, which follows the request through the system and captures data about various operations or 'spans'. Each span records information such as start time, end time, and parent-child relationships among spans.

The key components of a distributed tracing system include traces, spans, and trace context. Traces represent the entire operation flow, spans represent individual units of work, and the trace context helps in linking spans across different services or containers.

Distributed tracing improves debugging by providing a detailed and structured view of the operations across different services and machines. It allows developers to easily identify performance issues and understand complex interactions within their applications.

In distributed tracing, spans are the fundamental units that describe specific operations, such as an HTTP request or a function call. Spans can create child spans, forming a hierarchical structure that mirrors the application's operations.

A trace context in distributed tracing is a mechanism that concatenates the trace ID and the ID of the last span into a string. This string can be transferred across different backends or processing units to continue the trace seamlessly.

As software architectures evolved into using microservices, asynchronous programming, and containerization, traditional debugging methods became inadequate. Distributed tracing emerged as a necessary tool to handle the complexity and distributed nature of modern applications.

Distributed tracing is a technique used to track the flow and timing of requests and operations within a system, particularly useful in full stack and microservice applications. It helps in understanding system performance and identifying bottlenecks.
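To make the trace context answer above a bit more concrete, here is a minimal Python sketch of concatenating a trace ID and the last span's ID into one string and reading it back on the receiving side. The hyphen separator, the function names, and the example IDs are all assumptions for illustration; real tracing tools define their own formats.

```python
def encode_trace_context(trace_id: str, span_id: str) -> str:
    """Join the trace ID and the last span's ID into one string.

    The "-" separator is an assumption, not a format taken from the talk.
    """
    return f"{trace_id}-{span_id}"


def decode_trace_context(value: str) -> tuple[str, str]:
    """Split a trace context string back into (trace_id, parent_span_id)."""
    trace_id, span_id = value.split("-", 1)
    return trace_id, span_id


# Hypothetical IDs, as a sending service might hold them.
header_value = encode_trace_context(
    "0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331"
)
# header_value is "0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331"

# The receiving service continues the same trace under the sender's last span.
trace_id, parent_span_id = decode_trace_context(header_value)
```

The string would typically travel with the request, for example as a header, so the next backend can attach its spans to the same trace.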

Lazar Nikolov
8 min
12 Dec, 2023

Video Summary and Transcription

Distributed tracing is a powerful technique for tracking requests and operations in a system, especially in full stack and microservice applications. The reinvention of distributed tracing introduces the concept of a trace and spans to capture debugging data. Enhancements include tags and a status field for better analysis, and the distribution of traces using a trace context for continued tracing.

1. Introduction to Distributed Tracing

Short description:

Distributed tracing is a powerful technique that helps track the flow and timing of requests and operations in a system. It is especially useful for full stack and microservice applications, allowing for better understanding of system performance and identification of bottlenecks. The technique has been around since the early 2000s but gained popularity in the 2010s. As libraries and frameworks evolved, so did debugging tools, from logs in Apache Server to handling multiple requests in a single process with separate threads. With advanced concurrency, frameworks like Node.js allow requests to start and finish in different threads.

Deconstructing distributed tracing. Hello, everyone. My name is Lazar Nikolov, and I am a developer advocate at Sentry. Today in my talk, we're going to talk about distributed tracing. First, I'll explain what it is. Then we're going to get into a little history of debugging tools to find out why distributed tracing exists in the first place. And then, in order to understand it better, we're going to rebuild distributed tracing from scratch, or at least the concept of it.

All right, so let's dive in. Distributed tracing is a powerful technique that allows you to track the flow and timing of requests and operations as they flow through your system. This is especially useful for full stack and for microservice applications. Distributed tracing helps you understand the performance of the system and also identify any bottlenecks. It's especially useful for debugging complex and weird bugs, like race condition bugs, that require a lot more than just a console log and a stack trace. It's not new by any means. There are white papers mentioning tracing since the early 2000s, but it got popularized during the 2010s. So to understand why it exists, we need to go back in time.

As our libraries and frameworks evolved, so did our debugging tools. For example, back in the early days of Apache Server, logs were one of the few methods for debugging. As requests arrived, Apache forked a child process for each one and handled it there. If you wanted to debug what happened during a specific request, you could just pull that process's logs and you'd see the whole operation flow. And that worked. We were happy.

Then we got basic concurrency. Think of IIS and ASP.NET. Instead of forking a process for every request, we started handling multiple requests in a single process, each in a separate thread. Logs were still a good debugging method, but to isolate a request's logs, we needed to prefix them with the thread name and then filter the log messages based on it. Not a big deal, and we made it work.

Then we got advanced concurrency. Our frameworks evolved into async, multithreaded, futures-and-promises, event-loop-based frameworks. This is Node.js. So now our request can start on one thread but finish on a different one, going through many other threads along the way.

2. Reinventing Distributed Tracing

Short description:

Prefixing logs with a unique ID for each request no longer solves the problem in a distributed system. With the rise of containerized services, backends are spread across multiple machines, making it difficult to trace operations. To address this, we reinvented distributed tracing from scratch. We introduced the concept of a trace, which follows a request and captures debugging data. Within the trace, we have spans that represent the smallest unit of work, such as an HTTP request or a function call. Spans can create child spans, allowing us to mirror the structure of our software. Each span has a unique ID and holds data like its parent ID.

Prefixing logs with the thread name doesn't really solve our problem now. We need to prefix them with something unique to the request itself, and that's what we did. We generated a unique ID for each request and prefixed our logs with it.
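As a rough illustration of that idea, here is a small Python sketch that generates an ID per request, prefixes every log line with it, and then filters on the prefix. The helper names and the in-memory list standing in for a log file are assumptions made for the example.

```python
import uuid


def new_request_id() -> str:
    """Generate an ID that is unique to one request (hypothetical helper)."""
    return uuid.uuid4().hex


def log(request_id: str, message: str, sink: list[str]) -> None:
    """Append a log line prefixed with the request ID."""
    sink.append(f"[{request_id}] {message}")


def lines_for(request_id: str, sink: list[str]) -> list[str]:
    """Isolate one request's log lines by filtering on the prefix."""
    prefix = f"[{request_id}] "
    return [line for line in sink if line.startswith(prefix)]


# Two requests interleave their log lines in one shared sink,
# no matter which thread happens to run each step.
shared_log: list[str] = []
request_a = new_request_id()
request_b = new_request_id()
log(request_a, "received GET /users", shared_log)
log(request_b, "received POST /orders", shared_log)
log(request_a, "query finished", shared_log)

request_a_lines = lines_for(request_a, shared_log)
```

This works as long as all the log lines end up in one place, which is exactly the assumption that breaks next.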

But our frameworks didn't stop evolving. About 10 years ago, Docker and AWS made way for containerized services. And now our backends don't even live on one single machine. Each container and microservice handles multiple requests and produces its own logs. Our logs are all over the place now. It's very hard to make sense of the operation flow, so we needed a better debugging tool that can trace the operations as they jump between containers and services. That's when distributed tracing became a necessary tool for debugging.

In order to understand how it works, we're going to reinvent it from scratch. Since our backends now have a very distributed nature, we need to define a vehicle for each request that will follow it around and capture debugging data along the way. Let's call that a trace. The trace will start where the operation flow starts, which can be the frontend, for example, and it's going to have a unique ID.

If we think about logs, they usually tell us what happened at a particular time. They try to mimic the structure of our code. So let's invent that now. Let's invent something that's going to describe the smallest unit of work, like an HTTP request or a function call or anything specific that our software does at a specific time. We're going to call that a span, and we're going to create one immediately when the trace starts. That's going to be our root span.
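Here is one way the trace and its root span could look in Python. This is a sketch under assumptions: the `Trace` and `Span` classes, their field names, and `start_trace` are made up for illustration and are not taken from any particular tracing library.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Span:
    """Smallest unit of work: an HTTP request, a function call, and so on."""
    description: str


@dataclass
class Trace:
    """Follows one request around and collects its spans."""
    # Unique ID for the whole trace, generated when the trace starts.
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list[Span] = field(default_factory=list)


def start_trace(description: str) -> Trace:
    """Start a trace and immediately create its root span."""
    trace = Trace()
    trace.spans.append(Span(description))
    return trace


# The operation flow starts here, for example in the frontend.
trace = start_trace("page load")
root_span = trace.spans[0]
```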

So just like logs, the spans are going to mimic the structure of our software. But since we're reinventing it, let's make it much smarter than simple messages. Since spans are the smallest unit of work, like a single function, and we know that one function can invoke another function, which in turn can invoke a third function, we're going to design our spans so they can create child spans, which can create their own child spans, and so on. Now we can really mirror the structure of our software with this. We have a span hierarchy, but we need to remember which span is a child of which span. To do that, we're going to need something to identify each span. So we will assign an ID to each span as we create it. We also need to save the parent span ID. So let's create a space inside each span so it can hold data like its ID and its parent ID.
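The span hierarchy described above can be sketched in a few lines of Python. The `Span` class, the `start_child` method, and the 16-character ID length are assumptions chosen for the example; the only parts that come from the talk are that each span has its own ID and remembers its parent's ID.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


def new_span_id() -> str:
    """Generate a span ID (the 16-character length is an arbitrary choice)."""
    return uuid.uuid4().hex[:16]


@dataclass
class Span:
    description: str
    # None marks the root span, which has no parent.
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=new_span_id)
    children: list["Span"] = field(default_factory=list)

    def start_child(self, description: str) -> "Span":
        """Create a child span that remembers this span as its parent."""
        child = Span(description, parent_id=self.span_id)
        self.children.append(child)
        return child


# One function invokes another, which invokes a third.
root = Span("handle_request")
load_user = root.start_child("load_user")
parse_row = load_user.start_child("parse_row")
```

Storing `parent_id` on the child, rather than relying only on the `children` list, means the hierarchy can still be rebuilt later from a flat list of spans.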