Serverless Observability: Where SLOs Meet Transforms

Rate this content
Bookmark

This talk explores the use case of SLOs and transforms whilst migration to the serverless ecosystem. The talk starts by presenting the reasons why SLOs are important in the SRE/DevOps framework. It then analyses specific use cases of SLOs, the tools used for measuring the efficiency of SLOs and presents the main bottlenecks encountered when defining and adhering to SLOs in the process of migrating to a serverless ecosystem, especially when dealing with burn rates and transforms.

By the end of the talk, the audience will be able to subscribe to the following takeaways:

SLOs are important in a SRE/DevOps framework and there are numerous advantages for implementing them as long as they adhere to a process of continuous improvement.

Adopting the right tools and metrics are paramount in the implementation of SLOs.

Migrating to serverless adds pressure to the observability system and it is possible that burn rates and transforms will backfire. Our use case will show how it's possible to mitigate these challenges and learn from similar situations.

This talk has been presented at DevOps.js Conf 2024, check out the latest edition of this Tech Conference.

FAQ

Server level indicators, or SLIs, are measures of the service level provided, typically defined as a ratio of good events over total events, ranging from 0 to 100%. Common examples include availability, throughput, request latency, and error rates.

A Service Level Objective (SLO) is a target value for a service level measured by an SLI. It defines the performance threshold for service compliance, such as ensuring 95% of requests are served within 100 milliseconds.

Burn rate alerting calculates the rate at which SLOs are failing over multiple time windows. It focuses on sustained deviations, helping to reduce alert fatigue and indicate the severity of service degradation.

Transforms are persistent tasks in Elasticsearch that allow conversion of existing indices into summarized indices. These transforms enable aggregation and pivoting of data, facilitating insights and analytics on entity-centric indices.

The SLO transform architecture relies on transforms to roll up source data into roll-up indices, which are then summarized into an entity-centric index for each SLO. This structure supports features like group by or partition by, enhancing search and sort capabilities in SLO dimensions.

To create an SLO in Elastic's serverless environment, navigate to observability, select SLOs, and choose to create a new SLO. You define the type, index, time field, and queries for good and total events. Objectives, time windows, and burn rate rules are set to monitor and alert based on specific performance thresholds.

A good SLO is specific, measurable, user-centric, quantifiable, and achievable with a defined time frame. In contrast, a bad SLO is vague, lacks quantifiable metrics, has an undefined threshold, and no observation window, making it ineffective for monitoring service reliability.

Virginia Diana Todea
Virginia Diana Todea
8 min
15 Feb, 2024

Comments

Sign in or register to post your comment.

Video Summary and Transcription

This Talk provides an introduction to Serverless Observability and SLOs, explaining the concept of SLOs and their dependency on transforms. It highlights the codependency between SLOs, SLAs, and SLIs and discusses the importance of well-defined SLOs. The Talk also demonstrates how to create and monitor SLOs and alert rules, emphasizing the benefits of burn rate alerting in reducing alert fatigue and improving user experience.

1. Introduction to Serverless Observability and SLOs

Short description:

Hi, I'm Diana Toda. I'm here to present Serverless Observability where SLOs meet transforms. We'll discuss the concept, the SLO's dependency on transforms, SLO transform architecture, burn rate alerting, and have a short demo. Server level indicators are a measure of the service level, defined as a ratio of goods over total events. The service level objectives are the target values for a service level, and the error budget is the tolerated quantity of errors.

Hi, DevOps.js. I'm Diana Toda. I'm an SRE at Elastic, and I'm here to present Serverless Observability where SLOs meet transforms. So we're going to talk about the concept, the SLO's dependency on transforms, SLO transform architecture, burn rate alerting, and we're going to have a short demo.

So a bit of context. With Elastic's migration to serverless, we had the need to come up with a new idea around the rollup aggregations. So Elastic has a multi-cluster infrastructure, and we needed to move away from rollup aggregations and search due to some of their limitations. Then we started creating the transforms.

So let's start with some definitions. Server level indicators, as probably you well know, are a measure of the service level provided. They are usually defined as a ratio of goods over total events, and they range between 0 and 100%. Some examples, availability, throughput, request latency, error rates. The service level objectives are a target value for a service level measured by an SLI. Above the threshold, the service is compliant. For example, 95% of the successful requests are served under 100 milliseconds. The error budget is defined as 100% minus the SLO. So it's the quantity of errors that is tolerated, and the burn rate is the rate at which we are burning the error budget over a defined period of time. It's very useful at alerting before exhausting the error budget.

2. Codependency Between SLOs, SLAs, and SLIs

Short description:

So we have a codependency between SLOs, SLAs, and SLIs. How do we recognize the good SLO versus a bad SLO? A well-defined SLO focuses on a crucial aspect of service quality, provides clarity, measurability, and alignment with user expectations. The SLO architecture relies on transforms to roll up the source data and summarize it into entity-centric indices. Transforms enable you to convert existing indices, providing new insights and analytics. Burn rate alerting calculates the rate at which SLOs are failing over time, helping prioritize issues. It has reduced alert fatigue, improved user experience, and good precision. Let's move on to the demo where you can create and monitor SLOs.

So we have a codependency between SLOs, SLAs, and SLIs. So how do we recognize the good SLO versus a bad SLO? A bad SLO is vague, subjective, it lacks quantifiable metrics, it has an undefined threshold, and no observation window. A good SLO is specific and measurable, user-centric, quantifiable, and achievable, and it's time frame defined. So a well-defined SLO focuses on a crucial aspect of service quality, provides clarity, measurability, and alignment with user expectations, which are essential elements for effective monitoring and evaluation of service reliability.

The SLO architecture, basically the SLOs rely on the transform surface to roll up the source data into roll-up indices. To support a group by or the partition by feature, Elastic has added a second layer which summarizes the roll-up data into an entity-centric index for each SLO. This index also powers the search experience to allow users to search and sort by in any SLO dimension. So what are transforms? Transforms are persistent tasks that enable you to convert existing Elastic search indices into summarized indices, which provide opportunities for new insights and analytics. For example, you can use transforms to pivot your data into entity-centric indices that summarize the behavior of users or sessions or other entities in your data. Or you can use transforms to find the latest document among all the documents that have a certain unique key.

The burn rate alerting calculates the rate at which SLOs are failing over multiple windows of time, is less sensitive to short-term fluctuations by focusing on sustained deviations, and it can give you an indication of how severely the service is degrading and helps prioritize multiple issues at the same time. Here we have a graph of burn rate alerting with multiple windows. So we have two windows for each severity, a short and a long one. The short window is 112 of the long window so when the burn rate for both windows exceeds the threshold, the alert is triggered. The pros with burn rate alerting is that it has a reduced alert fatigue, improved user experience, a flexible alerting framework, and a good precision. The con at the moment is that you have lots of options to configure, but this will be improved with future versions of Elasticsearch.

So it's demo time. So here there is some demo that I made up for you around the transforms. You can see you can create the transforms there. You can check the data behind it. You have stat, JSON, messages, and some preview. And you can check the health of each transform. It could be degraded, healthy, or even failed. If you have some issues, you can troubleshoot it right from this screen. So let's try to create some SLOs. You go to observability, SLOs, and create a new SLO. You choose the type of the slide that you want, the index. In my case, I will use an serverless index, and a time seven field. You add your query filter that you're interested in, the good query that you like for your SLO, and the total query. Afterwards, you have an interesting selection here to partition by.

Check out more articles and videos

We constantly think of articles and videos that might spark Git people interest / skill us up or help building a stellar career

You Don’t Know How to SSR
DevOps.js Conf 2024DevOps.js Conf 2024
23 min
You Don’t Know How to SSR
The Talk covers the speaker's personal journey into server-side rendering (SSR) and the evolution of web development frameworks. It explores the use of jQuery for animations in SSR, the challenges faced in integrating React with Umbraco, and the creation of a custom SSR framework. The Talk also discusses the benefits of Next.js and the use of serverless artifacts for deployment. Finally, it highlights the features of Astro, including its function per route capability.
Multithreaded Logging with Pino
JSNation Live 2021JSNation Live 2021
19 min
Multithreaded Logging with Pino
Top Content
Today's Talk is about logging with Pino, one of the fastest loggers for Node.js. Pino's speed and performance are achieved by avoiding expensive logging and optimizing event loop processing. It offers advanced features like async mode and distributed logging. The use of Worker Threads and Threadstream allows for efficient data processing. Pino.Transport enables log processing in a worker thread with various options for log destinations. The Talk concludes with a demonstration of logging output and an invitation to reach out for job opportunities.
AWS Lambda under the hood
Node Congress 2023Node Congress 2023
22 min
AWS Lambda under the hood
Top Content
In this Talk, key characteristics of AWS Lambda functions are covered, including service architecture, composition, and optimization of Node.js code. The two operational models of Lambda, asynchronous and synchronous invocation, are explained, highlighting the scalability and availability of the service. The features of Lambda functions, such as retries and event source mapping, are discussed, along with the micro VM lifecycle and the three stages of a Lambda function. Code optimization techniques, including reducing bundle size and using caching options, are explained, and tools like webpack and Lambda Power Tuning are recommended for optimization. Overall, Lambda is a powerful service for handling scalability and traffic spikes while enabling developers to focus on business logic.
Advanced GraphQL Architectures: Serverless Event Sourcing and CQRS
React Summit 2023React Summit 2023
28 min
Advanced GraphQL Architectures: Serverless Event Sourcing and CQRS
Watch video: Advanced GraphQL Architectures: Serverless Event Sourcing and CQRS
GraphQL is a strongly typed, version-free query language that allows you to ask for specific data and get it in JSON format. It simplifies data retrieval and modification by allowing the server to handle all necessary operations. Serverless architectures, such as AWS Lambda, are scalable, cost-effective, and good for event-driven applications. Event sourcing and CQRS are techniques that ensure consistency and separate reading and writing parts of an application. Building a GraphQL API with commands and queries can be achieved using AWS AppSync and DynamoDB. This approach offers low latency, scalability, and supports multiple languages. Challenges include application complexity, data modeling, and tracing, but starting with simplicity and making something work first can lead to success.
Observability for Microfrontends
DevOps.js Conf 2022DevOps.js Conf 2022
24 min
Observability for Microfrontends
Microfrontends follow the microservices paradigm and observability is crucial for debugging runtime production issues. Error boundaries and tracking errors help identify and resolve issues. Automation of alerts improves incident response. Observability can help minimize the time it takes to understand and resolve production issues. Catching errors from the client and implementing boundaries can be done with tools like OpenTelemetry.
Observability with diagnostics_channel and AsyncLocalStorage
Node Congress 2023Node Congress 2023
21 min
Observability with diagnostics_channel and AsyncLocalStorage
Observability with Diagnostics Channel and async local storage allows for high-performance event tracking and propagation of values through calls, callbacks, and promise continuations. Tracing involves five events and separate channels for each event, capturing errors and return values. The span object in async local storage stores data about the current execution and is reported to the tracer when the end is triggered.

Workshops on related topic

AI on Demand: Serverless AI
DevOps.js Conf 2024DevOps.js Conf 2024
163 min
AI on Demand: Serverless AI
Top Content
Featured WorkshopFree
Nathan Disidore
Nathan Disidore
In this workshop, we discuss the merits of serverless architecture and how it can be applied to the AI space. We'll explore options around building serverless RAG applications for a more lambda-esque approach to AI. Next, we'll get hands on and build a sample CRUD app that allows you to store information and query it using an LLM with Workers AI, Vectorize, D1, and Cloudflare Workers.
Building Serverless Applications on AWS with TypeScript
Node Congress 2021Node Congress 2021
245 min
Building Serverless Applications on AWS with TypeScript
Workshop
Slobodan Stojanović
Slobodan Stojanović
This workshop teaches you the basics of serverless application development with TypeScript. We'll start with a simple Lambda function, set up the project and the infrastructure-as-a-code (AWS CDK), and learn how to organize, test, and debug a more complex serverless application.
Table of contents:        - How to set up a serverless project with TypeScript and CDK        - How to write a testable Lambda function with hexagonal architecture        - How to connect a function to a DynamoDB table        - How to create a serverless API        - How to debug and test a serverless function        - How to organize and grow a serverless application


Materials referred to in the workshop:
https://excalidraw.com/#room=57b84e0df9bdb7ea5675,HYgVepLIpfxrK4EQNclQ9w
DynamoDB blog Alex DeBrie: https://www.dynamodbguide.com/
Excellent book for the DynamoDB: https://www.dynamodbbook.com/
https://slobodan.me/workshops/nodecongress/prerequisites.html
Serverless for React Developers
React Summit 2022React Summit 2022
107 min
Serverless for React Developers
WorkshopFree
Tejas Kumar
Tejas Kumar
Intro to serverlessPrior Art: Docker, Containers, and KubernetesActivity: Build a Dockerized application and deploy it to a cloud providerAnalysis: What is good/bad about this approach?Why Serverless is Needed/BetterActivity: Build the same application with serverlessAnalysis: What is good/bad about this approach?
Building a GraphQL-native serverless backend with Fauna
GraphQL Galaxy 2021GraphQL Galaxy 2021
143 min
Building a GraphQL-native serverless backend with Fauna
WorkshopFree
Rob Sutter
Shadid Haque
2 authors
Welcome to Fauna! This workshop helps GraphQL developers build performant applications with Fauna that scale to any size userbase. You start with the basics, using only the GraphQL playground in the Fauna dashboard, then build a complete full-stack application with Next.js, adding functionality as you go along.

In the first section, Getting started with Fauna, you learn how Fauna automatically creates queries, mutations, and other resources based on your GraphQL schema. You learn how to accomplish common tasks with GraphQL, how to use the Fauna Query Language (FQL) to perform more advanced tasks.

In the second section, Building with Fauna, you learn how Fauna automatically creates queries, mutations, and other resources based on your GraphQL schema. You learn how to accomplish common tasks with GraphQL, how to use the Fauna Query Language (FQL) to perform more advanced tasks.
Scaling Databases For Global Serverless Applications
Node Congress 2022Node Congress 2022
83 min
Scaling Databases For Global Serverless Applications
WorkshopFree
Ben Hagan
Ben Hagan
This workshop discusses the challenges Enterprises are facing when scaling the data tier to support multi-region deployments and serverless environments. Serverless edge functions and lightweight container orchestration enables applications and business logic to be easily deployed globally, often leaving the database as the latency and scaling bottleneck.
Join us to understand how PolyScale.ai solves these scaling challenges intelligently caching database data at the edge, without sacrificing transactionality or consistency. Get hands on with PolyScale for implementation, query observability and global latency testing with edge functions.
Table of contents- Introduction to PolyScale.ai- Enterprise Data Gravity- Why data scaling is hard- Options for Scaling the data tier- Database Observability- Cache Management AI- Hands on with PolyScale.ai
Live e2e test debugging for a distributed serverless application
TestJS Summit 2021TestJS Summit 2021
146 min
Live e2e test debugging for a distributed serverless application
WorkshopFree
Serkan Ozal
Oguzhan Ozdemir
2 authors
In this workshop, we will be building a testing environment for a pre-built application, then we will write and automate end-to-end tests for our serverless application. And in the final step, we will demonstrate how easy it is to understand the root cause of an erroneous test using distributed testing and how to debug it in our CI/CD pipeline with Thundra Foresight.

Table of contents:
- How to set up and test your cloud infrastructure
- How to write and automate end-to-end tests for your serverless workloads
- How to debug, trace, and troubleshot test failures with Thundra Foresight in your CI/CD pipelines