Video Summary and Transcription
This Talk provides an introduction to Serverless Observability and SLOs, explaining the concept of SLOs and their dependency on transforms. It highlights the codependency between SLOs, SLAs, and SLIs and discusses the importance of well-defined SLOs. The Talk also demonstrates how to create and monitor SLOs and alert rules, emphasizing the benefits of burn rate alerting in reducing alert fatigue and improving user experience.
1. Introduction to Serverless Observability and SLOs
Hi, I'm Diana Toda. I'm here to present Serverless Observability, where SLOs meet transforms. We'll discuss the concept, the SLO's dependency on transforms, the SLO transform architecture, burn rate alerting, and have a short demo. Service level indicators are a measure of the service level provided, defined as a ratio of good events over total events. Service level objectives are the target values for a service level, and the error budget is the tolerated quantity of errors.
Hi, DevOps.js. I'm Diana Toda. I'm an SRE at Elastic, and I'm here to present Serverless Observability where SLOs meet transforms. So we're going to talk about the concept, the SLO's dependency on transforms, SLO transform architecture, burn rate alerting, and we're going to have a short demo.
So a bit of context. With Elastic's migration to serverless, we had the need to come up with a new idea around the rollup aggregations. So Elastic has a multi-cluster infrastructure, and we needed to move away from rollup aggregations and search due to some of their limitations. Then we started creating the transforms.
So let's start with some definitions. Service level indicators, as you probably well know, are a measure of the service level provided. They are usually defined as a ratio of good events over total events, and they range between 0 and 100%. Some examples: availability, throughput, request latency, error rates. Service level objectives are a target value for a service level measured by an SLI; above the threshold, the service is compliant. For example, 95% of successful requests are served under 100 milliseconds. The error budget is defined as 100% minus the SLO, so it's the quantity of errors that is tolerated, and the burn rate is the rate at which we are burning through the error budget over a defined period of time. It's very useful for alerting before exhausting the error budget.
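To make those definitions concrete, here is a minimal TypeScript sketch of the ratios described above. The numbers and function names are illustrative, not Elastic's implementation.

```typescript
// SLI: ratio of good events over total events (0..1, usually shown as 0..100%).
const sli = (good: number, total: number): number => good / total;

// Error budget: 100% minus the SLO target.
const errorBudget = (sloTarget: number): number => 1 - sloTarget;

// Burn rate: how fast the error budget is being consumed.
// 1 means the budget lasts exactly the SLO window; higher means it runs out early.
const burnRate = (good: number, total: number, sloTarget: number): number =>
  (1 - sli(good, total)) / errorBudget(sloTarget);

// Example: a 99% target with 970 good events out of 1000 in the window.
console.log(sli(970, 1000));            // 0.97
console.log(errorBudget(0.99));         // 0.01
console.log(burnRate(970, 1000, 0.99)); // 3 -> burning the budget 3x faster than allowed
```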
2. Codependency Between SLOs, SLAs, and SLIs
So we have a codependency between SLOs, SLAs, and SLIs. How do we recognize a good SLO versus a bad SLO? A well-defined SLO focuses on a crucial aspect of service quality and provides clarity, measurability, and alignment with user expectations. The SLO architecture relies on transforms to roll up the source data and summarize it into entity-centric indices. Transforms enable you to convert existing indices, providing new insights and analytics. Burn rate alerting calculates the rate at which SLOs are failing over time, helping prioritize issues. It reduces alert fatigue, improves user experience, and offers good precision. Let's move on to the demo, where you can create and monitor SLOs.
So we have a codependency between SLOs, SLAs, and SLIs. So how do we recognize a good SLO versus a bad SLO? A bad SLO is vague and subjective, it lacks quantifiable metrics, it has an undefined threshold and no observation window. A good SLO is specific and measurable, user-centric, quantifiable and achievable, and it has a defined time frame. So a well-defined SLO focuses on a crucial aspect of service quality and provides clarity, measurability, and alignment with user expectations, which are essential elements for effective monitoring and evaluation of service reliability.
The SLO architecture: basically, the SLOs rely on transforms to roll up the source data into rollup indices. To support the group-by, or partition-by, feature, Elastic has added a second layer which summarizes the rollup data into an entity-centric index for each SLO. This index also powers the search experience, allowing users to search and sort by any SLO dimension. So what are transforms? Transforms are persistent tasks that enable you to convert existing Elasticsearch indices into summarized indices, which provide opportunities for new insights and analytics. For example, you can use transforms to pivot your data into entity-centric indices that summarize the behavior of users, sessions, or other entities in your data. Or you can use transforms to find the latest document among all the documents that have a certain unique key.
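As an illustration of a pivot transform, here is a hedged sketch using the 8.x Elasticsearch JavaScript client. The index names, fields, and good/total aggregations are made up for this example; they are not the actual SLO transforms that Elastic ships.

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'https://localhost:9200' }); // auth omitted for brevity

// A continuous pivot transform: group raw events by service and time bucket,
// and roll them up into counts of good and total events.
await client.transform.putTransform({
  transform_id: 'slo-rollup-example',        // hypothetical name
  source: { index: 'my-raw-events-*' },      // hypothetical source index
  dest: { index: 'slo-rollup-example-dest' },
  pivot: {
    group_by: {
      'service.name': { terms: { field: 'service.name' } },
      '@timestamp': { date_histogram: { field: '@timestamp', fixed_interval: '1m' } },
    },
    aggregations: {
      good_events: { filter: { range: { 'http.response.status_code': { lt: 500 } } } },
      total_events: { value_count: { field: 'event.id' } },
    },
  },
  sync: { time: { field: '@timestamp' } },   // makes the transform continuous
  frequency: '1m',
});
```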
Burn rate alerting calculates the rate at which SLOs are failing over multiple windows of time. It is less sensitive to short-term fluctuations because it focuses on sustained deviations, it can give you an indication of how severely the service is degrading, and it helps prioritize multiple issues at the same time. Here we have a graph of burn rate alerting with multiple windows. We have two windows for each severity, a short and a long one. The short window is 1/12th of the long window, so when the burn rate for both windows exceeds the threshold, the alert is triggered. The pros of burn rate alerting are reduced alert fatigue, improved user experience, a flexible alerting framework, and good precision. The con at the moment is that you have lots of options to configure, but this will be improved with future versions of Elasticsearch.
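Here is a hedged TypeScript sketch of that multi-window check. The window sizes, thresholds, and the `burnRateFor` helper are hypothetical; the logic simply mirrors the rule described above.

```typescript
interface SeverityWindow {
  longWindowHours: number; // e.g. 1, 6, 24, or 72 hours
  threshold: number;       // burn-rate threshold for this severity
}

// Stub: a real rule would query the rolled-up SLO data for the given look-back period
// and apply the burn-rate formula; hard-coded here so the sketch runs on its own.
async function burnRateFor(lookbackHours: number): Promise<number> {
  return lookbackHours > 1 ? 15.2 : 16.0; // pretend both windows are currently burning hot
}

async function shouldAlert({ longWindowHours, threshold }: SeverityWindow): Promise<boolean> {
  const shortWindowHours = longWindowHours / 12; // short window = 1/12th of the long window
  const [longRate, shortRate] = await Promise.all([
    burnRateFor(longWindowHours),
    burnRateFor(shortWindowHours),
  ]);
  // Alert only when BOTH windows exceed the threshold: the long window shows the burn is
  // sustained, while the short window stops alerting quickly once the burn recovers.
  return longRate > threshold && shortRate > threshold;
}

// Example: a severity with a 1-hour long window and an illustrative 14.4x threshold.
console.log(await shouldAlert({ longWindowHours: 1, threshold: 14.4 })); // true
```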
So it's demo time. Here is a demo that I prepared for you around the transforms. You can see you can create the transforms there, and you can check the data behind them. You have stats, JSON, messages, and a preview. And you can check the health of each transform: it could be degraded, healthy, or even failed. If you have some issues, you can troubleshoot them right from this screen. So let's try to create some SLOs. You go to Observability, SLOs, and create a new SLO. You choose the type of SLI that you want and the index. In my case, I will use a serverless index and a timestamp field. You add the query filter that you're interested in, the good query that you like for your SLO, and the total query. Afterwards, you have an interesting selection here to partition by.
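The same transform health information shown in that screen can also be read programmatically. Below is a hedged sketch with the 8.x Elasticsearch JavaScript client; `getTransformStats` is a real client method, but the exact response fields (such as `health.status`) vary by version, so treat the field access as illustrative.

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'https://localhost:9200' }); // auth omitted for brevity

const stats = await client.transform.getTransformStats({ transform_id: 'slo-rollup-example' });

for (const t of stats.transforms) {
  // state: e.g. "started" or "stopped"; the health status maps to the
  // healthy / degraded / failed indicator shown in the Kibana transforms UI.
  console.log(t.id, t.state, t.health?.status);
}
```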
3. Creating SLOs and Alert Rules
You set your objectives and target SLO for a specific time window. Add a title, description, and tags to identify your SLO. Choose a burn rate rule and configure it with various options. Save the rule and search for your SLO. The panel provides an overview, alerts, and options to edit availability or create new alert rules.
For example, the serverless project ID, or the cluster type, etc. You set your objectives for the time window, depending on what you want to do, and the target SLO, let's say, for example, 99%. You add a title to your SLO and a short description of what it does. And basically, you can add some tags to better identify your SLO. And if you want a burn rate rule, you click on the tick there, and there you have it: you have your SLO, which immediately prompts you to create a burn rate rule. You have lots of options to define the hours, the action groups, etc., and you can select an action depending on where you want to be alerted. You save the rule, and then let's start looking for it. You can start typing the name of your SLO, and as you can see, I have a list of my SLOs grouped by serverless project ID. And in this panel, on the screen, you have the overview and the alerts. You could go into actions, edit the availability, or create a new alert rule.
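For readers who prefer to script this, Kibana also exposes an SLO API. The sketch below mirrors the UI fields from the demo (index, timestamp field, filter, good/total queries, partition by, target, tags); the endpoint path, host, and exact field names are assumptions that may differ between Kibana versions, so check the docs for your release.

```typescript
// Hypothetical Kibana host and API key; all field values are illustrative.
const response = await fetch('https://my-kibana.example.com/api/observability/slos', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'kbn-xsrf': 'true',
    Authorization: `ApiKey ${process.env.KIBANA_API_KEY}`,
  },
  body: JSON.stringify({
    name: 'serverless-project-availability',            // title
    description: 'Availability per serverless project', // short description
    indicator: {
      type: 'sli.kql.custom',                            // custom query SLI, as in the demo
      params: {
        index: 'my-serverless-index-*',                  // the chosen index
        timestampField: '@timestamp',
        filter: 'service.environment: production',       // query filter
        good: 'http.response.status_code < 500',         // "good" query
        total: 'http.response.status_code: *',           // "total" query
      },
    },
    groupBy: 'serverless.project.id',                    // the "partition by" selection
    timeWindow: { duration: '30d', type: 'rolling' },    // observation window
    budgetingMethod: 'occurrences',
    objective: { target: 0.99 },                         // 99% target from the demo
    tags: ['serverless', 'demo'],
  }),
});
console.log(response.status); // 200 when the SLO and its backing transforms are created
```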