Video Summary and Transcription
Monitoring and observability are important for catching bugs before users notice them. Common monitoring problems include confusion and frustration when dashboards contradict what users report. Teamwork is essential for effective monitoring, and automation can streamline processes and improve efficiency. Custom monitoring is necessary to prevent hazards, and unnecessary alerts can hurt productivity. Challenges include relying too much on monitoring without addressing root issues and struggling with manual configuration.
1. Introduction to Monitoring and Observability
What happens when your apps crash constantly? Customers flee, revenue tanks, and we feel powerless. Today I'm going to share with you the battle-tested monitoring methods that can help you prevent disasters. It's important for teams to monitor systems and catch bugs before users notice them. Let's explore the important concepts of monitoring and observability. Monitoring and observability work together to detect issues early and prevent future problems.
What happens when your apps crash constantly? Customers flee, revenue tanks, and we feel powerless.
Hi, I'm Martin Tomician, a senior manager here at Infobip, and today I'm going to share with you the battle-tested monitoring methods that can help you prevent disasters. So let me first introduce myself. I'm 30 years old and 55% deaf in both ears. In my seven years at Infobip, I've mastered skills like web infrastructure, React, Webpack, micro-frontends, automation, monitoring, improving experience, and much more. But hey, I'm not just a coder. When I'm not conferencing or mentoring, you'll find me exploring the world, taking outstanding photographs, and collecting rubber ducks and magnets. So hang on, because it's going to be a ride.
You know what happens when Netflix stops working: customers get mad fast, and you have a bad day. Outages drive away customers and business. That's why it's important for teams to monitor systems. They play detective so they can catch bugs before you notice them.
And you know, catching issues early keeps apps humming and customers happy. So let's explore the important concepts of monitoring and observability with an example. Little Pedro fell hard and scraped his knee. His little robot assistant monitored how he limped, then cleaned and patched his wound. The robot investigated why he fell and realized he had untied shoes. Together they cared for the injury and prevented the next one. Here, monitoring and observability work together. Monitoring is basically like the little assistant. It keeps watching to see if things are working right. It checks metrics: it sees that Pedro limps and knows immediately that something is wrong. Logging helps monitoring by recording information that can be useful later. But monitoring only alerts you to issues. Monitoring alone can feel like wandering in the dark. You are not sure what is going on, and you are unsure about the root cause. Observability is like flipping on a light switch. It lights up everything, so you can see the logs, metrics, and traces clearly, and you immediately know why Pedro fell: because of his untied laces.
2. Examples of Monitoring Issues
Mechanics rely on diagnostics so they can precisely fix your car, and observability is basically a very similar thing. Monitoring tells us that something is wrong but not how to retain our users. At Infobip, we use Graylog for logging, Grafana for dashboards, Prometheus for metrics, Opsgenie for alerts, and Sentry for user-facing issues. Poor choices can affect the company, apps, and reliability. The first example is ShopPass. Their monitoring left them completely confused and frustrated, so they had to change and improve. The second example is Tom, who checks Tickethype's website, ensuring everything is right.
One more example is the blinking check engine light inside a car. It warns you about an issue but not about the cause. Mechanics rely on diagnostics so they can precisely fix your car, and observability is basically a very similar thing. It looks under the hood of the software and pinpoints the problem so that it can be fixed.
Monitoring, for example, tells us that something is wrong but not how to retain our users. Monitoring tools give our apps superpowers. Like heroes, apps can seem invincible, but they rely on the talent behind the scenes to help spot bugs early. At Infobip, for example, we use Graylog for logging, Grafana for dashboards, Prometheus for metrics, Opsgenie for alerts, and Sentry for user-facing issues. I'm going to show you some real monitoring war stories and use practical examples. Because of limited time, I'm not going to focus so much on boring tool demos. So let's start.
The Greek philosopher Plato once said that a good decision is based on knowledge and not on numbers. Our decisions do affect the company, apps, and reliability, so it's important to see how poor choices can cause instability. It is also important to understand how our application works, what its correct behavior is, and what its current performance is, so that we can do proper logging, monitoring, and troubleshooting, and so that we can make sound decisions.
The first example is ShopPass. Their system started to crash as soon as their shopping traffic jumped, and managers were in a panic. So they just started buying this and that monitoring tool without any strategy. Everything was disjointed, everyone was confused, and they didn't have any systematic approach. We've all tried putting out fires without actually seeing the full picture. This is the first anti-pattern, which is called tool obsession: we become so obsessed with certain tools that we lose perspective. It's so easy to think that the latest tool will be the silver bullet, only to end up distracted from delivering actual value. We shouldn't put all our faith in tools, because teams can start to think that they are a magic wand that leads to success. Remember, Cinderella's fairy godmother even warned her that the spell wears off at midnight. The same thing applies here, because nothing can replace the hard work.
So what was the problem for ShopPass? Their monitoring left them completely confused and frustrated, because everything showed green but the users still complained. They wasted time trying to decode those contradictions and errors, and they didn't really have any insight into the critical backend processes. So they had to change and improve, and that's exactly what they did. They looked closely at the vital signs and metrics, saw what is important and what keeps them healthy and on track, and made sure to cover that. They simplified their tools so that they look only at what matters most. And they made a focused game plan, which allowed them to spot issues early, celebrate progress, and make sure the products are more stable.
For the second example, let's imagine Tom. He checks Tickethype's website, kind of like its doctor, ensuring everything is right. He checks servers, speeds, databases, and especially errors.
3. Importance of Teamwork in Monitoring
If something is wrong, he gives a call to the team: hey, can you fix it? By doing that, and by updating monitoring tools, Tom helps keep the website in good shape. Monitoring works better as a team effort. Tom monitored things alone. The company realized that monitoring is too critical for only one person, so they needed teamwork. They automated as much as possible to streamline things so that everything runs smoothly. And we have Shopvac, for example, who set up monitoring, but did it too quickly, just using reports.
If something is wrong, he gives a call to the team: hey, can you fix it? By doing that, and by updating monitoring tools, Tom helps keep the website in good shape. So what is the problem here, actually? The problem is treating monitoring like an annoying smoke detector that you keep snoozing. One or two firefighters are not enough to handle everything alone. You need the whole team listening for the alarm, ready to grab the hose together, because ignoring even one alarm can burn our customers.
We need to work together, because monitoring works better as a team effort. All of us, devs, ops, network, SRE, think differently, and our perspectives help catch problems faster. The DevOps mindset is also about combining forces to monitor systems, because that way we will be able to keep things running more smoothly. So what was the problem? Tom monitored things alone. He could only cover so much, communication was a problem, and he had to wait for alerts to get resolved. All of that increased his stress, and he was prone to mistakes.
But the company realized that monitoring is too critical for only one person, so they needed teamwork. They took everyone on board to fix their problems. They trained the IT team on best practices and made sure that there are no blind spots and that the whole team works together. They also automated as much as possible to streamline things so that everything runs smoothly. And we have Shopvac, for example, who set up monitoring, but did it too quickly, just using reports. All of those rushed alerts backfired.
4. Improving Monitoring and Alerting
Shopvac learned the hard way that custom monitoring is essential to prevent hazards. Checkbox monitoring can give you untrustworthy data. Unnecessary alerts can hurt productivity and disturb sleep. IT teams became overwhelmed with constant, unimportant alerts and missed critical issues. They struggled with monitoring but took three steps to improve: tailoring metrics, setting up a constant review process, and improving communication. Using an open-source Grafana dashboard helped us spot issues and create alerts based on key metrics. Connecting alerts to Slack makes them easier to check. More examples are in the blog post.
We are all hurried in some way to meet a deadline, only to have things blow up later because of those cut corners. Shopvac learned the hard way that they should have had custom monitoring, which can help prevent hazards when problems occur. This is my favorite anti-pattern, and it is called checkbox monitoring: you set up monitoring just to say that you have it. That solves no problems, because you mostly end up with untrustworthy data.
For example, my team had issues with unnecessary alerts, because they hurt productivity. Imagine getting an alert at 1 a.m. like, hey, the certificate expired a month ago, please replace it, when you have already done it. Frustratingly, we couldn't disable those useless interruptions, and they were basically just disturbing our sleep.
IT teams loved their monitoring systems at first, but then became overwhelmed, because the constant, unimportant alerts were drowning out the real threats. They missed some really critical application issues. They believed they had good, reliable protection, but a big problem went unnoticed. They also had a big problem during incidents, because they had no clue what had happened based on the metrics.
So what did they do? They struggled with monitoring, so they took three steps. First, they tailored their metrics to fit their needs. Second, they set up a constant review process to improve their metrics. Third, they improved communication so that everyone is trained on how to do things properly.
I will share one example with you. We use an open-source Grafana dashboard for HAProxy, and you can find it via this QR code or link if you want to. But we didn't just copy-paste it. We carefully chose exactly what we needed and made sure it would work for our needs. And it is working. It looks like this, for example. We review it regularly and adapt it as our needs change. It really helped us: we managed to spot issues, and we even created some alerts based on the key metrics. I also want to show you one more tip. When alerts happen, it is useful to have a list of steps to investigate or troubleshoot, like which Grafana dashboard and which Graylog logs to check, because this really helps a lot. You can get alerts on your phone, which is great. But if you also connect your alerts to Slack, it is easier to check the full alerts. You can see the snippet here, and I'm going to share more examples in my blog post.
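The tip about attaching troubleshooting steps to an alert and sending it to Slack could be sketched roughly as follows. This is not the snippet shown in the talk. The alert name, the runbook steps, and the `build_slack_message` helper are hypothetical, and the sketch only builds the JSON body that a Slack incoming webhook accepts, without sending anything.

```python
import json

# Hypothetical runbook registry: alert name -> ordered investigation steps.
# The dashboard and log names below are placeholders, not real resources.
RUNBOOKS = {
    "HAProxyHighErrorRate": [
        "Open the 'HAProxy overview' Grafana dashboard",
        "Search the logs for 5xx responses from the last 15 minutes",
        "Check recent deployments for the affected backend",
    ],
}


def build_slack_message(alert_name: str, summary: str, severity: str) -> dict:
    """Build a Slack incoming-webhook payload that includes runbook steps."""
    steps = RUNBOOKS.get(alert_name, ["No runbook yet: add one after this incident"])
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    text = f"[{severity.upper()}] {alert_name}: {summary}\nFirst steps:\n{numbered}"
    return {"text": text}


if __name__ == "__main__":
    payload = build_slack_message(
        "HAProxyHighErrorRate", "5xx rate above 2% for 5 minutes", "critical"
    )
    # Sending would be an HTTP POST of this JSON to your own webhook URL.
    print(json.dumps(payload, indent=2))
```

Keeping the steps next to the alert definition means whoever receives the message does not have to remember which dashboard or log search to open first.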
5. Monitoring Challenges and Solutions
Shopbit relied too much on monitoring without addressing root issues, leading to recurring problems. They learned to investigate and reinforce weak spots proactively. Subdev struggled with manual configuration and set up monitoring for every server by hand. They implemented standardized configuration, source control, and templates to improve efficiency and quality.
The next example I want to show you is Shopbit. They sell products online and closely monitor their website, but they rely on monitoring too much and do not really use the chance to improve. This is similar to having leaky pipes and creaky floors: instead of fixing the root issues, all you do is slap on buckets and tape and call it monitoring. That happens with systems too. If we don't fix the core problems, monitoring can provide temporary relief, but technical debt will just keep piling up. We need sustainable solutions instead of band-aid fixes.
The problem was that they trusted monitoring too much and got too comfortable relying on alerts to spot an issue, so they kept getting the same problems again and again. For example, their website was overloaded with traffic. But instead of investigating the root issues, they just kept applying quick fixes, so the same problems kept returning. So what did they do? They actually asked why this happens and got to the bottom of it, and then pushed the team to reinforce those weak spots proactively. Their monitoring tools now also focus more on improving reliability instead of just reacting to things.
Then we have Subdev, for example. They are growing fast as a dev company and now have hundreds of servers. The IT team is struggling with setting up monitoring for every server by hand, and this gets harder and harder. And friends, configuring monitoring tools manually wastes a lot of time and is prone to mistakes. So let's talk about a smarter way to do this. For Subdev, the problem was that things got more complex and it was hard to keep up. They were missing things, communication was a problem, and they couldn't maintain it. They realized that trying to fix issues as they happen just wasn't working at all, so they needed to do something. They took several steps. First, they standardized the monitoring configuration so that it is aligned everywhere. Second, they use source control to have a single source of truth for everything and to improve teamwork. Third, they created templates that they can reuse to save time and ensure quality.
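The three steps above (standardized configuration, source control, reusable templates) might look something like this in miniature. The template fields, default thresholds, and the `render_config` helper are all assumptions made for illustration; the talk does not say which format or tool Subdev used.

```python
from string import Template

# Hypothetical shared template. In practice this would be a file kept in
# source control so every server's monitoring config has one origin.
CHECK_TEMPLATE = Template(
    "host: $host\n"
    "team: $team\n"
    "checks:\n"
    "  - name: cpu_usage\n"
    "    warn_percent: $cpu_warn\n"
    "  - name: disk_usage\n"
    "    warn_percent: $disk_warn\n"
)

# Standard thresholds applied everywhere unless a server overrides them.
DEFAULTS = {"cpu_warn": 80, "disk_warn": 85}


def render_config(host: str, team: str, **overrides: int) -> str:
    """Render one server's monitoring config from the shared template."""
    values = {**DEFAULTS, **overrides, "host": host, "team": team}
    return CHECK_TEMPLATE.substitute(values)


if __name__ == "__main__":
    # Hundreds of servers become a loop instead of hand-written files.
    for host in ["web-01", "web-02"]:
        print(render_config(host, team="storefront"))
    # A documented exception stays visible as an explicit override.
    print(render_config("db-01", team="data", disk_warn=90))
```

Because every config comes from the same template, a change to a default threshold is one reviewed commit rather than hundreds of manual edits.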
6. Automated Dashboard and Monitoring Challenges
We automated the Grafana dashboard and caught several issues. Monitoring for Nest Hub led to constant alert fatigue instead of a focus on critical issues. We need helpful information that doesn't overwhelm our engineers. We should make our tech work for us, not against us. Nest Shop had too many alerts and missed serious issues. They prioritized critical alerts for immediate attention.
I will show you one interesting example of what we did. We have automated the Grafana dashboard. It has only one hardcoded input, which is the team owner name. Everything else is fully calculated by Prometheus queries. The result in the dashboard looks like this. We track the status of our virtual machines and servers, and the dashboard shows us their statuses very quickly. It also helped us catch several issues that have appeared.
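One way to read "only the team owner name is hardcoded" is a small generator that derives every query from that single input. The sketch below is an assumption about the general shape, not the speaker's implementation: it emits a simplified subset of a Grafana dashboard definition, and it assumes targets carry a `team` label, which is a hypothetical labeling convention. `up` is the standard Prometheus metric that reports whether a scrape target is reachable.

```python
import json


def build_dashboard(team_owner: str) -> dict:
    """Generate a simplified dashboard definition from one input value."""
    # Assumed convention: every scrape target is labeled with its owning team.
    selector = f'team="{team_owner}"'
    queries = {
        "Instances up": f"sum(up{{{selector}}})",
        "Instances down": f"count(up{{{selector}}} == 0)",
    }
    panels = [
        {"title": title, "type": "stat", "targets": [{"expr": expr}]}
        for title, expr in queries.items()
    ]
    return {"title": f"{team_owner} server status", "panels": panels}


if __name__ == "__main__":
    print(json.dumps(build_dashboard("payments"), indent=2))
```

With this shape, onboarding a new team means passing a different name rather than cloning and hand-editing a dashboard.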
The dashboard was a really good idea, and it has been extremely useful for monitoring performance. But we have two more examples. Monitoring for Nest Hub was like helicopter parenting. They suffered constant alert fatigue because they were getting notified about paper cuts instead of only about real emergencies like broken bones. Good monitoring tells us when our site is truly sick, not when it just has the sniffles. We need to give our engineers helpful information that does not overwhelm them, so that they can focus on the critical issues.
I think this anti-pattern is pretty much self-explanatory. We shouldn't be getting calls at 3 a.m. like, hey, the certificate expired three months ago, please replace it. Constant notifications will just teach us to ignore those alerts. We've all been there. The point is, we need our tech to help us, not drive us crazy. We are human, and we always make mistakes, but we need to make sure we get better. We should make our tech work for us, not against us.
The problem for Nest Shop was that they simply had too many alerts and weren't able to tell real emergencies from less important ones. They missed a lot of serious issues that were buried among less urgent alerts. With so many alerts coming in, they couldn't really use their resources properly, and that was a problem for them. They also took several steps. They prioritized critical alerts so that they can be sorted out ASAP, especially when they matter the most.
7. Improving Alerting and Dashboard
They've changed the threshold and connected alerts to revenue. Optimized the process and boosted efficiency. Set up selective notification for non-urgent alerts. News Hub realized a single performance measure was too basic. They improved metrics and diagnostics to fix important issues and improve application speed. Simplify and focus dashboards to answer key questions quickly.
They changed the thresholds so that they have fewer false positives. They connected alerts to revenue and to what can really impact it. And they optimized the process so that the right people tackle the right problems, which also helped them boost efficiency.
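A rough sketch of those three changes combined into one routing rule is shown below. The 5% threshold, the `Alert` fields, and the `page:`/`ticket:` destinations are illustrative assumptions, not values from the talk.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    affects_revenue: bool  # e.g. the checkout or payment path
    owner_team: str


# Hypothetical threshold; raising it trades sensitivity for fewer false positives.
ERROR_RATE_THRESHOLD = 0.05


def route(alert: Alert) -> str:
    """Decide who hears about an alert, and how urgently."""
    if alert.error_rate < ERROR_RATE_THRESHOLD:
        return "ignore"                      # below threshold: likely noise
    if alert.affects_revenue:
        return f"page:{alert.owner_team}"    # wake up the owning team now
    return f"ticket:{alert.owner_team}"      # handle during normal work


if __name__ == "__main__":
    alerts = [
        Alert("checkout-5xx", 0.12, True, "payments"),
        Alert("avatar-resize-5xx", 0.09, False, "media"),
        Alert("search-5xx", 0.01, True, "search"),
    ]
    for a in alerts:
        print(a.name, "->", route(a))
```

The useful property is that the rule is explicit: anyone can read why an alert paged someone or did not, and the threshold can be tuned in one place.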
One more creative idea for non-urgent alerts: say you have an alert that a certificate will expire in a month and a half, and you want to be notified, but not at 2 a.m. You can set up your alert to notify you only during working hours or on specific days. This lets you use Prometheus queries in a really creative way, and it's good to do things like that.
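The working-hours idea boils down to a small time check. The sketch below only illustrates that decision logic in plain Python rather than in a query language, and the 09:00 to 17:00 weekday window is an assumed default, not a value from the talk.

```python
from datetime import datetime

# Assumed working window; adjust to your team's schedule and time zone.
WORK_START_HOUR = 9
WORK_END_HOUR = 17


def should_notify(urgent: bool, now: datetime) -> bool:
    """Urgent alerts always notify; others wait for working hours."""
    if urgent:
        return True
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = WORK_START_HOUR <= now.hour < WORK_END_HOUR
    return is_weekday and in_hours


if __name__ == "__main__":
    two_am_monday = datetime(2024, 1, 15, 2, 0)
    ten_am_monday = datetime(2024, 1, 15, 10, 0)
    # A certificate expiring in six weeks is not urgent.
    print(should_notify(False, two_am_monday))
    print(should_notify(False, ten_am_monday))
```

The alert condition itself stays true the whole time; only the notification is delayed, so nothing is lost by waiting until morning.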
We also have News Hub, for example. They publish millions of news items and care only about loading the content quicker as they grow and become more popular. But tracking only how fast the files loaded is not enough, because we cannot reduce complex software to one simple dumb metric. A single number may ignore real pain points. It is like medicine: if you track only a patient's weight and height, but not their blood work, you cannot diagnose issues. This interesting anti-pattern is called the big dumb metric. Imagine, for example, that 95% of responses arrive within 180 milliseconds. How does that help us? What does it tell us? Metrics should be useful. They should give us information that we can actually benefit from. On its own, this number doesn't help us at all. It's just useless.
For example, if we have one response time for absolutely everything, a single metric, or show only the top-level view without any other details, it doesn't give us any information. We cannot do anything with it. News Hub realized that a single performance measure was actually too basic and couldn't really differentiate between important things. They had missed the chance to fix really important issues and misunderstood the real problems, like application speed. They overlooked all of that, and it didn't go well. So they needed to improve, and they took three steps. First, they track detailed performance metrics, because they needed precise diagnostics. Second, they set up monitoring with tools that can actually help them, and they use only those. Third, they review their metrics to constantly improve them.
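To make the "big dumb metric" problem concrete, here is a small sketch with made-up numbers: a single site-wide 95th percentile looks healthy while one endpoint is badly slow. The endpoints and timings are invented for illustration, and `p95` uses the simple nearest-rank method rather than any particular monitoring tool's algorithm.

```python
from collections import defaultdict


def p95(values: list[int]) -> int:
    """95th percentile by the nearest-rank method (needs at least one value)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceiling of 0.95 * n, in integers
    return ordered[rank - 1]


def p95_by_endpoint(samples: list[tuple[str, int]]) -> dict[str, int]:
    """Group (endpoint, milliseconds) samples and compute p95 per endpoint."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for endpoint, ms in samples:
        grouped[endpoint].append(ms)
    return {endpoint: p95(times) for endpoint, times in grouped.items()}


if __name__ == "__main__":
    # Invented data: 96 fast home-page hits and 4 very slow searches.
    samples = [("/home", 120)] * 96 + [("/search", 2000)] * 4
    print("overall p95:", p95([ms for _, ms in samples]))
    print("per endpoint:", p95_by_endpoint(samples))
```

In this data the overall figure is dominated by the busy, fast endpoint, so the slow one only becomes visible once the metric is broken down.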
Just to mention briefly, they improved speed and reliability, but another problem can be having too many dashboards. You end up roaming between them, you don't know where to look, and troubleshooting takes too long. So we should simplify and focus dashboards so that they answer key questions precisely and quickly, without clutter.
8. Key Takeaways and Next Steps
Monitoring is crucial, but complex. Understanding correct behavior and acting quickly is important. Monitoring is a constant process, always improving. Learn more on Hashnode and stay connected for blog posts and insights. Connect on social media for links and articles. Thank you for listening, looking forward to your questions!
Anyway, just to summarize, the key takeaway is monitoring is crucial, but it's complex, and you have to be able to do it right. You have to understand what is the correct behavior for each service, because issues will always happen. That's why we monitor. Knowing something is wrong is not enough. You have to be able to also act quickly, and that's why it's important to get to know your services.
You also have to be aware that monitoring is a constant process. You should work on improving your services constantly and stay vigilant so that you can react quickly if necessary. It's hard to cover everything in 20 minutes, so if you want to learn more about this topic, I have prepared a small blog post on Hashnode that provides a few more hints. I will also write a series of blog posts in collaboration with Infobip DevRel and Infobip developers, which will be published over the next few months.
So, if you want to be in touch and see those blog posts, feel free to follow me on Hashnode. I can also recommend two amazing books that provide really great insights into reliability and monitoring: Practical Monitoring by Mike Julian and Release It! by Michael T. Nygard. You can also connect with me on social media, because I will be sharing all of the links and articles there for anyone who is curious. You can connect with me over this link, which has all of my social links. You can also find me on Twitter, which is now called X, or on LinkedIn, where you will for sure find those links.
And you know, you can also just type the Bitly link if you want to get to it quicker. I would like to thank you for listening to me. I hope that my session provided you with some useful insights, and I'm looking forward to hearing all of your questions that you may have. So thank you and see you soon. Thank you! Bye!