There's usually some number of services in a service graph, and if you're trying to troubleshoot what's gone wrong at any given moment, doing that with just system-level logs within the context of a single service, or with high-level metrics, probably isn't going to get you where you want to go. So having the context of distributed traces in place can really help shorten the mean time to recovery for any particular failure in your system.
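As an illustration of what that can look like in practice, here's a minimal sketch of a resolver wrapped in an OpenTelemetry span, so the downstream calls it makes show up as children in the same trace. The service name, field name, and `fetchOrder` are hypothetical, and an OpenTelemetry SDK is assumed to be configured elsewhere in the process:

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("orders-service");

// Stand-in for a real downstream call (HTTP, gRPC, SQL, ...).
async function fetchOrder(id: string): Promise<{ id: string }> {
  return { id };
}

export const resolvers = {
  Query: {
    order: (_parent: unknown, args: { id: string }) =>
      tracer.startActiveSpan("Query.order", async (span) => {
        span.setAttribute("order.id", args.id);
        try {
          // Runs inside the active span, so downstream instrumentation
          // attaches its spans to this trace rather than starting a new one.
          return await fetchOrder(args.id);
        } catch (err) {
          // The failure is recorded on the span, which is what lets you find
          // it in the trace instead of digging through per-service logs.
          span.recordException(err as Error);
          span.setStatus({ code: SpanStatusCode.ERROR });
          throw err;
        } finally {
          span.end();
        }
      }),
  },
};
```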
Yeah, absolutely. Awesome. And Mark, do you have anything you want to add on to this? What does GitHub do when it comes to preventing downtime?

Yeah, I'll say tracing is absolutely great; we use that as well. One thing I'll add that is a little tricky, even with tracing, is that for us it's very important to be able to tell what kind of pressure GraphQL exerts on our systems, especially data stores. Especially with things like DataLoader, which might fetch data in a different context than a specific field, you can sometimes look at a field and see that it's resolving very fast, but maybe it's enqueuing a lot of data to be loaded later. And if you don't observe this well, it can be really tricky to find the root cause of increased reads or writes on our database. So the thing we noticed gave us the most bang for our buck was looking at the external calls that GraphQL queries make and seeing if a certain query, for example, has started making 500 MySQL queries because something was not optimized or we're missing a data loader somewhere. So having observability into the external calls coming out of GraphQL, even more than just timing the GraphQL resolvers themselves, has been really helpful.
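One way to get at that, sketched below under a few assumptions (the dataloader library, plus hypothetical `db` and `metrics` clients), is to count the database round trips attributed to each GraphQL operation, including the ones DataLoader defers and batches, and report the total against the operation name:

```typescript
import DataLoader from "dataloader";

// Hypothetical stand-ins for a real database client and metrics sink.
declare const db: { query(sql: string, params: unknown[]): Promise<any[]> };
declare const metrics: {
  histogram(name: string, value: number, tags: Record<string, string>): void;
};

interface RequestContext {
  operationName: string;
  dbQueryCount: number;
}

// Every database round trip is attributed to the operation that triggered it.
function countedQuery(ctx: RequestContext, sql: string, params: unknown[]) {
  ctx.dbQueryCount += 1;
  return db.query(sql, params);
}

// Each DataLoader flush still counts as one round trip, so a missing or
// broken loader shows up as a jump in the per-operation count even when the
// individual field resolvers look fast.
function makeUserLoader(ctx: RequestContext) {
  return new DataLoader<string, unknown>((ids) =>
    countedQuery(ctx, "SELECT * FROM users WHERE id IN (?)", [ids]).then(
      (rows) => ids.map((id) => rows.find((r) => r.id === id))
    )
  );
}

// Emitted once per request, so a dashboard or alert can catch an operation
// that has started making hundreds of queries.
function reportOperation(ctx: RequestContext) {
  metrics.histogram("graphql.db_queries_per_operation", ctx.dbQueryCount, {
    operation: ctx.operationName,
  });
}
```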
Yeah, I mean, that makes a ton of sense, because that I/O eventually has to happen somewhere. And so, if you don't have the right checks in place, the bottleneck might end up affecting other parts of the system, which is trickier, right? It starts creating latency for other requests, kind of like a red herring situation. So, yeah, that's really invaluable information. Thank you. Awesome. So we've talked a little bit about keeping systems reliable and avoiding downtime. Let's take a different route and go a little more positive: how do you handle large spikes in traffic? Ideally, this is something that we want, right? We want to see more traffic coming to our APIs. But at the same time, especially in situations where you might not know that traffic is coming, what are your preferred methods for handling it? I'm going to start with Mandy this time and then jump around.
Well, I think you're probably going to start noticing a recurring theme in some of my answers, because I'm a big fan of observability tooling with respect to GraphQL APIs. That's definitely really important to have in place, because it can help you identify patterns as they begin emerging rather than being surprised by something you don't need to be surprised by. And with those kinds of tools in place, you can also configure them to alert you when something unexpected is happening in your system, whether it's around errors or a big spike in traffic, and give you a heads-up that there might be something you want to deal with preemptively, before something bad happens.

I like that. So, kind of swooping in before that spike happens: just knowing, once you get outside the bounds of normalcy for your API, being alerted to that so you can react immediately.

Yeah, something along the lines of an ounce of prevention is worth a pound of cure, right? Yeah. Yeah.
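As a rough sketch of how that kind of alerting hook could be wired up (Apollo Server's plugin API is assumed here, and the `metrics` client is a hypothetical stand-in), a plugin can count operations and errors per operation name so an external alerting system can flag rates that drift outside the API's normal bounds:

```typescript
import type { ApolloServerPlugin } from "@apollo/server";

// Hypothetical metrics client; any counter-based sink would do.
declare const metrics: {
  increment(name: string, tags?: Record<string, string>): void;
};

export const trafficMetricsPlugin: ApolloServerPlugin = {
  async requestDidStart() {
    return {
      // Count every operation, so a sudden spike in volume for one
      // operation name stands out against its usual baseline.
      async willSendResponse({ request }) {
        metrics.increment("graphql.operations", {
          operation: request.operationName ?? "anonymous",
        });
      },
      // Count errors separately; alerting on the per-operation error rate
      // surfaces "something unexpected" before it becomes an outage.
      async didEncounterErrors({ request }) {
        metrics.increment("graphql.errors", {
          operation: request.operationName ?? "anonymous",
        });
      },
    };
  },
};
```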