Hi everybody! I'm very happy to be here and to have the opportunity to share my thoughts and learnings about performance with GraphQL, specifically in a service mesh. Let me quickly introduce myself. I'm Robert Horslowski, working at Instana, an IBM company. In 2016 I gave a talk about GraphQL and Relay. Later, in 2018, I published this video course about a full-stack Trello clone built on top of GraphQL. Then, in 2019, I found a subtle performance issue in its live demo application, which got all of this rolling.
But let's first dive in and see what we mean by a distributed mesh. We actually don't have only one service; typically our landscape looks like this from an infrastructure point of view. Of course, one or two machines can go down and so on, but this is typically handled. But what is happening at the service level? This is typically what a service mesh looks like when you look into it and have a representation of the traffic, of the communication. Here, too, there are of course many communications running, and this is typically not easy to see if you don't have such a tool.
But first, let's ask the question: why is performance monitoring necessary? It's quite simple. Users don't like to wait. Today we typically have a service mesh, or at least some services in use. Maybe one of them is a payment service or something like that, and typically other services depend on it. This needs to be tracked somehow, and in case of a failure, the cause should of course be easy to find and fix. Why is this important? Today, when APIs are at the center of a business, it's very important that timings are as expected. Nobody wants to wait for something and later find out that it was not their fault but somebody else's. There might even be a contract, a so-called SLA, where you define that a specific service needs to respond within a certain time. If it does not, somebody has a problem, and in the end the business has a problem.
But let's come to investigating a real performance issue. As I mentioned, I had a problem with my live demo at the time. It's a simple Kanban board with a backend where some data is stored, of course, and at that time the communication with the database was also GraphQL. For some reason, it was sometimes very slow and at other times very fast. I couldn't say where the problem was, but sometimes it was really, really slow. There was only one tool out there at the time, and it was called Apollo Engine. It was quite simple: you just added an API key to the Apollo Server when using the Apollo Server library, and then it automatically tracked these metrics and showed them here in the board. So you can see here the variance, let's say, or the spectrum of the response times: up to 13 seconds for a call, which of course is not acceptable. And there is some more information, like on the right, the number of queries and so on.
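To make the idea of a response-time spectrum a bit more concrete, here is a minimal sketch of how such timings could be collected and summarized by hand. This is not how Apollo Engine works internally; the names `timed`, `summarize`, and `percentile` are hypothetical helpers made up for this illustration, and the percentile uses the simple nearest-rank method.

```typescript
// Hypothetical helpers for illustration; not part of Apollo Server or Apollo Engine.
type Summary = { count: number; min: number; p50: number; p95: number; max: number };

// Nearest-rank percentile over an array sorted in ascending order.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) throw new Error("no samples");
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarizes the spread of a list of response times given in milliseconds.
function summarize(durationsMs: number[]): Summary {
  const sorted = [...durationsMs].sort((a, b) => a - b);
  return {
    count: sorted.length,
    min: percentile(sorted, 0),
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    max: percentile(sorted, 100),
  };
}

// Wraps any async function (for example a resolver) and records
// the duration of every call, including calls that throw.
function timed<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  samples: number[],
): (...args: A) => Promise<R> {
  return async (...args: A) => {
    const start = Date.now();
    try {
      return await fn(...args);
    } finally {
      samples.push(Date.now() - start);
    }
  };
}
```

With made-up samples such as `summarize([120, 80, 13000, 95, 110])`, the median is 110 ms while the maximum is 13000 ms, which is the kind of gap between "mostly fast" and "sometimes really slow" that an average alone would hide.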