There's usually some number of services in a service graph, and if you're trying to troubleshoot what's gone wrong at any given moment, doing that with just system-level logs within the context of a single service, or with high-level metrics, probably isn't going to get you where you want to go. So having the context of distributed traces in place can really help shorten the mean time to recovery for any particular failure in your system.
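As an illustration of what that can look like in practice, here's a minimal sketch of a resolver wrapped in an OpenTelemetry span, so the downstream calls it makes show up as children in the same trace. The service name, field name, and `fetchOrder` are hypothetical, and an OpenTelemetry SDK is assumed to be configured elsewhere in the process:

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("orders-service");

// Stand-in for a real downstream call (HTTP, gRPC, SQL, ...).
async function fetchOrder(id: string): Promise<{ id: string }> {
  return { id };
}

export const resolvers = {
  Query: {
    order: (_parent: unknown, args: { id: string }) =>
      tracer.startActiveSpan("Query.order", async (span) => {
        span.setAttribute("order.id", args.id);
        try {
          // Runs inside the active span, so downstream instrumentation
          // attaches its spans to this trace rather than starting a new one.
          return await fetchOrder(args.id);
        } catch (err) {
          // The failure is recorded on the span, which is what lets you find
          // it in the trace instead of digging through per-service logs.
          span.recordException(err as Error);
          span.setStatus({ code: SpanStatusCode.ERROR });
          throw err;
        } finally {
          span.end();
        }
      }),
  },
};
```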
Yeah, absolutely. Awesome. And Mark, do you have anything you want to add on to this? What does GitHub do when it comes to preventing downtime?

Yeah, I'll say tracing is absolutely great; we use that as well. One thing I'll add that is a little tricky, even with tracing, is that for us it's very important to be able to tell what kind of pressure GraphQL exerts on our systems, especially data stores. Especially with things like DataLoader, which might fetch data in a different context than a specific field, you can sometimes look at a field and see that it's resolving very fast, but maybe it's enqueuing a lot of data to be loaded later. And if you don't observe this well, it can be really tricky to find the root cause of increased reads or writes on our database. So the thing we noticed gave us the most bang for our buck was looking at the external calls that GraphQL queries make and seeing if a certain query, for example, has started making 500 MySQL queries because something was not optimized or we're missing a data loader somewhere. So having observability into the external calls coming out of GraphQL, even more than just timing the GraphQL resolvers themselves, has been really helpful.
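One way to get at that, sketched below under a few assumptions (the dataloader library, plus hypothetical `db` and `metrics` clients), is to count the database round trips attributed to each GraphQL operation, including the ones DataLoader defers and batches, and report the total against the operation name:

```typescript
import DataLoader from "dataloader";

// Hypothetical stand-ins for a real database client and metrics sink.
declare const db: { query(sql: string, params: unknown[]): Promise<any[]> };
declare const metrics: {
  histogram(name: string, value: number, tags: Record<string, string>): void;
};

interface RequestContext {
  operationName: string;
  dbQueryCount: number;
}

// Every database round trip is attributed to the operation that triggered it.
function countedQuery(ctx: RequestContext, sql: string, params: unknown[]) {
  ctx.dbQueryCount += 1;
  return db.query(sql, params);
}

// Each DataLoader flush still counts as one round trip, so a missing or
// broken loader shows up as a jump in the per-operation count even when the
// individual field resolvers look fast.
function makeUserLoader(ctx: RequestContext) {
  return new DataLoader<string, unknown>((ids) =>
    countedQuery(ctx, "SELECT * FROM users WHERE id IN (?)", [ids]).then(
      (rows) => ids.map((id) => rows.find((r) => r.id === id))
    )
  );
}

// Emitted once per request, so a dashboard or alert can catch an operation
// that has started making hundreds of queries.
function reportOperation(ctx: RequestContext) {
  metrics.histogram("graphql.db_queries_per_operation", ctx.dbQueryCount, {
    operation: ctx.operationName,
  });
}
```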
Yeah, I mean, that makes a ton of sense, because that I/O eventually has to happen somewhere. And so, if you don't have the right checks in place, the bottleneck might end up affecting other parts of the system, which is trickier, right? It starts creating latency for other requests, kind of like a red herring situation. So, yeah, that's really invaluable information. Thank you. Awesome. So we've talked a little bit about keeping systems reliable and avoiding downtime. Let's take a different route and go a little more positive: how do you handle large spikes in traffic? Ideally, this is something that we want, right? We want to see more traffic coming to our APIs. But at the same time, especially in situations where you might not know that traffic is coming, what are your preferred methods for handling it? I'm going to start with Mandy this time and then jump around.
Well, I think you're probably going to start noticing a recurring theme in some of my answers, because I'm a big fan of observability tooling with respect to GraphQL APIs. That's definitely really important to have in place, because it can help you identify patterns as they begin emerging rather than being surprised by something you don't need to be surprised by. And with those kinds of tools in place, you can also configure them to alert you when something unexpected is happening in your system, whether it's around errors or a big spike in traffic, and give you a heads-up that there might be something you want to deal with preemptively, before something bad happens.

I like that. So, kind of swooping in before that spike happens: just knowing, once you get outside the bounds of normalcy for your API, being alerted to that so you can react immediately.

Yeah, something along the lines of an ounce of prevention is worth a pound of cure, right? Yeah. Yeah.
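As a rough sketch of how that kind of alerting hook could be wired up (Apollo Server's plugin API is assumed here, and the `metrics` client is a hypothetical stand-in), a plugin can count operations and errors per operation name so an external alerting system can flag rates that drift outside the API's normal bounds:

```typescript
import type { ApolloServerPlugin } from "@apollo/server";

// Hypothetical metrics client; any counter-based sink would do.
declare const metrics: {
  increment(name: string, tags?: Record<string, string>): void;
};

export const trafficMetricsPlugin: ApolloServerPlugin = {
  async requestDidStart() {
    return {
      // Count every operation, so a sudden spike in volume for one
      // operation name stands out against its usual baseline.
      async willSendResponse({ request }) {
        metrics.increment("graphql.operations", {
          operation: request.operationName ?? "anonymous",
        });
      },
      // Count errors separately; alerting on the per-operation error rate
      // surfaces "something unexpected" before it becomes an outage.
      async didEncounterErrors({ request }) {
        metrics.increment("graphql.errors", {
          operation: request.operationName ?? "anonymous",
        });
      },
    };
  },
};
```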