My name is Chris. I work for Axiom, which is the application you just saw, and in the open source world, I'm known for some contributions to Create T3 App, tRPC, and some other projects.
So, how do you handle 10 million items? Well, my recommendation is simple: just don't do it. Now, of course, that's not very helpful advice on its own, but I am serious here: try to avoid it. This kind of work takes a lot of time that you could spend fixing bugs, writing new features, and so on, and it also makes your codebase much harder for the next person to work on.
I think many of us love this idea of solving very interesting technological problems, but what you need to think about is, is this the best way you can spend your time to make your users' lives better? The way to know whether you need this or not is to listen to your customers and see what their frustrations are. The other thing that can help you figure out if you need this or not is to consistently develop against a real server with real data of a similar scale as your biggest users have.
So, if you should avoid it, how can you do that? It depends on your situation, but there are many options here. You can use pagination, fetching 10, 20, or 100 results at a time. You can use streaming, loading only the specific results that are currently needed. Or maybe you don't even need the individual items at all, just some aggregation over them; in that case, you can aggregate server-side, or in many cases, even in the database. And the final thing I want to point out is that if your product owner asks for this, I would really suggest negotiating the requirements and figuring out why it is that they want it. Maybe they're actually presenting you with an XY problem, and there's really a much better solution, such as one of the three approaches above.
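To make the first three alternatives concrete, here's a minimal TypeScript sketch. All names here are illustrative, not Axiom's actual API: `allSpans` stands in for a server-side dataset, and the functions show the shape of pagination, streaming, and server-side aggregation.

```typescript
// Illustrative dataset standing in for millions of rows on the server.
type Span = { id: number; durationMs: number };

const allSpans: Span[] = Array.from({ length: 10_000 }, (_, i) => ({
  id: i,
  durationMs: (i % 50) + 1,
}));

// 1. Pagination: hand back a fixed-size page plus the offset of the next one.
function getPage(
  offset: number,
  limit: number
): { items: Span[]; nextOffset: number | null } {
  const items = allSpans.slice(offset, offset + limit);
  const nextOffset = offset + limit < allSpans.length ? offset + limit : null;
  return { items, nextOffset };
}

// 2. Streaming, sketched as a generator: consumers pull items only as needed
//    instead of materializing the whole list at once.
function* streamSpans(): Generator<Span> {
  for (const span of allSpans) yield span;
}

// 3. Server-side aggregation: ship one number instead of 10,000 rows.
function p95Duration(): number {
  const sorted = allSpans.map((s) => s.durationMs).sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length * 0.95)];
}

const firstPage = getPage(0, 100);
console.log(firstPage.items.length, firstPage.nextOffset); // 100 100
console.log(p95Duration());
```

The client-facing payload in each case is a page, a trickle, or a single number, never the full 10 million items.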
But let's say you have negotiated, you have thought about this, you have considered the alternatives, and you've come to the conclusion that the only way to make your app good is to show millions of items at a time. What do you do now? The first thing, and this is very important, is to measure before you start optimizing. There's a good chance that you're completely wrong about where your bottlenecks are. There are three main things to measure: compute, memory, and network. We'll look at each of them later on.
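As a tiny sketch of the "measure first" habit on the compute side: before changing anything, put a timer around the hot path. The function and data here are made up for illustration; `performance.now()` is available in both browsers and Node.

```typescript
// Hypothetical hot path: summing a large array of span durations.
function sumDurations(durations: number[]): number {
  let total = 0;
  for (const d of durations) total += d;
  return total;
}

const data = Array.from({ length: 1_000_000 }, (_, i) => i % 100);

// Measure the compute cost before guessing at optimizations.
const start = performance.now();
const total = sumDurations(data);
const elapsedMs = performance.now() - start;

console.log(`summed ${data.length} items -> ${total} in ${elapsedMs.toFixed(1)}ms`);
```

Browser devtools profilers and memory/network panels cover the other two axes; the point is simply to get a number before and after every change.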
So now that we're through the introduction, let's talk about the specific things that helped us out in our situation. I'm going to omit some implementation details, so a few things I show won't map one-to-one to how Axiom behaves in production. But those details are very specific to our situation, and I think we'll end up with a more useful talk this way.
Here's the starting point. It's late 2023, and tracing in Axiom works great, up until about 5,000 spans. But we didn't know that last part at the time, because why would anybody ever want to create traces that big? Then we launched a new pricing plan, and people realized they could use our tracing as, effectively, a profiling tool without incurring huge costs. So they did, and we couldn't handle it. This is what happened if you tried to open a trace with even just 10,000 spans. I think you can see from the error message that we hadn't even considered this possibility, or this way of failing.
After some quick investigation, we realized that we really had to rethink the entire architecture of the trace viewer. So let's look at what we did step by step to improve our capability by several orders of magnitude. Now this initial set of errors didn't actually originate in the front end.