What else can we do here? So, we think about things. GPT-4 is better at JSON, but it's slower and it costs a bit more. And because it still occasionally returns invalid JSON, we're left scratching our heads: do we just call this thing over and over again, like 20 times, until it returns valid JSON? Maybe. We tried that a little bit, with mixed results.
But I want to think about how to test some of these things. A typical pattern here is integration testing. We all know the idea: you have your production code, it hits some API, in this case the OpenAI API, with some inputs going in and some response coming back. Your tests use the production code to exercise that same path. And then there's an alternative approach: contract testing.
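Here's a minimal sketch of that integration-testing pattern, assuming a hypothetical production function `summarize_to_json` in a module `myapp.llm` that wraps the OpenAI client; the test drives the real code path against the live API.

```python
# Integration-test sketch: exercises production code against the live API.
# `myapp.llm.summarize_to_json` is a hypothetical production function.
import json

from myapp.llm import summarize_to_json  # hypothetical production module


def test_summarize_returns_valid_json_live():
    # Hits the real OpenAI API: slow, and you pay per run.
    raw = summarize_to_json("Add together 2, 3, and 5.")
    parsed = json.loads(raw)  # fails if the model returned invalid JSON
    assert "result" in parsed
```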
One big benefit here, specifically when you're hitting the OpenAI API, is that you're paying per use. So if you ran a test suite against the live OpenAI API, you'd potentially incur a lot of cost, and there's additional latency on top of that. That's one reason we chose to build some mock API pieces via contract testing. The idea is that whenever you run tests, you're actually running against the contract, which works out pretty well for most APIs I've used.
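Here's a minimal sketch of that mock-based contract-testing idea, assuming the same hypothetical `myapp.llm` module holds a module-level OpenAI client. The canned response plays the role of the contract: the shape the API has agreed to return, so the test never touches the network.

```python
# Contract-test sketch: replace the live API call with a canned response
# that matches the agreed response shape. `myapp.llm` is hypothetical.
import json
from unittest.mock import MagicMock, patch

from myapp.llm import summarize_to_json  # hypothetical production module

CANNED_CONTENT = json.dumps({"result": 10})


def _fake_completion():
    # Mimic the shape of a chat.completions response object.
    message = MagicMock()
    message.content = CANNED_CONTENT
    choice = MagicMock()
    choice.message = message
    response = MagicMock()
    response.choices = [choice]
    return response


def test_summarize_against_contract():
    # Patch the client call inside the production module: no cost, no latency.
    with patch("myapp.llm.client.chat.completions.create",
               return_value=_fake_completion()):
        raw = summarize_to_json("Add together 2, 3, and 5.")
        assert json.loads(raw) == {"result": 10}
```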
But here's the interesting part: we have a new paradigm with AI, with these LLMs, these large language models. Yes, we still have the standard contract of the API we're hitting: the input on one side, the response on the other. But on top of that, we have these more dynamic pieces, where depending on what I put in the prompt, the content that comes back may look completely different. That was a new paradigm, at least for me, to wrap my head around.

With that said, another interesting angle: say the prompt is "what is a bunch of numbers added together?", and we get an output. That works pretty well. Run the exact same thing again, and the output is a bit different. That can be problematic when your test is checking for specific things. Run it again, something else, and so forth. The non-deterministic nature of the AI we're using is going to make it really hard to test some of these things, so we have to think about what else we can do. And with any AI model, there's this question of not really knowing what you're going to get. There are a lot of ways to solve for that, but long story short, the way I think of testing these AI things is dynamic consumer-driven contract testing. For what it's worth, look that up if you're not sure exactly what it is. In essence, we're able to cover certain test cases, but other ones, maybe not so much.
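One way to read that "dynamic" part is: the consumer's test asserts on the structure it actually depends on rather than on the exact text, which can differ run to run. Here's a minimal sketch of that idea; the schema, the `ask_model` helper, and the `myapp.llm` module are all assumptions for illustration.

```python
# Sketch: validate the shape of the model's output instead of its exact text,
# so a non-deterministic response can still satisfy the consumer's contract.
import json

import jsonschema  # pip install jsonschema

from myapp.llm import ask_model  # hypothetical production helper

# The structure this consumer relies on -- its side of the contract.
CONSUMER_CONTRACT = {
    "type": "object",
    "properties": {"result": {"type": "number"}},
    "required": ["result"],
}


def test_response_matches_consumer_contract():
    raw = ask_model("What is a bunch of numbers added together?")
    payload = json.loads(raw)
    # Passes even if the wording or exact value differs between runs,
    # as long as the structure the consumer depends on is present.
    jsonschema.validate(instance=payload, schema=CONSUMER_CONTRACT)
```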