And for what it's worth, GPT-4o is a chat completion model, but DeepSeek is a reasoning model. So GPT-4o is already back with the answers, because you just ask for it and it's done. The reasoning model is still thinking it through, saying, wait, let me understand the nuance of this question, and I'll come back. And there it did. Right? Now this is immediately great for me, because with one prompt I was able to run my app's question on multiple models, look at the results side by side, and say, you know what? I think I'm going to use GPT-4o. Pretty nice, right?
So next: that was productivity and ideation. How about evaluation? How does evaluation work? By default, we use something called AI-assisted evaluation, which is essentially AI grading AI. Let me explain. When you build an app with AI, you're basically writing a prompt and testing it manually, right? You write the prompt, you check the responses, and so on. But when you want to test it against a large number of inputs, you need to scale it. How are you going to scale it? Natural language output could be anything, so you scale it by creating another AI to grade the first one. They call it LLM as a judge. So effectively you've set up one AI to answer customer questions, and then you've set up another AI to grade the first AI.

The way it does that, and I'm going to show it to you in a second, is it writes what's called an evaluator. We use a technology called Prompty, but there are other ways for you to do it. Let me see if I can find that folder for you for just a second. It'll be under source. So under evaluators, I have a custom evaluator for coherence. Coherence is saying: hey, my AI, use this prompt template, take the response from my chat AI, and use these instructions to grade it. The instructions say to grade it on a scale of one to five, and here are examples of what a one looks like, what a two looks like; use those to figure out the grade and give it to me. So I have an example of a custom evaluator. But say I'm somebody who wants to write a new custom evaluator for my app. What I'm going to say is, hey, I want a new metric called Emojiness.
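The LLM-as-a-judge pattern described above can be sketched in a few lines of plain Python. This is not the Prompty implementation from the demo; it's a minimal illustration in which `call_model` is a hypothetical stand-in for whatever chat-completion client you use, and the rubric text is an assumed example.

```python
from typing import Callable

# Hypothetical rubric, in the spirit of the coherence evaluator from the demo:
# grade 1-5, with the extremes of the scale described.
COHERENCE_RUBRIC = """\
Grade the RESPONSE for coherence on a scale of 1 to 5.
1 = incoherent or self-contradictory; 5 = clear and logically consistent.
Reply with only the number.

QUESTION: {question}
RESPONSE: {response}
"""

def grade_coherence(question: str, response: str,
                    call_model: Callable[[str], str]) -> int:
    """Ask a judge model for a 1-5 score and parse its reply."""
    raw = call_model(COHERENCE_RUBRIC.format(question=question,
                                             response=response))
    score = int(raw.strip().split()[0])   # take the leading number
    return min(max(score, 1), 5)          # clamp to the rubric's range

# Usage with a stubbed judge that always answers "4":
print(grade_coherence("What are your store hours?",
                      "We are open 9 to 5 on weekdays.",
                      lambda prompt: "4"))  # -> 4
```

In a real setup, `call_model` would be a second model call; stubbing it out like this also makes the parsing and clamping logic easy to unit-test.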
And what I want you to do is take this coherence evaluator as an example and create a new evaluator for me that builds a rating from one to five based on how many emojis there are in that particular response.
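In the demo, the Emojiness evaluator is generated as a Prompty prompt so an LLM assigns the grade. As a rough, deterministic sketch of the same metric, you could also count emoji code points directly; the ranges below are an approximation of the common emoji blocks, and the mapping onto the one-to-five scale is an assumed choice.

```python
def count_emojis(text: str) -> int:
    """Approximate emoji count: characters in the main emoji code-point blocks."""
    return sum(1 for ch in text
               if 0x1F300 <= ord(ch) <= 0x1FAFF    # pictographs, faces, symbols
               or 0x2600 <= ord(ch) <= 0x27BF)     # misc symbols and dingbats

def emojiness(response: str) -> int:
    """Map the emoji count onto the 1-5 scale the other evaluators use."""
    return min(count_emojis(response) + 1, 5)
```

A deterministic metric like this is cheaper and reproducible; the LLM-graded version from the demo is better when "emojiness" should account for tone and context rather than a raw count.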