Video Summary and Transcription
Manoj Sureddy discusses building a toolkit for prompt engineering with LLM-based solutions, emphasizing the need for a structured, React-like approach. The toolkit provides organized and reusable prompt templates for various LLM-based solutions, integrates with version control and a CI/CD pipeline for automated evaluations, offers advanced quality evaluation mechanisms using models such as Gemma, and incorporates human-in-the-loop evaluations. The talk covers maintaining prompt quality, subjective metrics in evaluations, and insights on prompt drift, versioning, real user feedback, and evaluation automation.
1. Prompt Engineering Toolkit for LLM Solutions
Manoj Sureddy discusses building a toolkit for prompt engineering with LLM-based solutions, addressing the challenges of developing and maintaining prompts with limited reusability and organization, and emphasizing the need for a structured, React-like approach to prompts.
Hey, everyone. I'm Manoj Sureddy. I work as a staff software engineer at Uber. I lead the customer support automations and generative AI chatbots team. Today, I'm going to talk about how we built a prompt engineering toolkit that allows you to use LLM-based solutions at scale. With the advent of ChatGPT and other LLMs, there has been an explosion of generative AI use across various products in multiple companies, and that has been the same trend for us.
And let's face it, LLMs are amazing, but when you're using them in production, not so much, because we have to deal with a lot of nuances that come with such smart solutions. Majorly, the overall workflow of developing any prompt is very ad hoc and manual in nature, because you have to iterate, perform a lot of trial and error on these prompts. And in a majority of use cases, you will be maintaining these prompts in either code or Google Docs or any random notebooks, essentially.
So there is no clear way of discovering what these prompts do. And also, reusability of these prompts is something which is very sparse across various companies. So, primarily, let's say a specific prompt engineering or prompt tuning technique worked well for prompt A; readily applying it on prompt B is pretty much redoing most of the work you have done, basically the trial and error and all the other testing mechanisms that go with it. Other than this, discovering what techniques worked well and how we can learn from other engineers who built similar sorts of prompts is something which is non-existent in most of the workflows here.
2. Structured Approach for Prompt Development
Addressing the challenges of prompt growth and complexity, the toolkit provides a structured approach for prompt development, ensuring organized and reusable templates for various LLM-based solutions like RAG and few-shot prompting. Dynamic data injection, repository maintenance, and shared tuning mechanisms enhance development velocity.
And also, as the prompt grows, it becomes more and more brittle and non-deterministic, because it leads to a lot of hallucinations and all the other side effects of using LLMs. And with this growing complexity, the engineering velocity also drops. Sounds familiar? Yes. In order to make sure that prompt development is less chaotic, the challenge for us is to bring order to this chaos.
In order to make sure that prompt development is as organized as possible and we could bring that order to chaos, as we were talking about in the previous slide, that's where the prompt engineering toolkit comes into play. It gives developers a clear framework on how to author, version, and test prompts. You save these prompts as templates, which allows you to reuse them across various use cases in conjunction with other LLM-based solutions such as RAG or few-shot and zero-shot example-based prompting.
And how you can dynamically inject those examples into these prompts. This comes down to supporting runtime dynamic data substitution on these prompts, as well as the maintenance of a huge repository of prompts, which allows you to learn from other engineers, look at other prompts, and identify how you can reuse portions of those prompts or reuse the prompts themselves. It also makes sure that their tuning mechanisms are readily shared with other prompt engineers. All of this allows you to gain a lot of velocity in developing these prompts.
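As a rough illustration of what that runtime data substitution might look like, here is a minimal sketch using Python's string templates; the template text and variable names are hypothetical, not the toolkit's actual format.

```python
# A minimal sketch of runtime substitution into a shared prompt template.
# The template text and variable names are illustrative only.
from string import Template

# A template pulled from a shared prompt repository might carry placeholders
# for data that is only known at request time.
shared_template = Template(
    "You are a support assistant.\n"
    "Relevant context: $context\n"
    "User question: $question"
)

# At runtime, the caller injects the dynamic values before sending the prompt.
prompt = shared_template.safe_substitute(
    context="Order #123 was delivered two days late.",
    question="Why was my delivery late?",
)
print(prompt)
```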
3. React-Like Functionality for Prompt Development
Structured, composable, and testable prompts. Integration with version control and a CI/CD pipeline for automated evaluations. Templates with system instructions and model parameters, allowing easy model switching and integration via an API gateway.
Think of it like React for prompts: structured, composable, and testable in nature. Developers can focus mainly on the logic, while the repeated boilerplate, golden datasets, and other major examples are available for them to reuse from other repositories. We also integrated with version control and a CI/CD pipeline so that you can run automated evaluations on these prompt templates as soon as they are committed. This allows you to identify regressions as well as deviations from the original solution in a more metric-oriented manner.
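As a sketch of the kind of automated check such a pipeline could run on every commit (the report file layout, field names, and pass-rate threshold here are assumptions, not the toolkit's real format):

```python
# A rough sketch of a CI gate: load stored evaluation reports per template and
# fail the build when a template's pass rate drops below a threshold.
import json
import sys
from pathlib import Path

PASS_RATE_THRESHOLD = 0.95  # assumed minimum acceptable pass rate

def check_for_regressions(results_dir: str = "eval_results") -> int:
    regressions = []
    for report_path in Path(results_dir).glob("*.json"):
        report = json.loads(report_path.read_text())
        cases = report.get("cases", [])
        passed = sum(1 for case in cases if case.get("passed"))
        pass_rate = passed / len(cases) if cases else 0.0
        if pass_rate < PASS_RATE_THRESHOLD:
            regressions.append(f"{report.get('template', report_path.stem)}: {pass_rate:.0%}")
    for entry in regressions:
        print("REGRESSION:", entry)
    return 1 if regressions else 0  # non-zero exit fails the CI step

if __name__ == "__main__":
    sys.exit(check_for_regressions())
```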
Let us go into one of the templates and see how it works. So if you see here, the template contains a name, description, system instructions, and model parameters. This one is a simple question-answer bot where we are asking it to answer the user's questions. It is very rudimentary in nature. You will see that the model we are using is Llama, and we have set the temperature to 0.5 and the max tokens to 100. Now you can quickly run this prompt and see how it executes.
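Written out as plain data, that demo template can be pictured roughly like this; the field names are illustrative, and only the values mirror what is described above.

```python
# Roughly the shape of the template shown in the demo, written out as a plain
# Python dict. Field names are illustrative; the values mirror the demo.
qa_template = {
    "name": "simple_qa_bot",
    "description": "A simple question-answer bot.",
    "system_instructions": "Answer the user's questions clearly and concisely.",
    "model_parameters": {
        "model": "llama",
        "temperature": 0.5,
        "max_tokens": 100,
    },
}
```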
Well, it is answering regarding penguins. How is this happening? So let's go to the test. Here you can see the overall prompt toolkit provides you a client where you can pass a set of messages. Each can be a conversation turn from the user, and here the user is asking for fun facts regarding penguins. And the Llama model has returned the response. Now let's say you are using the Llama model and you want to switch to Gemma. You need not create different integrations as such. This toolkit integrates with almost all the models via its API gateway. For this demonstration, I am using a common public gateway, but you can use any of your API gateways to do this.
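A minimal sketch of what such a gateway-backed client call might look like; the class and method names are hypothetical stand-ins rather than the toolkit's real interface.

```python
# A sketch of a gateway-backed client: switching models is just a parameter
# change because every model sits behind the same API gateway.
from dataclasses import dataclass

@dataclass
class GatewayClient:
    """Routes chat requests to any configured model behind one API gateway."""
    gateway_url: str

    def chat(self, model: str, messages: list[dict], **params) -> str:
        # In a real setup this would POST to the gateway; here we only show
        # the shape of the call to illustrate model switching.
        return f"[{model} response via {self.gateway_url}]"

client = GatewayClient(gateway_url="https://example-gateway.internal/v1")
messages = [{"role": "user", "content": "Tell me a fun fact about penguins."}]

# Same messages, different model: moving from Llama to Gemma is one argument.
llama_reply = client.chat(model="llama", messages=messages, temperature=0.5)
gemma_reply = client.chat(model="gemma", messages=messages, temperature=0.5)
```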
4. Advanced Quality Evaluation Mechanisms for Prompts
Using Gemma with additional templates for prompt enrichment. Importance of maintaining prompt quality. Evaluation mechanisms: LLM judge-based and human in the loop.
So, if you see here, it is using Gemma, and it has the same prompt. But if you see, there is an additional template here. These kinds of templates allow you to inject examples, or additional parameters from RAG-based queries or the few-shot examples that you maintain, which enriches your prompt. You can perform the same execution, and it pretty much returns the result on this, and the tests are similar.
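A rough sketch of that kind of enrichment step, with assumed formatting conventions:

```python
# A sketch of enriching a base prompt with few-shot examples and retrieved
# (RAG) context before execution; the formatting here is an assumption.
def enrich_prompt(base_instructions: str,
                  examples: list[tuple[str, str]],
                  retrieved_chunks: list[str]) -> str:
    example_block = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    context_block = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        f"{base_instructions}\n\n"
        f"Examples:\n{example_block}\n\n"
        f"Relevant context:\n{context_block}"
    )

enriched = enrich_prompt(
    "Answer the user's question concisely.",
    examples=[("What do penguins eat?", "Mostly fish, squid, and krill.")],
    retrieved_chunks=["Penguins are flightless seabirds of the Southern Hemisphere."],
)
print(enriched)
```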
Now let's talk about quality. If a prompt breaks in production, it is game over. As you go through a bunch of iterations in your prompt development lifecycle, regressions are a given. So, you have to make sure that your prompt iterations are maintaining the same level of quality as the previous one. The prompt toolkit provides you a mechanism to evaluate your prompts in two ways. The first is LLM-as-a-judge-based evaluation, which is primarily an automated evaluation mechanism. The second one is human-in-the-loop evaluation.
Let's talk about the first one. Here, we use a larger language model, usually one which has been benchmarked for quality, and we run it as a judge: it runs the same prompt and identifies whether the response matches the test response. On the right side, if you see, we have added a couple of tests. We'll go into the working of it in a minute. You have the test name, input, and output. The judge LLM would basically generate the response and compare it semantically to determine whether it is true or false.
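A minimal sketch of this LLM-as-a-judge flow, assuming a hypothetical call_llm helper in place of the real gateway call; the judge wording and all names are illustrative.

```python
# Sketch of LLM-as-a-judge: each test has a name, an input, and an expected
# output, and a quality-benchmarked model decides whether the candidate
# response matches the expected one semantically.
from dataclasses import dataclass

@dataclass
class PromptTest:
    name: str
    test_input: str
    expected_output: str

def call_llm(model: str, prompt: str) -> str:
    # Placeholder: in a real setup this call goes through the API gateway.
    return "true"

def judge(test: PromptTest, candidate_output: str,
          judge_model: str = "quality-benchmarked-model") -> tuple[bool, str]:
    """Ask the judge model whether candidate and expected outputs match."""
    verdict = call_llm(
        judge_model,
        "Compare the two answers below. Reply 'true' if they are semantically "
        "equivalent; otherwise reply 'false' followed by the reason.\n"
        f"Expected: {test.expected_output}\n"
        f"Candidate: {candidate_output}",
    )
    passed = verdict.strip().lower().startswith("true")
    return passed, "" if passed else verdict
```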
5. Integration of Human in the Loop Evaluations
Subjective metrics in evaluations, including tone and correctness. Integration of human in the loop evaluations with LLM judging. Sampling responses to maintain quality evaluations and identify edge cases.
These are subjective metrics, majorly focusing on tone and style as well as the correctness and conciseness of the response. Human-in-the-loop evaluation, meanwhile, detects the nuances, intent, and tone of the response itself, and also flags edge cases and hallucinations. This can be fed back into the LLM-as-a-judge evaluation by updating your golden datasets or your tests in such a way that the new human evaluations become automated. This happens as a lifecycle, and you can eventually build a very robust set of test cases while still identifying any edge cases using human-in-the-loop evaluation.
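One way to picture that feedback lifecycle is a small routine that appends human-reviewed cases to the golden dataset; the record shape, field names, file name, and example content here are assumptions for illustration only.

```python
# Sketch of folding human-in-the-loop verdicts back into the golden dataset
# so that future LLM-as-a-judge runs cover them automatically.
import json
from pathlib import Path

GOLDEN_DATASET = Path("golden_dataset.jsonl")  # assumed storage format

def fold_in_human_review(flagged_cases: list[dict]) -> None:
    """Append human-reviewed edge cases to the golden test set."""
    with GOLDEN_DATASET.open("a") as f:
        for case in flagged_cases:
            record = {
                "name": case["name"],
                "input": case["input"],
                "expected_output": case["corrected_output"],  # reviewer's fix
                "source": "human_in_the_loop",
            }
            f.write(json.dumps(record) + "\n")

# Hypothetical example: a reviewer flagged a hallucinated answer and supplied
# a corrected expected output.
fold_in_human_review([{
    "name": "hallucinated_refund_policy",
    "input": "What is the refund window for my order?",
    "corrected_output": "State the refund window exactly as documented.",
}])
```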
Usually, we do sampling of responses so that we are not doing a very large-scale human-in-the-loop evaluation; we review a small subset of the overall responses. Let us look at how it works. You have previously seen that this is the template. Now let's see how tests can be added to it. Here, if you see, we have a couple of tests, one positive and one negative. I have added a contrary example here.
This prompt is summarizing customer support tickets, and we are asking it to return a short response. We are using the same model here, with the same temperature and max token constraints. If you see the first one, the user is providing a review and we are basically summarizing it. In the second one, the user is thanking us for a faster delivery, and we have marked it as a negative test output. Let us just run it and see. If you see, test 2 failed as expected, and we are providing the reasoning for why it failed.
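Paraphrased loosely, the two demo tests have roughly this shape; the wording is illustrative rather than the demo's exact text.

```python
# The demo's positive and contrary (negative) tests, paraphrased as records.
tests = [
    {
        "name": "test_1_summarize_review",
        "input": "A customer review describing a problem with their order.",
        "expected_output": "A short summary of the customer's problem.",
    },
    {
        "name": "test_2_contrary_example",  # deliberately mismatched, expected to fail
        "input": "A customer thanking support for a faster-than-expected delivery.",
        "expected_output": "A summary complaining about a late delivery.",
    },
]
```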
6. Prompt Test Response Evaluation and Insights
Identifying test responses, maintaining test readiness, and preventing prompt regression. Evaluator overview, template structure, and testing approach. Insights on prompt drift, versioning, real user feedback, and evaluation automation.
The overall test response we expect to see is a positive response, but we have identified a negative one, and we are able to flag that. The second test passed because it is something we had expected. You can run these kinds of tests on these templates, and as you iterate on these prompts, you can keep these test cases ready so that you follow a more test-driven-development kind of approach and your prompt does not regress or deviate from the expected quality datasets.
Now let us look at the evaluator quickly. The evaluator is doing nothing but a small evaluation where it augments the initial prompt for each test. This augmenter dynamically substitutes the responses into the template itself. Let us look at the test template to get a better idea of it. Here, the test template primarily sets a persona for the LLM. It asks it to generate an output based on the input prompt and the test input. We have given a small template, and then we ask the LLM to return true or false: if it is true, do not return any content; if it is false, return the reason for it. We are using pretty much the same model and parameters here. This is just for demonstration purposes.
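A rough sketch of that test template and the evaluation loop around it; the persona wording, placeholder names, and the call_llm stub are assumptions rather than the toolkit's actual implementation.

```python
# Sketch of the judge test template plus the loop that augments it per test,
# runs it, and logs the verdict. An empty response means the test passed.
import logging
from string import Template

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("prompt_eval")

TEST_TEMPLATE = Template(
    "You are a meticulous prompt evaluator.\n"
    "Given the prompt instructions and a test input, generate the output the "
    "prompt should produce and compare it with the expected output.\n"
    "Prompt instructions: $instructions\n"
    "Test input: $test_input\n"
    "Expected output: $expected_output\n"
    "If they match, return nothing. If they do not match, return the reason."
)

def call_llm(model: str, prompt: str) -> str:
    # Placeholder for a real gateway call.
    return ""

def run_evaluation(tests: list[dict], instructions: str,
                   model: str = "llama") -> bool:
    """Augment the judge template for each test, run it, and log the verdict."""
    all_passed = True
    for test in tests:
        augmented = TEST_TEMPLATE.substitute(
            instructions=instructions,
            test_input=test["input"],
            expected_output=test["expected_output"],
        )
        verdict = call_llm(model, augmented)
        passed = verdict.strip() == ""  # empty response means the test passed
        all_passed = all_passed and passed
        log.info("%s: %s %s", test["name"], "PASS" if passed else "FAIL",
                 verdict.strip())
    return all_passed
```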
Now when you go into the evaluator, we primarily run these tests against the LLM itself and log the responses as expected. This is how we run these tests and make sure that every prompt goes through this evaluation pipeline. We automate this entire evaluation within our ecosystem and make sure that the prompts are not deviating much from the expected output quality. Now let us see what we have learned from developing this toolkit. The major learning is that prompt drift is real: any small change in your prompt can cause it to deviate drastically in some scenarios. Treating your prompts as code, by versioning, testing, and reviewing them regularly, ensures that you maintain the same level of production quality in your products. And prompt quality usually improves with real user feedback.