Video Summary and Transcription
Manoj Sureddy discusses building a toolkit for prompt engineering with LLM-based solutions, emphasizing the need for a structured, React-like approach. The toolkit provides organized and reusable prompt templates for various LLM-based solutions, integrates with version control and a CI/CD pipeline for automated evaluations, offers advanced quality evaluation mechanisms using models such as Gemma, and incorporates human-in-the-loop evaluations. The talk covers maintaining prompt quality, subjective metrics in evaluations, and insights on prompt drift, versioning, real user feedback, and evaluation automation.
1. Prompt Engineering Toolkit for LLM Solutions
Manoj Sureddy discusses building a toolkit for prompt engineering with LLM-based solutions, addressing the challenges of developing and maintaining prompts with limited reusability and organization, and emphasizing the need for a structured, React-like approach to prompts.
Hey, everyone. I'm Manoj Sureddy. I work as a staff software engineer at Uber. I lead the customer support automations and generative AI chatbots team. Today, I'm going to talk about how we built a prompt engineering toolkit that allows you to use LLM-based solutions at scale. With the advent of ChatGPT and other LLMs, there has been an explosion of generative AI use across various products in multiple companies, and that has been the same trend for us.
And let's face it, LLMs are amazing, but when you're using them in production, not so much, because we have to deal with a lot of nuances that come with such smart solutions. Majorly, the overall workflow of developing any prompt is very ad hoc and manual in nature, because you have to iterate, perform a lot of trial and error on these prompts. And in a majority of use cases, you will be maintaining these prompts in either code or Google Docs or any random notebooks, essentially.
So there is no clear way of discovering what these prompts do. And also, reusability of these prompts is something which is very sparse across various companies. So, primarily, let's say a specific prompt engineering or prompt tuning technique worked well for prompt A; readily applying it on prompt B is pretty much redoing most of the work you have done, basically the trial and error and all the other testing mechanisms that go with it. Other than this, discovering what techniques worked well and how we can learn from other engineers who built similar sorts of prompts is something which is non-existent in most of the workflows here.
2. Structured Approach for Prompt Development
Addressing the challenges of prompt growth and complexity, the toolkit provides a structured approach for prompt development, ensuring organized and reusable templates for various LLM-based solutions like RAG and few-shot prompting. Dynamic data injection, repository maintenance, and shared tuning mechanisms enhance development velocity.
And also, as the prompt grows, it becomes more and more brittle and non-deterministic, because it leads to a lot of hallucinations and all the other side effects of using LLMs. And with this growing complexity, the engineering velocity also drops. Sounds familiar? Yes. In order to make sure that prompt development is less chaotic, the challenge for us is to bring order to this chaos.
In order to make sure that prompt development is as organized as possible and we could bring that order to chaos, as we were talking about in the previous slide, that's where the prompt engineering toolkit comes into play. It gives developers a clear framework on how to author, version, and test prompts. You save these prompts as templates, which allows you to reuse them across various use cases in conjunction with other LLM-based solutions such as RAG or few-shot and zero-shot example-based prompting.
And how you can dynamically inject those examples into these prompts. This comes down to supporting runtime dynamic data substitution on these prompts, as well as the maintenance of a huge repository of prompts, which allows you to learn from other engineers, look at other prompts, and identify how you can reuse portions of those prompts or reuse the prompts themselves. It also makes sure that their tuning mechanisms are readily shared with other prompt engineers. All of this allows you to gain a lot of velocity in developing these prompts.
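As a rough illustration of what that runtime data substitution might look like, here is a minimal sketch using Python's string templates; the template text and variable names are hypothetical, not the toolkit's actual format.

```python
# A minimal sketch of runtime substitution into a shared prompt template.
# The template text and variable names are illustrative only.
from string import Template

# A template pulled from a shared prompt repository might carry placeholders
# for data that is only known at request time.
shared_template = Template(
    "You are a support assistant.\n"
    "Relevant context: $context\n"
    "User question: $question"
)

# At runtime, the caller injects the dynamic values before sending the prompt.
prompt = shared_template.safe_substitute(
    context="Order #123 was delivered two days late.",
    question="Why was my delivery late?",
)
print(prompt)
```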
3. React-Like Functionality for Prompt Development
Structured, composable, and testable prompts. Integration with version control and a CI/CD pipeline for automated evaluations. Templates with system instructions and model parameters, allowing easy model switching and integration via an API gateway.
Think of it like React for prompts: structured, composable, and testable in nature. Developers can focus mainly on the logic, while the repeated boilerplate, golden datasets, and other major examples are available for them to reuse from other repositories. We also integrated with version control and a CI/CD pipeline so that you can run automated evaluations on these prompt templates as soon as they are committed. This allows you to identify regressions as well as deviations from the original solution in a more metric-oriented manner.
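As a sketch of the kind of automated check such a pipeline could run on every commit (the report file layout, field names, and pass-rate threshold here are assumptions, not the toolkit's real format):

```python
# A rough sketch of a CI gate: load stored evaluation reports per template and
# fail the build when a template's pass rate drops below a threshold.
import json
import sys
from pathlib import Path

PASS_RATE_THRESHOLD = 0.95  # assumed minimum acceptable pass rate

def check_for_regressions(results_dir: str = "eval_results") -> int:
    regressions = []
    for report_path in Path(results_dir).glob("*.json"):
        report = json.loads(report_path.read_text())
        cases = report.get("cases", [])
        passed = sum(1 for case in cases if case.get("passed"))
        pass_rate = passed / len(cases) if cases else 0.0
        if pass_rate < PASS_RATE_THRESHOLD:
            regressions.append(f"{report.get('template', report_path.stem)}: {pass_rate:.0%}")
    for entry in regressions:
        print("REGRESSION:", entry)
    return 1 if regressions else 0  # non-zero exit fails the CI step

if __name__ == "__main__":
    sys.exit(check_for_regressions())
```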
Let us go into one of the templates and see how it works. So if you see here, the template contains a name, description, system instructions, and model parameters. This one is a simple question-answer bot where we are asking it to answer the user's questions. It is very rudimentary in nature. You will see that the model we are using is Llama, and we have set the temperature to 0.5 and the max tokens to 100. Now you can quickly run this prompt and see how it executes.
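Written out as plain data, that demo template can be pictured roughly like this; the field names are illustrative, and only the values mirror what is described above.

```python
# Roughly the shape of the template shown in the demo, written out as a plain
# Python dict. Field names are illustrative; the values mirror the demo.
qa_template = {
    "name": "simple_qa_bot",
    "description": "A simple question-answer bot.",
    "system_instructions": "Answer the user's questions clearly and concisely.",
    "model_parameters": {
        "model": "llama",
        "temperature": 0.5,
        "max_tokens": 100,
    },
}
```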
Well, it is answering regarding penguins. How is this happening? So let's go to the test. Here you can see the overall prompt toolkit provides you a client where you can pass a set of messages. Each can be a conversation turn from the user, and here the user is asking for fun facts regarding penguins. And the Llama model has returned the response. Now let's say you are using the Llama model and you want to switch to Gemma. You need not create different integrations as such. This toolkit integrates with almost all the models via its API gateway. For this demonstration, I am using a common public gateway, but you can use any of your API gateways to do this.
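A minimal sketch of what such a gateway-backed client call might look like; the class and method names are hypothetical stand-ins rather than the toolkit's real interface.

```python
# A sketch of a gateway-backed client: switching models is just a parameter
# change because every model sits behind the same API gateway.
from dataclasses import dataclass

@dataclass
class GatewayClient:
    """Routes chat requests to any configured model behind one API gateway."""
    gateway_url: str

    def chat(self, model: str, messages: list[dict], **params) -> str:
        # In a real setup this would POST to the gateway; here we only show
        # the shape of the call to illustrate model switching.
        return f"[{model} response via {self.gateway_url}]"

client = GatewayClient(gateway_url="https://example-gateway.internal/v1")
messages = [{"role": "user", "content": "Tell me a fun fact about penguins."}]

# Same messages, different model: moving from Llama to Gemma is one argument.
llama_reply = client.chat(model="llama", messages=messages, temperature=0.5)
gemma_reply = client.chat(model="gemma", messages=messages, temperature=0.5)
```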
4. Advanced Quality Evaluation Mechanisms for Prompts
Using Gemma with additional templates for prompt enrichment. Importance of maintaining prompt quality. Evaluation mechanisms: LLM judge-based and human in the loop.
So, if you see here, it is using Gemma, and it has the same prompt. But if you see, there is an additional template here. These kinds of templates allow you to inject examples, or additional parameters from RAG-based queries or the few-shot examples that you maintain, which enriches your prompt. You can perform the same execution, and it pretty much returns the result on this, and the tests are similar.
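A rough sketch of that kind of enrichment step, with assumed formatting conventions:

```python
# A sketch of enriching a base prompt with few-shot examples and retrieved
# (RAG) context before execution; the formatting here is an assumption.
def enrich_prompt(base_instructions: str,
                  examples: list[tuple[str, str]],
                  retrieved_chunks: list[str]) -> str:
    example_block = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    context_block = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        f"{base_instructions}\n\n"
        f"Examples:\n{example_block}\n\n"
        f"Relevant context:\n{context_block}"
    )

enriched = enrich_prompt(
    "Answer the user's question concisely.",
    examples=[("What do penguins eat?", "Mostly fish, squid, and krill.")],
    retrieved_chunks=["Penguins are flightless seabirds of the Southern Hemisphere."],
)
print(enriched)
```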
Now let's talk about quality. If a prompt breaks in production, it is game over. As you go through a bunch of iterations in your prompt development lifecycle, regressions are a given. So, you have to make sure that your prompt iterations are maintaining the same level of quality as the previous one. The prompt toolkit provides you a mechanism to evaluate your prompts in two ways. The first is LLM-as-a-judge-based evaluation, which is primarily an automated evaluation mechanism. The second one is human-in-the-loop evaluation.
Let's talk about the first one. Here, we use a larger language model, usually one which has been benchmarked for quality, and we run it as a judge: it runs the same prompt and identifies whether the response matches the test response. On the right side, if you see, we have added a couple of tests. We'll go into the working of it in a minute. You have the test name, input, and output. The judge LLM would basically generate the response and compare it semantically to determine whether it is true or false.
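A minimal sketch of this LLM-as-a-judge flow, assuming a hypothetical call_llm helper in place of the real gateway call; the judge wording and all names are illustrative.

```python
# Sketch of LLM-as-a-judge: each test has a name, an input, and an expected
# output, and a quality-benchmarked model decides whether the candidate
# response matches the expected one semantically.
from dataclasses import dataclass

@dataclass
class PromptTest:
    name: str
    test_input: str
    expected_output: str

def call_llm(model: str, prompt: str) -> str:
    # Placeholder: in a real setup this call goes through the API gateway.
    return "true"

def judge(test: PromptTest, candidate_output: str,
          judge_model: str = "quality-benchmarked-model") -> tuple[bool, str]:
    """Ask the judge model whether candidate and expected outputs match."""
    verdict = call_llm(
        judge_model,
        "Compare the two answers below. Reply 'true' if they are semantically "
        "equivalent; otherwise reply 'false' followed by the reason.\n"
        f"Expected: {test.expected_output}\n"
        f"Candidate: {candidate_output}",
    )
    passed = verdict.strip().lower().startswith("true")
    return passed, "" if passed else verdict
```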
5. Integration of Human in the Loop Evaluations
Subjective metrics in evaluations, including tone and correctness. Integration of human in the loop evaluations with LLM judging. Sampling responses to maintain quality evaluations and identify edge cases.
These are subjective metrics, majorly focusing on tone and style as well as the correctness and conciseness of the response. Human-in-the-loop evaluation, meanwhile, detects the nuances, intent, and tone of the response itself, and also flags edge cases and hallucinations. This can be fed back into the LLM-as-a-judge evaluation by updating your golden datasets or your tests in such a way that the new human evaluations become automated. This happens as a lifecycle, and you can eventually build a very robust set of test cases while still identifying any edge cases using human-in-the-loop evaluation.
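One way to picture that feedback lifecycle is a small routine that appends human-reviewed cases to the golden dataset; the record shape, field names, file name, and example content here are assumptions for illustration only.

```python
# Sketch of folding human-in-the-loop verdicts back into the golden dataset
# so that future LLM-as-a-judge runs cover them automatically.
import json
from pathlib import Path

GOLDEN_DATASET = Path("golden_dataset.jsonl")  # assumed storage format

def fold_in_human_review(flagged_cases: list[dict]) -> None:
    """Append human-reviewed edge cases to the golden test set."""
    with GOLDEN_DATASET.open("a") as f:
        for case in flagged_cases:
            record = {
                "name": case["name"],
                "input": case["input"],
                "expected_output": case["corrected_output"],  # reviewer's fix
                "source": "human_in_the_loop",
            }
            f.write(json.dumps(record) + "\n")

# Hypothetical example: a reviewer flagged a hallucinated answer and supplied
# a corrected expected output.
fold_in_human_review([{
    "name": "hallucinated_refund_policy",
    "input": "What is the refund window for my order?",
    "corrected_output": "State the refund window exactly as documented.",
}])
```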
Usually, we do sampling of responses so that we are not doing a very large-scale human-in-the-loop evaluation; we review a small subset of the overall responses. Let us look at how it works. You have previously seen that this is the template. Now let's see how tests can be added to it. Here, if you see, we have a couple of tests, one positive and one negative. I have added a contrary example here.
This prompt is summarizing customer support tickets, and we are asking it to return a short response. We are using the same model here, with the same temperature and max token constraints. If you see the first one, the user is providing a review and we are basically summarizing it. In the second one, the user is thanking us for a faster delivery, and we have marked it as a negative test output. Let us just run it and see. If you see, test 2 failed as expected, and we are providing the reasoning for why it failed.
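Paraphrased loosely, the two demo tests have roughly this shape; the wording is illustrative rather than the demo's exact text.

```python
# The demo's positive and contrary (negative) tests, paraphrased as records.
tests = [
    {
        "name": "test_1_summarize_review",
        "input": "A customer review describing a problem with their order.",
        "expected_output": "A short summary of the customer's problem.",
    },
    {
        "name": "test_2_contrary_example",  # deliberately mismatched, expected to fail
        "input": "A customer thanking support for a faster-than-expected delivery.",
        "expected_output": "A summary complaining about a late delivery.",
    },
]
```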
6. Prompt Test Response Evaluation and Insights
Identifying test responses, maintaining test readiness, and preventing prompt regression. Evaluator overview, template structure, and testing approach. Insights on prompt drift, versioning, real user feedback, and evaluation automation.
The overall test response we expect to see is a positive response, but we have identified a negative one, and we are able to flag that. The second test passed because it is something we had expected. You can run these kinds of tests on these templates, and as you iterate on these prompts, you can keep these test cases ready so that you follow a more test-driven-development kind of approach and your prompt does not regress or deviate from the expected quality datasets.
Now let us look at the evaluator quickly. The evaluator is doing nothing but a small evaluation where it augments the initial prompt for each test. This augmenter dynamically substitutes the responses into the template itself. Let us look at the test template to get a better idea of it. Here, the test template primarily sets a persona for the LLM. It asks it to generate an output based on the input prompt and the test input. We have given a small template, and then we ask the LLM to return true or false: if it is true, do not return any content; if it is false, return the reason for it. We are using pretty much the same model and parameters here. This is just for demonstration purposes.
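A rough sketch of that test template and the evaluation loop around it; the persona wording, placeholder names, and the call_llm stub are assumptions rather than the toolkit's actual implementation.

```python
# Sketch of the judge test template plus the loop that augments it per test,
# runs it, and logs the verdict. An empty response means the test passed.
import logging
from string import Template

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("prompt_eval")

TEST_TEMPLATE = Template(
    "You are a meticulous prompt evaluator.\n"
    "Given the prompt instructions and a test input, generate the output the "
    "prompt should produce and compare it with the expected output.\n"
    "Prompt instructions: $instructions\n"
    "Test input: $test_input\n"
    "Expected output: $expected_output\n"
    "If they match, return nothing. If they do not match, return the reason."
)

def call_llm(model: str, prompt: str) -> str:
    # Placeholder for a real gateway call.
    return ""

def run_evaluation(tests: list[dict], instructions: str,
                   model: str = "llama") -> bool:
    """Augment the judge template for each test, run it, and log the verdict."""
    all_passed = True
    for test in tests:
        augmented = TEST_TEMPLATE.substitute(
            instructions=instructions,
            test_input=test["input"],
            expected_output=test["expected_output"],
        )
        verdict = call_llm(model, augmented)
        passed = verdict.strip() == ""  # empty response means the test passed
        all_passed = all_passed and passed
        log.info("%s: %s %s", test["name"], "PASS" if passed else "FAIL",
                 verdict.strip())
    return all_passed
```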
Now when you go into the evaluator, we primarily run these tests against the LLM itself and log the responses as expected. This is how we run these tests and make sure that every prompt goes through this evaluation pipeline. We automate this entire evaluation within our ecosystem and make sure that the prompts are not deviating much from the expected output quality. Now let us see what we have learned from developing this toolkit. The major learning is that prompt drift is real: any small change in your prompt can cause it to deviate drastically in some scenarios. Treating your prompts as code, by versioning, testing, and reviewing them regularly, ensures that you maintain the same level of production quality in your products. And prompt quality usually improves with real user feedback.