Video Summary and Transcription
Maya Chavin, a senior software engineer at Microsoft, discusses generative AI and the core models behind LLMs. The flow of a document Q&A service and the importance of prompts in enhancing it are explored. The ingestion and querying phases of document Q&A are explained, emphasizing the need for efficient storage, indexing, and computing relevant prompts. The talk also covers the use of embedding models, optimization strategies, and the challenges of testing and validating AI results. Creative uses of LLMs and the impact of AI on job security are mentioned.
1. Introduction to Generative AI and LLMs
Hi, everyone. I'm Maya Chavin, a senior software engineer at Microsoft. Today's talk is about generative AI and the core models behind LLMs. We'll discuss the flow of a document Q&A service and how to enhance it using prompts. An LLM is a large language model that can process human input and is trained on its own data. It works with tokens. A token is a piece of a word; text has to be translated into tokens for the model to understand it. To count tokens, we can use a token counter application.
Hi, everyone. Have you had your lunch? Are you awake or sleepy? Okay, because I don't have real coffee here, I hope you already had your coffee. If not, I'm sorry, but this is going to be the most boring talk of your life. No, I really hope not. But anyway, first and foremost, my name is Maya Chavin. I'm a senior software engineer at Microsoft, working in a team called Microsoft Industrial AI, where we leverage different AI technologies to build AI-integrated solutions and applications for specific industries.
Sorry, I lost my voice during the flight, so I don't know what happened. If it's hard for you to understand me, I'm really sorry, and if you want to understand me better, please feel free to contact me after the talk, okay? Like the introduction said, I've been working with web and JavaScript and TypeScript, but today's talk has nothing to do with TypeScript or JavaScript. It's about AI. And first and foremost, how many people here are working with AI or generative AI? Okay, so we can skip this slide.
Now, anyway, for people who don't know about generative AI, or maybe know the term but never had a chance to experience it: generative AI is AI that can generate text and media from a variety of input data, which we call prompts, basically text or anything (now we can also send it an image to analyze), and it also learns from its system data. That's what our talk is based on. We will talk about the core models that LLMs, or generative AI, use. The talk will also focus on how we're going to use those models to define the core flow of a very simple service, document Q&A, which you can find a hundred times when you Google for document Q&A using AI. In this talk, we will learn a bit more about the flow behind it, what kind of service we can use for each component inside the flow, and finally how we can enhance and expand the service using prompts, plus techniques to pay attention to when we develop document Q&A as a generic service. Okay.
But first and foremost, LLMs. How many people here are working with an LLM, any LLM? Which LLM do you use? GPT? Text embeddings? DALL-E? Raise your hand. Come on, I believe you already had coffee, right? Anyway, just a recap: an LLM, as a service, is a large language model that is able to process human input. It is also capable of training on its own data, whether supervised or unsupervised, and it works with tokens. The nice thing about an LLM is that it provides you a set of APIs as a black box, which helps developers build AI applications more straightforwardly and simply than before. Okay. Some of the LLM providers we can see here: OpenAI, Google, Microsoft, Meta, Anthropic, Hugging Face. Nothing new here.
So we said LLMs work with tokens, right? What exactly is a token? To put it simply, a token is just a piece of a word, which means every single word in a sentence has to be translated into tokens. And to count the tokens, we have a calculator we can use, called a token counter, which is right here. It's an application where you can write your text, and it will calculate how many tokens it will cost you to pass that string to the AI.
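Exact counts require the model's own tokenizer (for OpenAI models, the tiktoken library), but for quick budgeting you can sketch an estimate with a character-based heuristic. The ~4 characters per token figure is a rough rule of thumb for typical English text, not an exact rule:

```python
# Rough token estimate. OpenAI's docs suggest that typical English
# text averages roughly 4 characters per token; this heuristic is
# only for quick cost budgeting, not for exact billing.

def estimate_tokens(text: str) -> int:
    """Approximate how many tokens a string will consume."""
    return max(1, round(len(text) / 4))

prompt = "Answer the question using only the document below."
print(estimate_tokens(prompt))  # a ballpark figure, ~12 tokens
```

For precise counts (and therefore precise costs), always run the text through the real tokenizer for the model you call.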
2. Core Capabilities for Document Q&A
In this part, we'll discuss the core capabilities for document Q&A, including completion, chat, and retrieval. The completion API allows the AI to complete user tasks, while chat is an extension of completion. Retrieval enables search by generating vector representations of text. Document Q&A is not complex, but it's crucial to implement it correctly to avoid issues like the one Air Canada ran into with its chatbot. Document Q&A as a service is a simple text input and button where users ask questions and receive AI-generated answers.
Okay. So that's tokens, and you can also see the approximate calculation of tokens on the OpenAI website. It's very important, because tokens are money. Literally. With AI, we don't work with money, we work with tokens.
So when we talk about LLM core capabilities, we have several by now, six different ones, and the list keeps growing. In this talk, we will only focus on three core capabilities for document Q&A: completion, chat, and retrieval. Chat is actually an extension of completion, so when you look at a completion API, you will usually see the chat API has a /chat extension. It's not a separate model; it uses the same completion.
So what is the completion API? The completion API allows the AI to complete a task given by the user, and chatting is also a task given by the user. Some of the famous completion models are GPT, Gemini, Claude, and Llama (it's very hard to pronounce these names). Anyway, those are some of the famous completion models we use when we do chat or text completion. The other capability is retrieval. What is retrieval? Retrieval means search. Basically, this is a model that allows you to generate an embedding, a vector representation, of a certain text.
And one of the most popular models for this is text embedding: text-embedding-ada from OpenAI, if you've ever heard of it. We use it a lot to help us create a vector representation of a document, so that the search algorithm can use it to find the matching chunks. So these are the three capabilities that we're going to use a lot in document Q&A. Okay.
But before we move to document Q&A: like I said, document Q&A is not something out of the box. It's not really complex, but it's something that easily goes wrong. For example, Air Canada: their AI went wrong and they had to pay money for it. Now, there's an argument that the chatbot wasn't actually an AI chatbot, that it was written with some dumb algorithm behind it and didn't really use ChatGPT or any GPT. But that's a different story. All I know is that the chatbot went wrong, and now the airline has to pay because it gave misleading information. And that's just one part of the problem document Q&A faces if you don't pay attention to, or don't understand, what you implement. So let's take a look at what document Q&A as a service is. To put it simply, it's just a text input and a button, where the user types in a question and sends it to the AI to ask for an answer.
3. Ingestion and Query Phases for Document Q&A
In the ingestion phase, the service takes in documents so it can process user queries and provide relevant answers. Storage and indexing of document chunks are essential for efficient query processing. The smaller and more relevant the chunks, the fewer tokens needed. Embeddings are used for vector or semantic search. In the querying phase, computing the right prompts and obtaining the relevant data chunks without exceeding token limits is important.
Which means: user asks, AI answers. But not with just anything. It has to stay within the range of a document, which we call grounding. In fact, when you look at this description, two things happen here. The first is the ingestion phase, where the service takes in a lot of documents, or a single document, whether predefined or uploaded on the fly by the user. Then, based on the provided documents, it can process a query or question given by the user and give back an answer with the relevant piece of the document, or data section, from the document given. So we have two flows here, two phases. The most important one is the ingestion phase, because it provides the grounding for the AI to process the user's query and give the right answer. Ingestion and query.
So what do we need to pay attention to in the ingestion phase? Every document, every paragraph, every piece of text is tokens. Again, everything present in a document gets translated to tokens, and that is not inexpensive. How are we going to store the data, store the document, so that the service can find and process the query on the right data? So let's take a look at the ingestion flow. Assume you have several files here: a PDF, a markdown file, even a code file, in case you want to build a document Q&A for your code repo.
What you need to do is load and parse these documents and split them into structured chunks. The smaller and more relevant the chunks, the easier it is to pass them to the AI, and you save yourself a lot of tokens. After that, you need to create chunk embeddings. Like we said, an embedding is a vector representation of a chunk, and this is important because you index these chunks with their embeddings into an index database, so that whenever the service gets a question, it can look for the relevant chunks to inform the question and then pass them to the AI. The important thing to note is that the embedding model here is used for vector search, or semantic search.
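The splitting step can be sketched very simply. This is a minimal illustration, assuming fixed-size character chunks with a small overlap (real splitters, such as LangChain's text splitters, are smarter about sentence and paragraph boundaries):

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap, so sentences
    that straddle a chunk boundary still appear whole in one chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

doc = "A" * 1200
print(len(split_into_chunks(doc)))  # 3 chunks of up to 500 characters
```

Smaller, more focused chunks mean fewer tokens per query, at the cost of more index entries; the right chunk size is something to tune per document type.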
For the querying phase, there are several things we need to pay attention to. First: how do we compute the right prompts? Prompts are also money. Can we use the same prompt for every industry-specific scenario, or do we have to modify it? We also have to think about how to get the right data, the right chunks, instead of passing a whole large document together with the user query to the AI and causing our service to break down because we don't have enough tokens.
4. Querying and Prompting in Document Q&A
In the querying phase, we need to focus on computing the right prompts and obtaining relevant data chunks without exceeding token limits. The flow involves creating embeddings from the input query, computing prompts and summarizations, formatting the answer and chunks, and conducting efficient vector and semantic search to find the best-matching chunks. Additionally, we can improve the prompt computation by providing examples for training the AI model.
In the diagram, this phase is the one on the left side. When it gets the input, it creates embeddings, ready for the search algorithm to query on.
And then: how do we compute the answer with all the metadata, such as citations, names, titles, and so on?
So let's look at the flow. In the query flow, we create an embedding from the input query, because the search algorithm needs it to find matches. Then we compute the prompt together with the chunks and ask the AI for a summarization from the chunks and the query we received. After that, we format the answer and the chunks according to what we want to display to the user, and we return the answer. For the search, let's say vector and semantic search: in this flow, again on the right side, we create the embedding for the query, then pass that embedding to the search algorithm, which looks it up in the stored index, finds the most similar chunks, and returns them to us. And it's very, very efficient. Vector search and semantic search together are very efficient for finding the best-matching chunks.
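The "find the most similar chunks" step is usually a nearest-neighbor search over embeddings. As a minimal sketch (real vector stores like Azure AI Search or Pinecone do this at scale with approximate indexes), similarity is typically cosine similarity, and the top-k chunk IDs are returned:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_emb: list[float], indexed: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """indexed: (chunk_id, embedding) pairs from the index DB.
    Returns the ids of the k most similar chunks."""
    scored = sorted(indexed, key=lambda p: cosine_similarity(query_emb, p[1]), reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]

index = [("c1", [1.0, 0.0]), ("c2", [0.7, 0.7]), ("c3", [0.0, 1.0])]
print(top_k([1.0, 0.1], index))  # ['c1', 'c2']
```

This brute-force version is fine for a handful of chunks; a managed vector store replaces the linear scan with an indexed search.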
The second part of the flow we can improve is the prompt computation, or as you might say, prompt engineering, though I don't know why we call it engineering, because there's no engineering here; it's just playing with text. So how are we going to improve it? Here's an example of a user prompt, a very simple one, that just tells the model to read the question and answer it based on the given document. You pass the chunks as part of the prompt, and you also pass the question as part of the prompt. So how are we going to train our AI with this prompt? Well, we can do something like this: we can add some examples, where we give some example chunks and show what the format of the answer should look like, for instance how a citation would be displayed. We can also give it some example questions so it knows what to refer to. This is what some call fine-tuning, but it's more accurately few-shot prompting. You can also provide more than one example.
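A few-shot prompt like the one described can be assembled with a simple template. This is a hypothetical sketch (the template text, the bracketed citation format, and the `build_prompt` helper are illustrations, not the speaker's exact prompt):

```python
# A few-shot prompt template: one worked example teaches the model
# the expected answer format, including the chunk-id citation.
FEW_SHOT_TEMPLATE = """Answer the question using ONLY the document chunks below.
Cite the chunk id in square brackets after each claim.

Example:
Chunks: [c1] Refunds are processed within 14 days.
Question: How long do refunds take?
Answer: Refunds are processed within 14 days [c1].

Chunks: {chunks}
Question: {question}
Answer:"""

def build_prompt(chunks: list[tuple[str, str]], question: str) -> str:
    """Render retrieved (chunk_id, text) pairs and the user question
    into the few-shot template."""
    chunk_text = "\n".join(f"[{cid}] {text}" for cid, text in chunks)
    return FEW_SHOT_TEMPLATE.format(chunks=chunk_text, question=question)

prompt = build_prompt([("c7", "Tickets are non-refundable.")], "Can I get a refund?")
print(prompt)
```

Adding a second or third example (more "shots") generally makes the output format more consistent, at the cost of more tokens per request.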
5. Improving User Prompt and Service Generality
To improve the user prompt, you can provide more than one example and support localization by specifying the desired language. However, there is no generic document Q&A service as prompts need to target specific industries with specific formats. To make the service more generic, you can deploy multiple instances for different industries.
You can also provide more than one example. Another way to improve the user prompt is when you need to support localization. Say your documents are in English, but you want the document Q&A to answer in Chinese or Italian. There are several ways to do it. You can do it with the prompt, with a sentence saying, "Always return the answer in this language," and if the GPT model supports the language properly, it will return the right answer in the right language. You can also do other things, like pre-processing the data or the query into the target language, and so on. But the prompt is the easiest way to do localization.
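The prompt-based approach amounts to appending a single instruction to whatever prompt you already built. A minimal sketch (the helper name and wording are illustrative assumptions):

```python
def localized_prompt(base_prompt: str, language: str) -> str:
    """Append a localization instruction so the model answers in
    `language` even when the source documents are in another language."""
    return base_prompt + f"\nAlways return the answer in {language}."

p = localized_prompt("Answer from the document chunks only.", "Italian")
print(p)
```

Whether this works well depends on how strong the model's support for the target language is; for weaker languages, pre-translating the query or documents may give better results.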
Other things to make your service more generic. Well, first, a disclaimer: there is no generic document Q&A service. Every prompt has to target a specific industry, because a financial report is different from a code file, and a code file is different from a sustainability report, and so on. So the prompt has to be tied to a specific industry with a specific format. One way to make it a bit more generic is, when you deploy the service, to create several instances and inject your own prompt into each. For example, you can inject the topic into the prompt and make sure the prompt is tailored per instance. That way you can deploy several instances for several industries. It can cost a bit more, but it can support your customers if that's what's needed.
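Per-instance topic injection can be as simple as reading a deployment-time setting into the system prompt. This is a hypothetical sketch: the `DOC_QA_TOPIC` environment variable and the prompt wording are illustrative assumptions, not part of the talk:

```python
import os

# Hypothetical deployment setting: each instance is deployed with its
# own DOC_QA_TOPIC, e.g. "financial reports" or "sustainability reports".
TOPIC = os.environ.get("DOC_QA_TOPIC", "general documents")

SYSTEM_PROMPT = (
    f"You are an assistant answering questions about {TOPIC}. "
    "Only answer from the provided document chunks."
)

print(SYSTEM_PROMPT)
```

Each industry gets its own instance with a tailored prompt, while the code stays identical across deployments.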
6. Flow and Components for Document Assistance
To parse documents, you can use Azure Document Intelligence for PDFs, or a text splitter for already-structured files. To create chunk embeddings, try text-embedding-ada from OpenAI. Store the embeddings in a search service like Azure AI Search or Pinecone. Split the database into two to avoid a heavy index database. For querying, use OpenAI's text embedding model to create embeddings.
Okay. That's enough about prompts. Now we come to the next section: the flow and the components, and which LLM services we can use for each component.
Okay, let's take a look. The document assistant flow for ingestion. For load-and-parse, we need something able to parse the document, right? If you get a PDF, you have to use some sort of PDF reader, something that takes the PDF and outputs the text structure for you. For that, we can use a service called Azure Document Intelligence, which is very good at parsing PDFs into structured data and tables. But if your documentation is only markdown files, or code files that are already structured, you don't really need Document Intelligence: it costs money and it's also very, very heavy. You can use a text splitter, like the one from LangChain, which is also good for splitting text into chunks.
The other component is creating chunk embeddings. For this one, you can use text-embedding-ada from OpenAI. This model is very good at creating embeddings, and you almost don't have to do any work to create them. After you create the embeddings, you need to store them somewhere. For storage, you put them in a smart search service, which can be Azure AI Search (I have to mention Azure AI Search because I work for Microsoft), but you can also use Pinecone, which is a very good product too, with a vector database where you can save your index. And that's not all. You don't save the chunks together with all the embeddings and metadata inside the index database, because that would make the index database very heavy and you may have to pay more. So I suggest you split the storage into two: one database only for the index and embeddings, and another to save the original chunks, connected to each other by an ID or similar, so we can look the chunks up.
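The two-store split can be sketched with two keyed collections sharing chunk IDs. Plain dicts stand in here for real services (e.g. Azure AI Search or Pinecone for the vector index, blob or document storage for the chunk text); the function names are illustrative:

```python
# Vector index: only id, embedding, and minimal metadata.
vector_index: dict[str, dict] = {}
# Chunk store: id -> full original chunk text, kept out of the index.
chunk_store: dict[str, str] = {}

def ingest_chunk(chunk_id: str, text: str, embedding: list[float], source: str) -> None:
    """Write the embedding to the index and the full text to the chunk store,
    linked by the same id."""
    vector_index[chunk_id] = {"embedding": embedding, "source": source}
    chunk_store[chunk_id] = text

def fetch_chunks(chunk_ids: list[str]) -> list[str]:
    """After the vector search returns matching ids, load the full text."""
    return [chunk_store[cid] for cid in chunk_ids]

ingest_chunk("c1", "Refund policy: refunds within 14 days.", [0.1, 0.9], "policy.pdf")
print(fetch_chunks(["c1"]))
```

The index stays small and cheap to query, and the bulky chunk text is only fetched for the handful of IDs the search actually returns.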
Okay. For querying, the first component creates the embedding; again, we use OpenAI's text embedding model for that.
7. Embedding Models and Flow Optimization
To ensure accurate matching, use the same embedding model for documents and queries. GPT-3.5 Turbo and above gave more reliable results. Use Pinecone or Azure AI Search for finding matching chunks. Use LangChain for easy component orchestration. Remember to update the index with new documents instead of rerunning the whole service. Save minimal metadata with the embeddings in the index DB. Optimize tokens by cleaning up queries and compressing prompts.
We need to use the same model so the search algorithm can find the match, because a different model would compute the embedding differently. As for the GPT model used for summarization based on the chunks: if you want a good result, we did some research on this, and we found that GPT-3.5 Turbo and above gives a much more reliable result.
For searching the matching chunks, again we use the smart search service: Pinecone or Azure AI Search. And lastly, to be able to manage the whole flow without a lot of effort, I would suggest not writing your own pipeline but instead using chaining from LangChain. This way all the components are connected to each other, and you don't have to spend time making one component's output become another component's input; LangChain takes care of that for you. This is what we call orchestration, and LangChain has several different ways to do it, available in the website documentation, so you can check it out. You could also use Semantic Kernel, but it doesn't support JavaScript anyway.
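The core idea of chaining, each step's output feeding the next, can be shown without any framework. This is a plain-Python illustration of what an orchestrator does for you (LangChain adds retries, streaming, tracing, and so on on top; the stub steps here are hypothetical):

```python
from functools import reduce

def chain(*steps):
    """Compose steps so each step's output becomes the next step's input.
    This is the bare skeleton of what chaining frameworks automate."""
    return lambda x: reduce(lambda acc, step: step(acc), steps, x)

# A toy query pipeline: clean -> embed (stub) -> search (stub) -> prompt.
pipeline = chain(
    lambda q: q.strip(),                                       # clean the query
    lambda q: {"query": q, "embedding": [0.0]},                # create embedding (stub)
    lambda s: {**s, "chunks": ["c1"]},                         # vector search (stub)
    lambda s: f"prompt for: {s['query']} with {s['chunks']}",  # build the prompt
)
print(pipeline("  what is the refund policy? "))
```

Each lambda would be replaced by a real component (embedding API call, vector-store query, prompt builder); the composition logic itself stays this simple.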
These are some resources you can use to build the flow with these components. And lastly, rules of thumb. If you run a pre-ingested document Q&A, meaning it's based on existing documents, then when a user adds a new document to the system, you don't want to rerun the whole service again. Create a scheduler or something similar that just updates the existing index with the new document: run the whole ingestion flow again, but only on that one document. Second, always save the minimum metadata together with the embeddings in the index DB, to save yourself some time. And lastly, optimization. Token optimization is crucial, because you don't want to pay extra money for it. So always clean up the query: whitespace, trailing spaces, and so on all count as tokens. One option is to pass the query to the GPT model and ask it to clean it for you, receive the cleaned query, and put it into the prompt; or try to optimize the prompt with some compression algorithm. I think MongoDB had a nice talk about how you can compress this. It will save you a lot of tokens and make your document Q&A service much better.
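The cheapest part of that cleanup needs no model call at all: collapsing the whitespace locally before the query ever reaches the API. A minimal sketch:

```python
import re

def clean_query(query: str) -> str:
    """Collapse runs of whitespace and strip the edges. Every stray
    space and newline still counts toward the token bill."""
    return re.sub(r"\s+", " ", query).strip()

print(clean_query("  What   is\n the   refund policy?  "))
# What is the refund policy?
```

Heavier rewrites (dropping filler words, compressing the prompt itself) are where an LLM pass or a dedicated compression technique comes in, but this local step is free and always worth doing first.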
Testing Embedding Models and AI Validation
Thank you for joining my talk. Embedding models can produce quite different results, so testing different models is important but challenging. Manual testing and automation are used to validate the results, though having the AI test itself can be unreliable. Other questions, and a plug for DataStax's Astra DB.
And that's it. Thank you for joining my talk. Thank you. That's why I said there's a lot of new stuff that we have to pick up to work with this generative AI, so thank you for explaining that.
I wondered about embedding models. Have you found kind of different results using different embedding models at all?
Yes. Actually, in the initial stage we tested it with three different models, and the results were pretty different. I have the slide, but I didn't have enough time, so I didn't show it. How misleading the output is depends on which model you use and which prompt you use. Out of a 40-question data set, it could give you anywhere from 20 correct answers to 29 correct answers, so roughly 50% to 70%, and there was a 10% margin between the different prompts.
So is there a way you can test these different models, to see which one is going to give you the right kind of results for your use case?
Yeah, that's another important aspect: testing, right? So initially we had manual testing, where a human had to validate answers one by one. But then we had to run a 400-question data set for testing, and no one is able to do that. That's a new job for QA, isn't it?
Yeah. The data scientists had to come up with some automation for the LLM, an automated process. They pass the questions to the AI and let the AI actually validate the answers for them and return the result. And then you do the final test with a human.
Right, so the AI is creating it and testing it, and then eventually we have a look. But again, it's like you're asking the AI to test its own work, and then you're still not sure whether it's good or not.
Well at least I don't think they have egos. So if they're like, I'm wrong, then that's fine. That's fine.
Well, if the AI is wrong, that's the problem, right?
Right. We've got some other questions. All right, that's lovely. I have to point out, whilst I'm here, that you mentioned Pinecone and the Azure AI service. I work for DataStax; we also have a vector store that you can use, so check out Astra DB. Quick plug, sorry. Okay, this is an interesting one which I've just lost.
Token Usage and Meta Prompts
Constructing prompts with many tokens can be costly; a meta prompt plus customization can reach 1,500 tokens on its own, so token usage needs optimizing. Cleaning queries can remove unnecessary words. Jesse Hall's talk on optimizing meta prompts is recommended. Building a specific meta prompt is important when the model must behave strictly, but smaller prompts can work too. Azure OpenAI includes a content-safety layer, while other platforms may require additional content validation.
The constructed prompt, there we go: adding the instructions on how to answer uses quite a lot of tokens, once you've retrieved your chunks and you've got a load of stuff. So is there a way to abstract that, to make it smaller so it's not costing you so much money? Tokens are money. Well, that's a tricky question, to be honest. On one side, if you put a lot of things in the prompt, you can guard the AI better.
One time I saw one of our (I can't say which one, because it's work) meta prompts, and the guy who wrote it said it covered all the legal aspects of an AI, but it cost about 900 to 1,000 tokens. And that's just the meta prompt. If you add your own system prompt on top, additional prompts to customize it, it can reach 1,500 tokens just for your own prompt plus the meta prompt. So to optimize this, the only way is to run some kind of cleaning over the query, to remove all the words that don't add meaning. I would suggest everyone check out a talk by Jesse Hall, I think from MongoDB, about how to optimize prompts and meta prompts for better token and cost savings. That talk was very good. Yeah, nice. I think it's here somewhere as well.
Yeah, I guess building a big meta prompt like that is important when you need the model to behave very specifically; if you don't have quite such stringent requirements, you can go smaller. Yeah, I mean, if you don't need to cover everything, like safety, content safety, misleading answers, or the AI becoming very chatty and conversational, then it's okay, you can make it smaller. I actually believe (I've not used it myself) that Azure AI does a lot of work to keep you safer as well; I don't know if that's done with a meta prompt. So yeah, that's the thing. If you use Azure OpenAI, there's already a layer of content safety, so you don't have to deal with it. But if you use another provider, you have to add the content safety yourself: you need to be responsible and write another component to validate that the query is not going to be harmful, and so on.
Creative Uses of LLM and Business Rush
Creative uses of LLMs include generating music and product descriptions from images. A video-analysis project that finds the points relevant to you won a hackathon. Whether businesses are rushing to use AI purely for marketing remains a hard question. LLMs still require observation and improvement for many applications.
I brought that up, that's fine.
That's a good question. What's the most creative use of an LLM that you've come across? Well, a lot. Someone generated music using an LLM. One of my current projects I'm working on is generating product descriptions from images. Basically, I pass an image of the product to GPT, and I want to get back the image description and the text related to that product, so I can just upload it to, I don't know, a store and have it deployed.
Very nice. I assume you could use that to build alt text for images, for better accessibility in applications, as well. Yeah. Also, one of the projects that won our hackathon at Microsoft analyzes a video and gives you the, what do you call it, the points where the talk is relevant to you. Very cool. Yeah. Nice.
Alright. We've got one more question. You spoke about how the LLMs can go wrong and gave that example of Air Canada. Do you think businesses are rushing to use AI purely for marketing purposes?
Sorry? Do you think businesses are rushing to use AI just for marketing purposes? Ah, wow, that's a very hard question. I mean, on one hand, I work for Microsoft, you know? So I have to say: okay. Well, I kind of feel like we still need to observe AI and LLMs. It has a lot of potential. Up until now, it's not yet at production level, except Copilot. Take document Q&A: we still have a lot of things to work on there. And other things too, maybe video generation. So really, it has a lot of potential, but I still don't feel confident saying it's there beyond marketing purposes. I think, from where I've seen things, there are some useful things. I think you're also right: we're very early.
Experimenting with AI and Job Security
Developers should experiment with AI and its potential. AI won't replace jobs for at least 100 years. Using AI can enhance job security. Copilot is useful for code, but not in other areas. Thank you, Maya!
But actually, I'd encourage us as developers to be experimenting with it. Let's not let a marketing or a product person who's gone wild with power decide you can take over. You have to have AI and things right now. Build things yourself with it, and see what it can do. And be the person that can actually suggest those ideas. I think that's, as developers, we owe it to ourselves to kind of experiment with these things and know what we can do with them.
Also, one more thing: it's not here to take over our jobs. Yet. Probably in 100 years. So we don't have to worry about it. There are different estimates on that kind of thing; I'm sure I've seen that we're all done in three to five. I don't know, maybe I'm wrong. I mean, at least not in my lifetime. We have had another one come through; that was the next question: do you think we're all going to lose our jobs? Not yet. Cool. Excellent. Good to know. Yeah, well, and again, another way to not lose your job is to know how to use the AI. So you're the one powering it, until it can take over, and then that's the singularity and it's all over, right? I mean, I don't use Copilot in Office. I have Copilot, I just don't use it. I only use Copilot in code, because that's the only place I really enjoy using it. Otherwise, it's just pretty annoying.
All right. Well, I think that was that. Unless there are any more questions, you've got seconds to put one in. We are out of questions. So thank you very much again, Maya. Please give Maya another round of applause. Thank you. Thank you.