So with that in mind, what I want to show you today is how you can actually set up an LLM, such as an OpenAI model, to chat with your own data.
So, okay. Up until now, this is how we usually communicate with LLMs such as OpenAI's models, Llama, or any other model: a user has a question, puts it to the model, and the model responds. The caveat is that if we have proprietary data, the model unfortunately was not trained on that data and won't know how to answer your question.
I recently saw a very clear example of this. Say you want to ask about some policies in your company, for example how many vacation days you have. You cannot ask a model; you actually have to go to your company's internal guidelines and policies, or to HR, and spend time on all that back and forth. What we can do instead is give the proprietary data to the model as context, so it can help us answer these kinds of questions.
So now, there are a few approaches we can take. First of all, I want to emphasize: when I say proprietary data, some of you probably jump straight to privacy concerns. There are two things to keep in mind here. If you are concerned about the privacy of your data, you can use a model that you host yourself. For example, you can take the Llama 3 models, which are open source, host them on any cloud provider, and then be sure your data never leaves that environment and that architecture, so you have total control and total privacy. For this workshop I use OpenAI just for convenience, because it's already public, it's already there, and I don't have to spend time setting it up. But keep in mind: if you want total privacy, you can host your own model. That said, not all proprietary data is sensitive data. We might have, say, public documentation for an open source project that we can simply feed to a third-party model, so you don't have to worry about this all the time.
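One reason self-hosting is so convenient is that popular self-hosted servers (such as Ollama or vLLM) expose an OpenAI-compatible chat-completions API, so switching between a third-party model and your own is mostly a matter of changing the endpoint URL. This is a minimal sketch; the URLs and the model name `llama3` are illustrative assumptions, not part of the workshop setup.

```python
import json

# Hypothetical endpoints -- the request body is identical either way,
# because self-hosted servers like Ollama expose an OpenAI-compatible API.
OPENAI_URL = "https://api.openai.com/v1/chat/completions"
SELF_HOSTED_URL = "http://localhost:11434/v1/chat/completions"  # assumed local server

def build_request(model: str, question: str) -> str:
    """Build the JSON body for an OpenAI-style chat completion call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }
    return json.dumps(payload)

# Same payload, different host: only the URL and the model name change,
# so your data never has to leave your own environment.
body = build_request("llama3", "How many vacation days do I have?")
print(body)
```

The point of the sketch is the symmetry: the client code stays the same, and privacy becomes purely a deployment decision.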
So, getting back to the presentation and to the approaches themselves: the first thing that comes to mind is to fine-tune the model on the proprietary data. In one sense this is the best thing we can do, because the model would then natively know the data, but unfortunately it is very expensive and requires machine learning expertise. Fine-tuning is actually an art; you have to know what you are doing to do it right. You also need a lot of proprietary data, because the model is huge and already knows a great deal; if you only add a few sentences about a certain topic, they will just get lost in all the data that is already there. So for this kind of application, fine-tuning might not really be a solution. Instead, we can take the naive, brute-force approach and put all the proprietary data into the prompt itself. Before actually asking a question, we tell the model: here is all the data from my company, all the guidelines and all the policies.
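The brute-force approach can be sketched in a few lines: prepend the proprietary documents to the prompt so the model answers from that context. The policy text, the helper name, and the model name in the comment are placeholders for illustration, not real company data.

```python
# Placeholder proprietary data -- in practice this would be your real
# internal guidelines and policies, loaded from files or a wiki export.
POLICY_DOCS = """\
Vacation policy: every employee gets 25 paid vacation days per year.
Remote work policy: up to 3 remote days per week.
"""

def build_messages(question: str) -> list[dict]:
    """Stuff all the company data into the system prompt, then ask."""
    return [
        {"role": "system",
         "content": "Answer using only the company data below.\n\n" + POLICY_DOCS},
        {"role": "user", "content": question},
    ]

messages = build_messages("How many vacation days do I have?")
# With the openai package, these messages would then be sent as e.g.:
# client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(messages[0]["content"][:50])
```

The obvious limitation, which motivates what comes next, is that the prompt (the model's context window) has a finite size, so "all the data from my company" quickly stops fitting.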