You could divide by sentences, by paragraphs, or by characters, deciding how many sentences, paragraphs, or characters go into each chunk. Some of this is very specific to your use case, so you will really need to try and test which one works for you. In this case, I'm using a chunk size of 15,000 characters and a chunk overlap of 150.
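As a rough sketch, here's what that character-based chunking could look like with LangChain's text splitter. The splitter class and the file name are assumptions; any character-based chunker with a size and overlap setting would do.

```python
# A minimal chunking sketch; the splitter choice is an assumption.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=15_000,  # max characters per chunk, as discussed above
    chunk_overlap=150,  # characters shared between consecutive chunks
)

with open("book.txt") as f:  # hypothetical source document
    text = f.read()

chunks = splitter.split_text(text)  # list of overlapping text chunks
print(f"split into {len(chunks)} chunks")
```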
Chunk overlap is crucial: it basically ensures that we don't lose context at the edges of each chunk. As we chunk all of this information, it's important that each chunk carries a small preview of what's next, for example the start of the next chapter if we're talking about books. So overlap is very important to have. What we're doing, then, is going through all of these characters and chunking them. And now that we have these chunks of data, what's next? We're going to go through something called embedding.
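To make the overlap concrete, here's a quick peek at the chunks from the sketch above. Note that a separator-aware splitter keeps the overlap approximate rather than exactly 150 characters.

```python
# The tail of one chunk reappears near the head of the next,
# so no sentence is stranded at a hard cut between chunks.
print(chunks[0][-150:])  # end of the first chunk...
print(chunks[1][:150])   # ...overlaps the start of the second
```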
So embedding is a technique that transforms data, in this case text, into a representation that captures its semantic meaning. Each embedding is essentially a list of numbers, typically several hundred to a few thousand of them, and it positions the text in a high-dimensional space where similar meanings sit closer together. If we want to see it in a very basic example, we can look at this image from Supabase, which maps things in just two dimensions. If I have the sentence "The cat chases the mouse," and the next sentence is very similar to it, they're going to be close together. But another sentence that has nothing to do with it will be far away. And that's just in two dimensions. So we're going to rely on embedding models to transform these chunks of data and place them in that dimensional space, so that later we can retrieve the information based on how semantically close the meanings are.
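As a toy illustration of "closer together," here's cosine similarity over made-up two-dimensional vectors. Real embeddings have hundreds or thousands of dimensions, but the math is the same.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented 2-D positions for the example sentences.
cat_chases_mouse = [0.9, 0.8]     # "The cat chases the mouse"
cat_hunts_rodent = [0.85, 0.9]    # a sentence with a very similar meaning
unrelated_sentence = [-0.7, 0.1]  # something that has nothing to do with it

print(cosine_similarity(cat_chases_mouse, cat_hunts_rodent))    # ~0.99, close
print(cosine_similarity(cat_chases_mouse, unrelated_sentence))  # ~-0.65, far
```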
So as I mentioned, we're going to use an embedding model; in this case, I'm using OpenAI's Ada. I'm going to take each of these chunks and turn it into an embedding. So the 15,000-character chunks I created are transformed into embeddings, which are basically vectors: numerical representations of where each chunk sits in that dimensional space. And now that I have this data, I'm going to need a place to find it later on. This is where vector databases come into play. I will need to store this data in a special type of database, a vector database, and it would look something like this: remember, I have my 15,000-character content, so on one side there's the content itself, and on the other side there's the vector that captures the meaning of that content. This is what we'll use later to retrieve the information with something called similarity search.
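Putting the last two steps together, here's a minimal sketch that reuses chunks and cosine_similarity from the snippets above. The exact model name and the query text are assumptions, and the list of pairs is only a stand-in for a real vector database such as pgvector or Pinecone.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> list[float]:
    # Ada is the model mentioned above; text-embedding-ada-002
    # returns a 1,536-number vector for each input.
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=text,
    )
    return response.data[0].embedding

# Stand-in for a vector database: (content, vector) pairs, side by side.
store = [(chunk, embed(chunk)) for chunk in chunks]

# Similarity search: embed the question, rank chunks by cosine similarity.
question_vector = embed("What does the cat chase?")  # hypothetical query
best_chunk, _ = max(
    store,
    key=lambda pair: cosine_similarity(question_vector, pair[1]),
)
print(best_chunk[:200])  # content of the most semantically similar chunk
```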