Okay? And of course, this is cool, but there are some trade-offs that you should be aware of. First of all, disk usage: the model has to be downloaded and stored locally, so we need to host it on the machine. And the way we implemented this in the demos is not optimal, because while the model is executing, the main thread is blocked. So you'll need to isolate that, for example using a worker thread, so you can keep serving requests while the model is working. And of course, the first load takes some time. These are things you should be aware of, okay?
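To make that concrete, here is a minimal sketch of the worker idea using Node's worker_threads, so inference doesn't block the thread that serves requests. The file names, the model ID, and the message shape are just illustrative assumptions, not the exact code from the demo.

```js
// inference-worker.js — runs the model off the main thread.
import { parentPort } from 'node:worker_threads';
import { pipeline } from '@huggingface/transformers';

// Load the pipeline once; every message after that reuses it.
// The model ID here is only an example.
const generatorPromise = pipeline('text-generation', 'Xenova/distilgpt2');

parentPort.on('message', async ({ prompt }) => {
  const generator = await generatorPromise;
  const output = await generator(prompt, { max_new_tokens: 64 });
  parentPort.postMessage(output);
});
```

```js
// main.js — the main thread stays free to handle other requests.
import { Worker } from 'node:worker_threads';

const worker = new Worker(new URL('./inference-worker.js', import.meta.url));
worker.on('message', (result) => console.log(result));
worker.postMessage({ prompt: 'Write a haiku about local inference.' });
```

In a browser demo the same idea applies with a Web Worker instead of worker_threads.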
But for many applications that run locally, this local inference can be very useful, and it's pretty simple to do. As we saw, we just install the dependency and it works like a charm. All the demos I showed you here were running on CPU, but we know that's not ideal. We want to leverage the GPU, because it handles these models much better; many neural network models, especially the ones behind NLP, benefit from running on the GPU. So we want to use that as well, and you can, actually.
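For reference, the basic flow is roughly this: install the package (for example `npm install @huggingface/transformers`) and call a pipeline. This is a sketch, and the model ID is just an example; by default it runs on CPU.

```js
// Minimal local inference with Transformers.js (CPU by default).
import { pipeline } from '@huggingface/transformers';

// The first call downloads and caches the model; later runs load it from disk.
const generator = await pipeline('text-generation', 'Xenova/distilgpt2');

const output = await generator('Local inference is', { max_new_tokens: 32 });
console.log(output[0].generated_text);
```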
Transformers.js already works well with the WebGPU API, so if you're running in a browser, just set the device to WebGPU and it's going to work. If you're working in Node.js, it also works, but it's still experimental, so you need to install the preview release of Transformers.js to get that capability, and then you'll be able to run on the GPU. So, for example, if I take that same text-generation task, I can get it running on the GPU. The code looks the same, but now it executes on a different chip, which is much better suited for it. As you can see in the demo, on the left I have the execution of the LLM, and on the right the GPU history graph is growing, so the model is loading onto the GPU and using that chip.
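The switch is basically one option on the pipeline. Again a sketch, with an illustrative model ID; in the browser this works with the stable package, while in Node.js WebGPU support is still experimental, so you may need the preview release.

```js
// Same pipeline as before, but asking Transformers.js to use WebGPU.
import { pipeline } from '@huggingface/transformers';

const generator = await pipeline('text-generation', 'Xenova/distilgpt2', {
  device: 'webgpu', // run on the GPU instead of the default CPU backend
});

const output = await generator('Running on the GPU means', { max_new_tokens: 32 });
console.log(output[0].generated_text);
```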