And then you write your JavaScript code, so in this case, just a few lines of code to run Whisper. Behind the scenes, it uses ONNX Runtime to run your model with either WebAssembly on CPU, WebGPU on GPU, or WebNN on CPU, GPU, or NPU. So let's see how you would start adding ML to your web applications. It's just a few lines of code: number one, you import the library; number two, you create a pipeline; and number three, you run it on some input. You're also able to specify a custom model: as the second parameter to the pipeline function, you pass the model you'd like to use. And then there are some other options. In the first step, you have the initialization parameters, where you can specify whether you want to run on GPU (so WebGPU), as well as quantization settings like q4 or fp16. And in the second step, you add the runtime parameters, like the maximum number of tokens to generate.
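As a rough sketch of what those three steps look like with the Transformers.js pipeline API (the model ID and exact option values here are illustrative):

```js
// 1. Import the library
import { pipeline } from "@huggingface/transformers";

// 2. Create a pipeline, optionally passing a custom model as the
//    second parameter, plus initialization parameters like the
//    device and quantization settings.
const transcriber = await pipeline(
  "automatic-speech-recognition",
  "onnx-community/whisper-tiny.en", // illustrative model ID
  { device: "webgpu", dtype: "q4" },
);

// 3. Run it on some input, with runtime parameters like the
//    maximum number of tokens to generate.
const output = await transcriber("audio.wav", { max_new_tokens: 64 });
console.log(output.text);
```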
A quick overview of what WebGPU is: it's meant to be the successor to WebGL, and it's a new web standard, a modern API for accelerated graphics and compute. More importantly, it's a general-purpose API, which means we can run machine learning operations with it, which is really, really important. You can enable WebGPU support by just specifying the device as WebGPU. It has somewhat limited availability, as I was mentioning earlier, but we hope to see browsers move towards better support in future. And then WebNN as well: you can also specify the device as WebNN, and if you want NPU, GPU, or CPU, you can specify that too. Then a quick slide on quantization, because it's very important: browsers are extremely resource-limited, so we encourage users to quantize their models, reducing precision in exchange for lower resource consumption, lower bandwidth for the user, since they only need to download the model once, and lower memory usage at runtime.
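To give a sense of how the device and quantization options fit together, here's a sketch, assuming the device strings and dtype values supported by recent Transformers.js releases (the model ID is illustrative):

```js
import { pipeline } from "@huggingface/transformers";

// Run on GPU via WebGPU, with 4-bit quantized weights:
const extractor = await pipeline(
  "feature-extraction",
  "Xenova/all-MiniLM-L6-v2", // illustrative model ID
  { device: "webgpu", dtype: "q4" },
);

// Or target WebNN, optionally pinning a specific processor:
//   device: "webnn"      — let the browser pick
//   device: "webnn-npu"  — NPU
//   device: "webnn-gpu"  — GPU
//   device: "webnn-cpu"  — CPU
const extractorNPU = await pipeline(
  "feature-extraction",
  "Xenova/all-MiniLM-L6-v2",
  { device: "webnn-npu", dtype: "fp16" },
);
```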
We also expose lower-level APIs for users who would like a little more control, for example to build a segmentation demo like this one. Some factors to consider as you develop for the web: bandwidth is important, since the user needs to download the model once, so encourage them to choose models that can run on the target hardware; accuracy versus speed, meaning what level of quantization you're going to use and what latency and precision are required; device features, meaning which browser APIs are required, with microphone input and WebGPU support being among the top ones; and target devices, whether you're building for mobile, desktop, or anything in between. Then just a quick run-through of some applications you can build, with a sketch of the first one below. Of course, privacy-focused chatbots: being able to run LLMs in the browser is maybe not new now, but the performance we're able to achieve is quite remarkable, especially on WebGPU. In this case, a 4B model running at around 90 tokens a second on an RTX 4090, and then on this Mac here, a 1.7B model at around 130 tokens per second, which is really great to see. Really major improvements, and major improvements still coming; with native WebGPU, I think there are a lot of improvements we can still make.
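As a minimal sketch of that in-browser chatbot use case (the model ID is illustrative, and the streaming part assumes the TextStreamer utility that Transformers.js exports):

```js
import { pipeline, TextStreamer } from "@huggingface/transformers";

// Load a small instruction-tuned LLM entirely in the browser,
// running on WebGPU with 4-bit quantized weights.
const generator = await pipeline(
  "text-generation",
  "onnx-community/Qwen2.5-0.5B-Instruct", // illustrative model ID
  { device: "webgpu", dtype: "q4" },
);

// Stream tokens to the page as they are generated, instead of
// waiting for the full response.
const streamer = new TextStreamer(generator.tokenizer, {
  skip_prompt: true,
  callback_function: (text) => console.log(text),
});

const messages = [
  { role: "user", content: "Explain WebGPU in one sentence." },
];
await generator(messages, { max_new_tokens: 128, streamer });
```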