The recommended way is WebGPU. WebGPU is a browser API that lets JavaScript execute operations directly on the GPU. And then the dtype is basically the quantization, so the compression that you want to use with your model. In this case we're using full-precision fp32, but it could also be quantized to 8-bit, 4-bit, whatever you want.
And then you have the pipe. You take any input, run it through the model, and you get the output. The input and output types depend heavily on the task you're going to use.
OK, let's see how we can solve my problems with Transformers.js. Let's start with listening and talking. For the first part, we have a task called automatic speech recognition, which is basically transcribing audio into text.
So we can use the Whisper Tiny model, a very small transcription model from OpenAI. We create the transcriber, take any audio input, run it through, and we get back the text output.
On the other end, we can use Kokoro. Kokoro is a very small model that synthesizes text into speech. To run it, we can use a library called Kokoro.js, a small abstraction on top of Transformers.js. It was also written by Joshua, so the API is very similar.
What we can do here is take any text, define a voice, and it will return the audio. OK, let's try that. I have a little pipeline that records what I say, transcribes it, synthesizes it, and plays it back.
So I can say, hello, how are you? And if the audio works, hello, how are you? Perfect. So we now have an end-to-end pipeline, but the problem is, it's not actually intelligent. So how can we add intelligence to that pipeline? The closest thing we have to intelligence right now is large language models.
Now, there are quite a lot of large language models out there. Most of us use models in this upper left corner. We have GPT, we have Gemini, we have Claude. But those are closed-source models.