Motion Control With Multimodal LLMs


What if you could use multimodal LLMs to interact with websites or IoT devices using motion control?

As advancements in multimodal AI offer new opportunities to push the boundaries of what can be done with this technology, I started wondering how it could be leveraged from the perspective of human-computer interaction.

In this talk, I will take you through my research experimenting with building motion-controlled prototypes using LLMs in JavaScript.

This talk was presented at JSNation 2025. Check out the latest edition of this JavaScript conference.

FAQ

The talk was about motion control with multimodal AI.

The speaker is a Senior Research Engineer at CrowdStrike, known as DevDevCharlie online, with a background in machine learning, particularly using TensorFlow.js.

The speaker enjoys diving, running, hiking, playing drums, learning German, and has also obtained a radio license.

PoseNet and MoveNet are models used for pose detection, providing key points of a person's body, which can be used for applications like motion-based games.

The speaker uses hand gestures to interact with interfaces, such as controlling lights or playing games, by detecting key points on the hands using models like PoseNet.

Gemini is used for gesture recognition and function calling in the speaker's experiments, allowing for the control of devices like lights through hand gestures.

The speaker faces challenges in accurately detecting right from left gestures and ensuring the AI correctly interprets the intended commands.

The speaker is interested in exploring how LLMs can be used for motion control experiences, aiming to create more intuitive interactions with technology using gestures.

The speaker envisions using motion control technology for home automation, where AI learns user behaviors and automates tasks through gestures and acoustic activity recognition.

The speaker uses tools like TensorFlow.js for key point detection and DataStax for vector databases, along with Gemini for multimodal AI experiments.

Charlie Gerard
39 min
12 Jun, 2025
Video Summary and Transcription
The Talk delves into motion control with multimodal AI, exploring TensorFlow.js models for gesture recognition and enhancing user interactions. It discusses leveraging LLMs for gesture-based interaction, investigating Gemini for gesture recognition, and controlling light states with Gemini functions. The conversation includes webcam-based gesture recognition, custom gesture databases, and the future of personalized AI assistance with acoustic recognition.

1. Introduction to Motion Control with Multimodal AI

Short description:

Welcome to a talk on motion control with multimodal AI. The speaker is a senior research engineer at CrowdStrike, known online as DevDevCharlie. With a background in machine learning, particularly with TensorFlow.js, the focus is on recent advancements in multimodal AI. The speaker is a self-proclaimed creative technologist, exploring the possibilities of JavaScript and the web platform.

Thank you. And welcome to my talk about motion control with multimodal AI. I gave a longer version of this talk recently at another conference, and a colleague of mine watched the recording and she was like, oh, it's like a magic show. So hopefully if everything works well, it will feel maybe like a magic show, but then you'll also understand how it actually is all built. So I was briefly introduced. I'm going to go quickly over this. So yes, I'm a senior research engineer at CrowdStrike. I go by DevDevCharlie online usually. I'm an author, master's instructor. So I've been doing kind of things with machine learning on the web for about eight years now, but primarily using TensorFlow.js before, and this is going to move on to the more recent advancements with multimodal AI. Overall, I guess I'm a self-proclaimed creative technologist, so I like to push the boundaries of what can be done with JavaScript and the web platform, and sometimes try to use tools to kind of make it do maybe what it wasn't necessarily built to do. And outside of tech, I've been spending a big part of the year trying to have hobbies that are non-tech related. It includes diving, running, hiking, playing drums, learning German, and I also got my radio license earlier this year. It's a very niche hobby, so I don't know if anybody here knows what it is, but in case you do, my call sign is ko6hpr if one day you hear me on the radio.

2. Exploring TensorFlow.js Models for Motion Control

Short description:

Discussing previous experiments with TensorFlow.js models like PoseNet and MoveNet for pose detection. Exploring the use of key points data and building interactive experiences with motion control. Augmenting tools with motion detection for enhanced user interactions.

But let's start by talking about previous experiments. So when I introduced myself, I just talked about TensorFlow.js, and I want to cover a little bit the things that can be done with that tool, so then you'll understand a little bit why I'm also experimenting with this with multimodal AI. So there's a few different models that you can use with TensorFlow.js, and one of them is about pose detection. It's called PoseNet or MoveNet. There's a second one as well. And usually, you get key points. So forget about the green lines, it's like the red dots. Depending on the model, you get a different amount of key points, and these key points are raw data, so it's coordinates, x and y coordinates relative to the screen.
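To make the raw key point idea concrete, here is a minimal sketch using the @tensorflow-models/pose-detection package with MoveNet; the element selection and setup are assumptions, not the code from the original demos.

```js
import '@tensorflow/tfjs-backend-webgl';
import * as poseDetection from '@tensorflow-models/pose-detection';

// Assumes a <video> element already streaming the webcam.
const video = document.querySelector('video');

const detector = await poseDetection.createDetector(
  poseDetection.SupportedModels.MoveNet
);

async function detectPoses() {
  const poses = await detector.estimatePoses(video);
  if (poses.length > 0) {
    // Each key point is raw data: x/y coordinates relative to the video frame,
    // plus a name ("left_wrist", "right_wrist", ...) and a confidence score.
    poses[0].keypoints.forEach(({ name, x, y, score }) => {
      console.log(name, x, y, score);
    });
  }
  requestAnimationFrame(detectPoses);
}

detectPoses();
```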

And with this data that you get, you can build something like this. So this is a clone of Red Light, Green Light, the game from Squid Game, so if you have not watched this series, basically, you have a doll, and when it looks at you, you're supposed to like not move, and when the head is turned, you're supposed to like, in this case, run as close to the screen as you can. Otherwise, like, you die. Basically, if you move when the doll is looking at you, you die. And I wanted to recreate something using PoseNet, and then you start thinking, well, how do I actually code, like, the fact that I'm not moving or moving?

Another model that you can use is one that is specifically around key points on the hands. So right hand and left hand, and here I think you have about 21 key points, and you can build something like this. So I started thinking, well, what if you could augment the tools that you already use but add some kind of motion detection to it as well? So that was interacting with Figma, so it's not necessarily that all of a sudden you will build entire interfaces just with your fingers, but what if you just were augmenting the type of things that you can do?

3. Enhancing User Interactions with Gesture Control

Short description:

Utilizing hand gestures for VR interactions, exploring motion control with key points data on hands, and integrating motion detection into existing tools. Experimenting with gesture-based interactions for manipulating shapes and zoom functionalities. Enhancing user experience with gaze detection and reimagining user interfaces with limited inputs for efficient interactions.

So in this other example, it was a clone of Beat Saber, where originally the UI was an open source repo that was on GitHub, and it was supposed to be connected to a headset by, like, using WebVR, and you were supposed to be able to connect this app to joysticks using JavaScript, but I didn't have a VR headset, and I was like, I have hands, so I can just, like, play Beat Saber with my hands. So I basically used two of the key points from PoseNet, so my right wrist and my left wrist, and then I kind of used these directions or these coordinates, translated them to the VR world, and was able to kind of, like, play Beat Saber with my hands.


In the Figma example, I think I was using just, like, a pinch gesture to be able to drag shapes around, and then here in this GIF, it's, like, basically when I move my hands like this, but when I do, like, a palm gesture, then this is a zoom, and you would also have to write that logic yourself. It's, like, what's even a palm gesture versus, like, a fist or something? And then you measure the distance between your hands and you can, you know, do whatever you want with it really.
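As a rough sketch of that kind of logic with the 21 hand key points (the key point names come from the TensorFlow.js hand-pose-detection model; the thresholds and gesture definitions are my own assumptions, not the demo's code):

```js
import '@tensorflow/tfjs-backend-webgl';
import * as handPoseDetection from '@tensorflow-models/hand-pose-detection';

const detector = await handPoseDetection.createDetector(
  handPoseDetection.SupportedModels.MediaPipeHands,
  { runtime: 'tfjs' }
);

const distance = (a, b) => Math.hypot(a.x - b.x, a.y - b.y);

async function detectGesture(video) {
  const hands = await detector.estimateHands(video);
  if (hands.length === 0) return;

  const keypoints = hands[0].keypoints;
  const thumbTip = keypoints.find((k) => k.name === 'thumb_tip');
  const indexTip = keypoints.find((k) => k.name === 'index_finger_tip');

  // "Pinch": thumb tip and index tip close together (the 25px threshold is arbitrary).
  if (distance(thumbTip, indexTip) < 25) {
    console.log('pinch: drag the selected shape');
  }

  // With two hands detected, the distance between the wrists can drive a zoom level.
  if (hands.length === 2) {
    const [wristA, wristB] = hands.map((h) =>
      h.keypoints.find((k) => k.name === 'wrist')
    );
    console.log('zoom amount candidate:', distance(wristA, wristB));
  }
}
```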

And finally, another model that you can use is around more face detection key points. This one has more than, I think, 450 key points all around the head, and what I focused on specifically was building a tool for gaze detection. And with this, you end up having even more questions, not only technically how do you build this, but even in terms of human-computer interaction and user experience. If you only have two inputs, left or right, how do you write something? And here it's kind of like reimagining the keyboard as well. So, you know, if you have a keyboard and you want to select the very last letter, you could technically, as you're looking right, select, you know, go letter by letter, but that would take a while.

4. Redefining Coding Interfaces with Minimal Inputs

Short description:

Redefining keyboard interactions for efficient coding with minimal inputs. Exploring gesture-based coding and interface enhancements with motion control. Addressing challenges in defining gestures for consistent and universal understanding.

So then you could rethink the keyboard and see, okay, if I split the keyboard into two lists of letters, and let's say I want to write the letter W and it happens to be in the right column, then you look right quickly and then it splits that column again, and you end up with kind of like a binary search, and you end up, you know, being able to select a letter that's like far on the keyboard but faster. And then you would select it with like blinking, for example. But with this same concept, then you can start thinking, well, what would it mean to code using only two inputs, you know, left or right? And it's kind of like a nice exercise to be thinking about our interfaces and the way that we actually interact with our tools in general. And again, it's not necessarily that you would have to replace writing entirely with your eye movements, but even in terms of thinking about what it means to code, now we can ask, you know, whatever AI tool to spit out a good code sample. But before, you'd be like, okay, when you code, there's only a certain amount of things that you can do or there's only a certain amount of ways to define a function. So you would kind of have a list of the available things that you can do and then you could also have like a tool where, again, like the keyboard before, you would look at sides of the screen to select what you want to do. And then you can use like snippets in VS Code, especially for React, where if you want to create a function, then you would look to the side where the function option is. And then, you know, when you blink, it would almost like simulate that you're pressing tab and then it would write the component. So this works really well. Like it's kind of fun to be working with these coordinates. But at the same time it means that everything that I'm doing is that I as an engineer have to define what these gestures are. So for example, if you're using the hand and you do a thumbs up, it's like, well, how do you code what a thumbs up is? If, for example, you were saying, oh, well, if I don't see the tips of my fingers because they're folded and let's say if the angle between my thumb and the rest of my hand is, I don't know, 90 degrees, then it's a thumbs up. But then what happens if a user is tilting their thumb a little bit more? All of a sudden it's not 90 degrees, even though you know I'm still doing a thumbs up. So it becomes a little bit more difficult to have something that can be shared by a lot of people.
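A minimal sketch of that binary-split selection idea, using only "left"/"right" as inputs (my own illustration of the concept, not the code from the demo):

```js
// Letters to choose from; each left/right input halves the remaining set,
// so any letter is reachable in about log2(26) ≈ 5 looks.
let candidates = 'abcdefghijklmnopqrstuvwxyz'.split('');

// direction is "left" or "right", e.g. the output of the gaze detection.
function onGaze(direction) {
  const middle = Math.ceil(candidates.length / 2);
  candidates =
    direction === 'left' ? candidates.slice(0, middle) : candidates.slice(middle);

  if (candidates.length === 1) {
    console.log('selected letter:', candidates[0]);
  }
}

// Example: selecting "w" takes four looks.
['right', 'right', 'left', 'right'].forEach(onGaze);
```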

5. Leveraging LLMs for Gesture-Based Interaction

Short description:

Exploring the use of LLMs for motion control and gesture-based interaction, drawing inspiration from the Room E Project by Jared Ficklin using Kinect sensors and gestures to control devices. Transitioning from Kinect to browser-based solutions in JavaScript to simplify interactions and enhance AI understanding of user gestures and commands.

So then I started wondering, okay, can LLMs be used for motion control? So I know that when you think about LLMs, you think about language, so text. But let's say more recently we're talking about multimodal AI. Actually that's not what I'm going to show. And so, yeah, it's like LLMs or multimodal LLMs, multimodal AI. It's like feeding or augmenting what you're giving the LLM, and we're starting with pictures. But the inspiration for what I'm going to show in this talk is something like this. It was called Room E Project by Jared Ficklin. It all goes all the way back to 2013. So I've been wanting to do this for 12 years.

And in this example here, there was actually a Kinect sensor. It's basically like a depth sensor. And in his example, he was looking at how tech could be added to almost like room-sized computing, and you would be able to communicate with devices with voice and gestures. So in this example, there's two lights, and he was pointing at the light, and he was like, turn this light on, and then turn that light on. And it would kind of understand, based on gestures and what you're saying, which light to turn on. And it would understand the commands turn on or turn off. But in this example, again, it's using a Kinect.

And I think they don't even sell the Kinect anymore. So I wanted to see if we could do things with what's in the browser and all in JavaScript and something kind of like similar. So hopefully, if the demos go well, we should see that it is kind of getting closer to something that's possible. But again, it means that hopefully as well, there's not that much code involved. And now that people are getting used to interacting with AI anyway, we could get to a point where things like this would require maybe just a camera, and the rest, like AI, would kind of understand what you're trying to do, or it would learn about your gestures and the way you want to interact with interfaces rather than as an engineer, you would create an app and you say, oh, the only gesture you can use to turn the light on is pointing. And what if I don't want to point?

6. Investigating LLMs for Motion Control Experiences

Short description:

Exploring the integration of LLMs for motion control experiences, focusing on leveraging Gemini for hand gesture-controlled interactions and researching various approaches, including multimodal AI and TensorFlow.js integration, towards achieving efficient motion control.

So, yeah, I mentioned multimodal AI. There was a few talks about AI today. So you probably know by now. But quickly, it's a machine learning system that can process multiple types of data, such as text, images, and sound. So here, because we're talking about gestures and we're going to use the webcam, it's mostly images that are then, you know, they're going to be a label that's going to be then fed into an LLM, and this is where the LLM comes in and the text input.

The purpose of this is to research how or if LLMs can be leveraged to create motion control experiences. And usually, when I do personal research, I also set myself a specific goal because you can go on forever in research and then, you know, at some point, I kind of want to move on to something else. So my goal is to control a website or an IoT device with hand movements using Gemini. The only reason why I'm focusing on Gemini specifically is because it's free, and I'm doing this not as part of my job. I'm not paid to do this. So it would be interesting to kind of switch and see if using other models would be more or less accurate or, you know, maybe there's different things that you can do.

But in this talk, it's going to be Gemini specifically. So there's a few different approaches that you can take or at least that I decided to take when going through this. So first of all, I was thinking, if we rely only on multimodal AI, like, you know, first, let's try with that. So Gemini takes screenshots and it returns its best guess for the hand gesture that I'm executing and then we can try to trigger stuff.

7. Exploring Gemini for Gesture Recognition

Short description:

Exploring different approaches with Gemini for gesture-based interactions and privacy considerations in using personal images for recognition. Utilizing Gemini for function calling and LLM reliance, ongoing research on a custom gesture recognition system, and configuring Gemini 2.0 Flash experimental for motion control experiments.

The other approach is like, well, if I know TensorFlow.js, can I start to rely on multimodal AI and then augment it with more, like, fine-grained data and gestures from TensorFlow.js? Okay. That would be, like, you know, the second approach. And then the third one would be to do it all, like, DIY, or at least the gesture recognition, where I would take screenshots from the webcam, convert them into vector embeddings in JavaScript, save them in a custom vector DB, and then use Gemini only for the function calling. So if you're thinking maybe about privacy, when I'm giving images of myself to Gemini, I don't know where it goes. I don't know, you know, if it's used to train something else. And maybe you don't want that. So if you are creating your own custom system, you could still use Gemini. When the label of your gesture is returned from your vector DB, you could then just rely on the LLM, like Gemini; you wouldn't have to augment it with images and stuff. Full disclosure, this third approach, I'm still working on it, so I'm not going to be able to demo it today, but I think it's something that can work. It's ongoing research, so I wanted to mention it. Gesture detection with Gemini. I used their starter repo, they have something called the Live API Web Console, a starter repo in React that sets up the skeleton of how to use multimodal AI with Gemini. My research focus is on the motion control part, so I didn't want to build something from scratch. If you're interested, and any code sample that I'm going to show later goes by too fast or isn't enough, there's a starter repo to start from. Showing a few examples. At first, you have to start by configuring the model. I've been using Gemini 2.0 Flash experimental. It's kind of an old one, as of I think two weeks ago. I think that now there's a new one that you can use, and I haven't tried it. I'd be curious to know if it makes the predictions better or faster or anything. For my experiments that I'm going to show you, this is the one that I'm using. Then you can set certain configs, tell the model if you want responses in text format or if you want it to talk to you. Hopefully, if everything is set up properly, you'll also hear the AI talk to me. The most important part is your system prompt. For my first experiment, I wrote something quite simple. You're my helpful assistant, looking only at the hand in the stream from the webcam, make your best guess about what the gesture means, don't ask for additional information, just make your best judgment. So we're going to start with that one.
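For reference, a configuration along those lines might look roughly like this inside the Live API Web Console starter; the import path, hook name, and config field names are recalled from the starter repo and should be treated as approximations rather than the exact code from the talk.

```js
// Inside a component of the Live API Web Console starter. Field names and the
// import path are recalled from the starter repo and may need adjusting.
import { useEffect } from 'react';
import { useLiveAPIContext } from '../contexts/LiveAPIContext';

export function GestureAssistant() {
  const { setConfig } = useLiveAPIContext();

  useEffect(() => {
    setConfig({
      model: 'models/gemini-2.0-flash-exp',
      generationConfig: {
        // "audio" makes the model talk back; "text" returns written responses.
        responseModalities: 'audio',
      },
      systemInstruction: {
        parts: [
          {
            text:
              'You are my helpful assistant. Looking only at the hand in the ' +
              'stream from the webcam, make your best guess about what the ' +
              "gesture means. Don't ask for additional information, just make " +
              'your best judgment.',
          },
        ],
      },
    });
  }, [setConfig]);

  return null;
}
```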

8. Webcam-Based Gesture Recognition with Gemini

Short description:

Exploring the system prompt and gesture recognition with Gemini for webcam-based experiments, including triggering functionality with a specific word, observing the model's output, and experimenting with different gestures for labeling.


Okay. So here hopefully that should be zoomed in. So what's going to happen? I'm going to tell you what's going to happen, so then if it doesn't work, you'll know. I won't be able to hide. But on the left column here, there's going to be basically the output of the model and you should see, so in the center I'm going to have the webcam, I'm going to make a gesture in front of the camera and you should see the label of that gesture answered by Gemini. And also just to be precise, I'm going to say a trigger word to start the functionality.

In the end, what I would love is for it to work without having to talk to it, or you could programmatically send a start input, but I found that it was quite messy when I was doing this. So I'm just going to say the word now, but it could be anything. I could say beep boop or whatever. It's just a start command. So camera is on. I should be all connected. So now, I don't know if it was too small or something, but it said thumbs up and peace sign. I'm not going to show too many of them, but then you can experiment being like, oh, what can Gemini do by default? Not that. Don't look at the mess on my screen. I'm going to just do that. What was cool here is that the only prompt that I said was just look at what you see and just give me a label.

9. Controlling Light States with Gemini Functions

Short description:

Exploring gesture variations and function calling with Gemini for controlling light states through defined function declarations and API interactions using LIFX light bulbs.

And there's a few gestures. I think it could also do thumbs down, or when I do a stop, then it comes back with the label stop. So I haven't tried with so many gestures, but I'm like, okay, if we only want to rely on Gemini, we can get some kind of labels. That's cool, but we want to do something with this gesture or this recognition.

So now we're moving on to function calling because we want something to happen. And here we have to start by declaring some function declarations, which are schemas of, like, the way you're going to define or how your function is going to be kind of set up. So I have a function here. The name is going to be toggle light. I'm setting a few parameters and I'm saying that this function should have a parameter that's of type object. And for the properties, it will have a property called on that will be a Boolean, and I'm just describing that this Boolean is going to be the light state, and I make it required.

So Gemini should not call this function without the on property that should be a Boolean. And then here that's like more classic front end, but just so you know there's not really any magic there. I'm just sending, you know, if the on property is true, then I'm sending to the API the string on or off. I'm using the LIFX light bulbs. At home I also have the Philips Hue light bulbs, but you have to connect them with a bridge to your router and obviously I can't do that if I'm speaking at a conference.
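Sketched out, the declaration plus the "classic front end" part might look something like this; the function and property names follow the description above, the SchemaType enum is the one used in the starter repo, and the LIFX call uses their public HTTP API with a placeholder token.

```js
import { SchemaType } from '@google/generative-ai';

// Schema Gemini uses to know how and when to call the function.
const toggleLightDeclaration = {
  name: 'toggle_light',
  description: 'Turns the light on or off.',
  parameters: {
    type: SchemaType.OBJECT,
    properties: {
      on: {
        type: SchemaType.BOOLEAN,
        description: 'The light state as a boolean: true for on, false for off.',
      },
    },
    required: ['on'],
  },
};

// Classic front-end code: ping the LIFX HTTP API with "on" or "off".
async function toggleLight({ on }) {
  await fetch('https://api.lifx.com/v1/lights/all/state', {
    method: 'PUT',
    headers: {
      // Placeholder env variable for the LIFX personal access token.
      Authorization: `Bearer ${process.env.REACT_APP_LIFX_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ power: on ? 'on' : 'off' }),
  });
}
```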

10. Interacting with Gemini and WiFi Light Bulbs

Short description:

Exploring WiFi-connected light bulbs and prompting Gemini to toggle the light based on gestures without explicit instructions on the light state.

So these ones are really cool. As long as you're on the same network, everything, so I'm hotspotting off my phone, it works. And it has like little Wi-Fi chips and stuff in the light bulbs themselves. So here you just ping the API like you would do in your day-to-day job probably. And okay, one important thing as well is that I have to update my prompt to the model, right? So where my first prompt just had tell me the gesture, now I have to say call the toggle light function provided. So once you have the gesture, call the toggle light. But I'm kind of not saying anything else. And I'm just passing the function declarations after.
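In the starter's config, that roughly translates to updating the system prompt and passing the declarations as tools; again a sketch, reusing setConfig from the Live API context and the toggleLightDeclaration from the previous snippet.

```js
setConfig({
  model: 'models/gemini-2.0-flash-exp',
  systemInstruction: {
    parts: [
      {
        text:
          'Make your best guess about what the hand gesture in the webcam ' +
          'stream means, then call the toggle_light function provided.',
      },
    ],
  },
  // The schema from the previous snippet; Gemini decides when to call it
  // and fills in the "on" property on its own.
  tools: [{ functionDeclarations: [toggleLightDeclaration] }],
});
```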

So I also wanted to experiment. What can I do with like a very small prompt? I don't have to like write super big stuff. So let's see if that works. It should. This one should. I know there's a demo that's probably going to fail, but we'll see. Okay, so what I'm going to do, I'm going to, okay. Should I tell you before or after? I'll tell you after. So okay. Now. Did it do it? Oh, okay. Now. Did it do it? Now. Okay. That's cool. I'm actually surprised because when I rehearsed earlier, it was a bit of a problem. So what's interesting in that one is that I did not tell Gemini which gesture I was going to make, and I did not tell it that thumbs up means on and thumbs down means off. I just said make a guess about the gesture, and then call the toggle light function. I didn't even say with the on property, because that's defined in the schema. So what was interesting here is that a thumbs up is usually a positive gesture. So then again, I hate talking about AI in like human words, but it understood that I meant that, you know, it's a positive gesture, so the light should be on. It's a negative gesture, so the light should be off.

11. Enabling Multi-Light Control with ID and Position

Short description:

Exploring the setup for controlling multiple lights with specific IDs and positions to enable gesture-based light toggling.

Cool. Okay. So far so good. But what if I have multiple lights? At home, you probably don't have only one light. You might have multiple ones. So then, okay. What do I need to do then if I want to move on to being able to control multiple things? So this is where I move on to multi-turn function calling. I saw it recently called compositional function calling. There must be a difference somewhere, I don't know. But if you come across these terms, that just means calling multiple functions.

The first step that we have to take is in our toggle light function in the schema, we have to add a light ID, right? Because obviously, to know which one you're going to be able to toggle on, it's going to be either one or the other. So you need to pass in a light ID. That's going to be a number. The description is the ID of the light to toggle, returned by the get light function. And here, we have to add another object to our schema. So we're going to add a function that's called get light.

And I'm going to add a property that's going to be called details, basically. And here, I'm saying that it's a string, and the description that I gave it is the detail about the light, usually a position. And the reason why I specifically wanted to use a position in my experiment is because I don't know the IDs of my lights by heart. Even if I was talking to the AI instead of making gestures, I'm not going to say turn light ABC123, right? So I want to be able to say turn the right light on or turn the left light on.
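The two schemas for this multi-function setup might look something like the following sketch; the wording of the descriptions paraphrases the talk, and the light ID is typed as a string here even though the talk describes it as a number, since LIFX IDs are alphanumeric.

```js
import { SchemaType } from '@google/generative-ai';

const getLightDeclaration = {
  name: 'get_light',
  description: 'Finds a connected light and returns its ID.',
  parameters: {
    type: SchemaType.OBJECT,
    properties: {
      details: {
        type: SchemaType.STRING,
        description: 'Details about the light, usually a position such as "left" or "right".',
      },
    },
    required: ['details'],
  },
};

const toggleLightDeclaration = {
  name: 'toggle_light',
  description: 'Turns a specific light on or off.',
  parameters: {
    type: SchemaType.OBJECT,
    properties: {
      lightId: {
        // Described as a number in the talk; LIFX IDs are alphanumeric strings.
        type: SchemaType.STRING,
        description: 'The ID of the light to toggle, returned by the get_light function.',
      },
      on: {
        type: SchemaType.BOOLEAN,
        description: 'The light state as a boolean.',
      },
    },
    required: ['lightId', 'on'],
  },
};
```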

12. Custom Light Names and Gestural Control Logic

Short description:

Setting custom names for lights, gestural control, and function calling logic based on prompts for multi-light interactions.

And if you have any of those lights at home, you know that usually when you're setting them up on the app, you can set a custom name. So obviously the light on the right is called right light, and the light on the left is called left light. So what I want to try to do is point in different directions. It should understand the way that I'm pointing, either left or right, and then that will be passed to the get light function, and it will filter through the names of my connected lights and then return the ID and pass it to the other function.
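A get light implementation along those lines could list the LIFX lights and match on the label; the label matching below is an assumption about how that filtering might be done.

```js
// Lists the connected LIFX lights and returns the ID of the one whose label
// matches the position Gemini extracted ("left" or "right").
async function getLight({ details }) {
  const response = await fetch('https://api.lifx.com/v1/lights/all', {
    headers: { Authorization: `Bearer ${process.env.REACT_APP_LIFX_TOKEN}` },
  });
  const lights = await response.json();

  // The bulbs were named "right light" and "left light" in the LIFX app,
  // so a simple case-insensitive match on the label is enough here.
  const match = lights.find((light) =>
    light.label.toLowerCase().includes(details.toLowerCase())
  );

  return { lightId: match ? match.id : null };
}
```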

So I have to update the prompt again because now I have multiple lights. I'm saying look at which hand is making a gesture, right or left. I probably didn't need to add that, but just in case. And then call the get light function provided with the correct hand position. Then make your best guess about if the gesture means light on or off and call toggle light. I think what's interesting experimenting with this is that sometimes you change the prompt just a little bit and have totally different output for better or worse.

In terms of tool call logic, I'm going to go through this quite quickly because you can look at the code on the repo.

13. Live API Functions and Directional Light Control

Short description:

Exploring live API usage for Gemini, function handling based on prompts, and challenges in directional recognition during light control demo.

But when you use the live API for Gemini, you have different hooks that you can use, and on tool call, so when the AI figures out that you're trying to call different functions, we have two functions here, get light and toggle light. Here it's going to list the functions that the AI is supposed to be calling based on the prompt that I gave it. I have an array for the function response parts because the response from get light is going to be the ID, and it's going to have to pass it automatically to the toggle light function and things like that. So here I just loop through the function calls, I extract the name, the arguments, the ID, I call the function that it thinks should be first based on the prompt again, I'm pushing the response into the array, and then it's kind of like send the tool response. You don't necessarily have to understand exactly what is happening here; you just have to understand that the AI will figure out on its own which function to call at what time and how to pass the response to the other one based on the schema that I provided. So this is where there's probably going to be an issue with that one, and I'll explain why after. But the way that this demo should work is I'm going to point to the right, and if I'm lucky, probably not, it should turn the right light on, and if I point left, I should turn the left light on. It will probably not work because, weirdly, whatever vision model they use in the background is not very good at knowing right from left, which I thought was a problem that was kind of solved in computer vision. But let's say that if their model is more kind of tuned to detecting objects and not necessarily body parts, that might be why, but let's see what happens. So if this one fails, I kind of already know, so it's fine.
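In sketch form, the tool call handling described here could look like this, based on the toolcall event and sendToolResponse method exposed by the starter repo's live API client; the exact shapes may differ from the actual code.

```js
// client and setConfig come from useLiveAPIContext() in the starter repo;
// getLight and toggleLight are the functions from the previous snippets.
useEffect(() => {
  const onToolCall = async (toolCall) => {
    const functionResponseParts = [];

    // Gemini decides which functions to call, in which order, and with which
    // arguments, based on the prompt and the schemas; we just run them.
    for (const { name, args, id } of toolCall.functionCalls) {
      let output = {};

      if (name === 'get_light') {
        output = await getLight(args); // returns { lightId }
      } else if (name === 'toggle_light') {
        await toggleLight(args); // calls the LIFX API
        output = { success: true };
      }

      functionResponseParts.push({ id, response: { output } });
    }

    // Send the results back so the model can chain them, e.g. pass the light
    // ID returned by get_light into toggle_light.
    client.sendToolResponse({ functionResponses: functionResponseParts });
  };

  client.on('toolcall', onToolCall);
  return () => client.off('toolcall', onToolCall);
}, [client]);
```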

Now... Okay. I have toggled the lights off. Well, I mean, I didn't want them off, but, okay. Now... I have toggled the lights on. But it was saying on. Oh, well, okay. How are the lights now? No, no, no, no, no, okay. So the thing is, I think I was, I don't even remember which finger I was pointing with. But it turned on a light. So I don't know if it figured out right or left, but it did a thing. But again, this is where it's actually interesting that if you're trying to, so if I just talk to the AI, let's say, let me try something. Turn the left light on. No. Okay. I'm just going to move on with that one. So let me say that. Okay.

14. Integrating TensorFlow.js for Color Control

Short description:

Adding TensorFlow.js to Gemini for color control based on hand gestures.

This is not the most interesting one. So again, maybe later on, if at some point we can, you know... I think at some point I was thinking, oh, this is where I could augment it with TensorFlow.js, where TensorFlow would know the right from left and then I could augment that. But this is where I'm going to move on to something else. So let's add TensorFlow.js to Gemini.

So here I'm not necessarily, I'm building each demo on top of each other. So we still have our get light function, we still have our toggle light, but I want to set the color. So I want to use fine grain data to be able to rotate my hands and change the color of the light. I'm just saying, I'm just adding to my prompt. Then if the user is making a palm gesture, again, I'm not describing what a palm gesture is. I said if you see a palm, call the set color function provided.

I'm not going to go through what the set color function is because TensorFlow.js just detects the hands, and what I'm doing is calculating the angle between the tip of my thumb and the tip of my pinky. And if you imagine like a horizontal line, as I'm moving in, like, degrees, basically, I'm changing the hue of the light bulb. I mean, that's what is supposed to happen. OK, so let's see if that works. So this is actually a really exciting one. That's why I wanted it to work.
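A sketch of what that set color logic could look like: hand key points from TensorFlow.js, the angle against an imaginary horizontal line, and a hue sent to the LIFX API. The exact key points and mapping are assumptions (the talk mentions the pinky here and the index finger later in the Q&A).

```js
// detector is the hand-pose-detection model instance from the earlier sketch.
async function setColor(video, detector) {
  const hands = await detector.estimateHands(video);
  if (hands.length === 0) return;

  const keypoints = hands[0].keypoints;
  const thumbTip = keypoints.find((k) => k.name === 'thumb_tip');
  const pinkyTip = keypoints.find((k) => k.name === 'pinky_finger_tip');

  // Angle between the thumb-to-pinky line and an imaginary horizontal line:
  // roughly 0 degrees with the hand flat, growing as the hand rotates.
  const radians = Math.atan2(pinkyTip.y - thumbTip.y, pinkyTip.x - thumbTip.x);
  const degrees = Math.abs((radians * 180) / Math.PI);

  // Map the angle (0-180) to a hue (0-360) and send it to the LIFX API.
  const hue = Math.round((degrees / 180) * 360);
  await fetch('https://api.lifx.com/v1/lights/all/state', {
    method: 'PUT',
    headers: {
      Authorization: `Bearer ${process.env.REACT_APP_LIFX_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ color: `hue:${hue} saturation:1.0` }),
  });
}
```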

15. Color Changes with Hand Gestures

Short description:

Exploring color changes in lights based on hand gestures using JavaScript with success and potential improvements.

So just to explain before I do it, the thing is I might have to OK, I kind of want to see if things are happening. So I might have to do that. Do I? Yeah. OK. So what I want to do is I want to build on the demo before. So I want to point to a light. If I'm lucky, it will turn the correct light on. If it doesn't, I think I can still move on. Then I'm going to do the palm. And as I'm rotating my hand, it should change the color. I'm just going to open the console because if it doesn't work, I'll know. And because it's JavaScript, I'll just refresh.

OK. Now? OK, I've toggled the light on your right. Now? I have set the color of the light on your right to red. Woo! Woo! Oh, yeah! OK, I'm so happy. All right, cool. So it worked! I shouldn't be that surprised. But because it's generative, you also don't know if it's going to work or not. So that was cool. So it means that you can rely on what Gemini gives you out of the box, and then you can also customize with making more interesting interfaces.

It was struggling a little bit. I don't know if you saw, it was kind of slow. It's not usually that slow. So I don't know if it's just like I need to update the model or maybe my connection is not good. I'm not sure. I'm over time, so I'm going to rush a little bit. So I'm not going to demo that one, but I want to talk about how I would necessarily do it. So I explained it a little bit at first, but I've been using a platform called DataStax, and you could have, again, because you can see the stream from the webcam in the browser, you can then take screenshots, transform that into vector embeddings. That will be stored then in a DataStax vector database with a description of my gesture.

16. Custom Gesture Databases and Autonomous Interfaces

Short description:

Experimenting with custom gesture databases for privacy and personalized gesture recognition with Gemini, aiming for autonomous and intuitive interfaces.

So I'm kind of like creating my own custom database of gestures, but it doesn't mean that it would only work on me. It means that, you know, if someone else is using my app and their gesture, once transformed into vectors, is similar to a gesture that I already have saved, then you could return the description of that gesture and then feed that to Gemini. So based on the description that you get, do something. And the reason why I wanted to experiment with this is, again, with Google, I don't know where my stuff goes. I don't know if they keep it. I don't know if they don't keep it. Also, I know that it's very difficult to recreate an image out of vector embeddings, so, for privacy, I feel a little better if I end up installing that in my house and it just looks at me all the time. I would rather know that I'm sending stuff to my custom DB rather than to Google. And also, with Gemini, if I'm doing a gesture that Gemini, by default, does not recognize, let's say, I think it didn't recognize, like, the Star Trek whatever sign, it means that I can train it. I can tell Gemini, hey, when I do this, turn the lights off. But if I refresh, then I'm losing it, because it doesn't necessarily save your sample, at least not the free version. Whereas if you do this but you save it into your custom database, then it will always be there.
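A hedged sketch of that flow with the DataStax Astra DB client (astra-db-ts): the embedding step is only a placeholder, and the collection name and setup are assumptions rather than the talk's actual code.

```js
import { DataAPIClient } from '@datastax/astra-db-ts';

// Placeholder: turn a webcam screenshot into a vector embedding in JavaScript,
// e.g. with a TensorFlow.js image model. A dummy fixed-size vector is returned
// here just to keep the sketch self-contained.
async function embedScreenshot(imageData) {
  return new Array(512).fill(0);
}

const client = new DataAPIClient(process.env.ASTRA_DB_TOKEN);
const db = client.db(process.env.ASTRA_DB_ENDPOINT);
const gestures = db.collection('gestures'); // assumes a vector-enabled collection

// Saving a labelled gesture: the screenshot's embedding plus a description.
async function saveGesture(imageData, description) {
  const $vector = await embedScreenshot(imageData);
  await gestures.insertOne({ description, $vector });
}

// Recognising a gesture: find the most similar stored vector and return its
// description, which is then the only thing fed to Gemini for function calling.
async function recognizeGesture(imageData) {
  const $vector = await embedScreenshot(imageData);
  const match = await gestures.findOne(
    {},
    { sort: { $vector }, includeSimilarity: true }
  );
  return match ? match.description : null;
}
```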

Anyway, so I'm getting to the end of this talk. I added a few resources, but if you have any questions and you want any kind of link, you can reach out to me on Bluesky, probably more than Twitter. I'm planning at some point to write a blog post about this, because it's ongoing, and it will be there if you're interested; my website is charliegerard.dev. There is already existing research by actual researchers who have published papers about this. And otherwise, I just wanted to kind of, like, end again on the original inspiration. What I love about experimenting with this is that I feel like it's getting closer to something that could actually be happening with only a laptop. And it means that I really want people at some point to be able to, I don't know, turn these lights on, and then at home, they just talk to, I don't know, whatever AI device they have, and they get to customize the way that they want to experiment with stuff. And in the end, we could have interfaces where you're just like, boom, boom, boom, boom. And it just learns what you do. And maybe you don't even have to talk to it. It just learns how you work. And you don't have to buy a Kinect. You don't have to have, like, this really big setup. But it means that already 12 years ago, people were thinking about this. I'm still thinking about this. And hopefully, maybe at some point, it can actually be embedded in the browser, because now the Gemini API is in the browser as well.

QnA

Personalized Interfaces and Motion Control

Short description:

Exploring interfaces with motion control possibilities for personalized interactions and encouraging experimentation with AI tools.

And we could be hoping, here I showed with lights, but it does work as well with interfaces. So I was thinking, oh, when I do this, refresh. When I do this, change tabs. And if it ends up being in the browser natively, then people can build their own interfaces. Anyway, I don't know how much time I have, but I'm definitely over time. So thank you very much for listening. And hopefully, you want to experiment with stuff like that as well. We have a very young person who's joining us probably on the livestream, because I haven't seen many kids in the room today. So, as a 10-year-old, how do you suggest I start to learn about this kind of stuff? I mean, well, it depends. It depends if the person is already a programmer or not. Hm. There's so many ways that you can do motion detection. The thing with multimodal AI here is that you can go to the Gemini AI studio right now, and it's an interface that looks kind of similar to that. And you can talk to the AI and say, like, what do you see? What do you see? And then you can experiment with that. And if the person knows how to write code, then you can, you know, get a response back from Gemini and decide to not necessarily turn on lights if you're 10 and you're just getting started. But it could be, like, I don't know, clicking on a button in an interface or something. So really, I don't know. I don't know anything about this person, so I don't know if they know how to code or not. But there's a lot of different tools. So, yeah. Sorry. Well, I mean, the first answer is joining apparently the JS Nation livestream and asking these questions. Really cool. All right. I have such a hard time always, you know, deciding which one to do first. So I'll just go by most upvoted. Is the code open source? Or can you go over the set color function again, please? So the code is not yet open source, because it's a big mess. If I was doing it open source right now, it would probably be, yeah, a bit hard for people to even know what's going on. Because I'm getting lost in it. But at some point I will.

TensorFlow.js Set Color Function and Gemini Nano

Short description:

Discussing the set color function in TensorFlow.js for hand gesture detection and color manipulation, exploring potential experiments with local models and the unknown Gemini Nano.

I will want to. So I think when I finish the last demo about doing it yourself, I want to open source it, because I want people to see. But I think the rest of the question was, okay, can I go over the set color function? So what the set color function does, in the same code you import the models from TensorFlow.js. And there's a function that's called estimate hands that triggers the hand detection. And from that, you're getting an array of key points. And I filtered it to only extract the key points that were the tip of the thumb and the tip of the index. So I get x and y coordinates all the time about the position of these two points on the screen. And then I'm just doing some kind of math, just a little bit. I'm calculating the angle with the imaginary horizontal line, the angle here, so that when I start with my hand here, basically the angle is zero, because I'm not at an angle. And then as I rotate here, I basically calculate the angle here. And I'm applying that to a hue value to change the color. So... Yeah.

All right. Cool. Thank you. Did you try the same experiments with local models? I have not. So at the moment I really just experimented with the available Gemini models. So I haven't tried with anything else. I would be curious. At some point maybe I will, to kind of, like, know the difference. And I'd be curious to know if other models, like, recognize more or less gestures by default. But I haven't yet. No. All right.

Maybe a little bit of the same ilk, is it possible to use Gemini Nano instead? I have no idea. To be honest, I don't even know what Gemini Nano is. So now I have to go and learn. Yeah. Perhaps.

Facial Recognition Enhancement and Custom Gestures

Short description:

Discussing Gemini Nano, enhancing facial recognition for labeled gestures, and creating custom gestures like eyebrow-controlled camera zoom.

Or people will join you at your Q&A spot later on. Sure. Someone else can tell me what Gemini Nano is. Or they have just heard of it as well. Very cool.

Like, can we enhance facial recognition to extract more labeled gestures? Oh! So it means if you... I mean, I don't know who has the question. But is it, like, can you recognize, like, a smile or frowning? Yes. So at some point I built a silly demo that someone was asking me, like, oh, can you do something where, for, like, streamers, where when you raise your eyebrow, it zooms in on the camera? You know, when you make a silly gesture and it would be like zzz.

And then when you raise the other eyebrow it would zoom out. And so using raw key points from different places over the eyebrow, I calculate the angle of my, like, when you raise your eyebrow like this. And then you apply it to, like, zoom so that, you know, you could have a nice effect where you look into the camera and then you raise your eyebrow, it's like zzz, like, anyway. So you can, I mean, you get raw key points. So you decide how you want to kind of detect different gestures. Yeah. You could do a custom blink with the right eye or the left eye, or both eyes, or a smirk, or, yeah. You decide that. So. You decide. Yeah.
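As a rough illustration of that kind of face key point logic, here is a sketch with the TensorFlow.js face landmarks model; the landmark indices are placeholders and the zoom mapping is an assumption, not the code from the streamer demo.

```js
import '@tensorflow/tfjs-backend-webgl';
import * as faceLandmarksDetection from '@tensorflow-models/face-landmarks-detection';

const detector = await faceLandmarksDetection.createDetector(
  faceLandmarksDetection.SupportedModels.MediaPipeFaceMesh,
  { runtime: 'tfjs' }
);

// Placeholder landmark indices: look up the real eyebrow/eye indices in the
// MediaPipe FaceMesh keypoint map before using this.
const EYEBROW_POINT = 105;
const EYE_TOP_POINT = 159;

let restingDistance = null;

async function checkEyebrow(video, onZoom) {
  const faces = await detector.estimateFaces(video);
  if (faces.length === 0) return;

  const keypoints = faces[0].keypoints;
  const distance = Math.abs(keypoints[EYEBROW_POINT].y - keypoints[EYE_TOP_POINT].y);

  // Remember the neutral eyebrow-to-eye distance the first time, then treat a
  // clearly bigger distance as a raised eyebrow and map it to a zoom factor.
  if (restingDistance === null) restingDistance = distance;
  if (distance > restingDistance * 1.3) {
    onZoom(1 + (distance - restingDistance) / restingDistance);
  }
}
```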

Open Source Considerations and Latency Challenges

Short description:

Discussing open-sourcing projects, issues with latency in gesture detection, and concerns about the performance of generative AI models.

Very cool. You mentioned, or the question earlier was if you plan to open source any of this. And then there is, of course, the question also, if you're, if you'd be looking for any contributors and maybe specifically in what areas would you look for contributions? So I usually don't. Not because I don't like people. But it's just that I do that very sporadically. So it's only when I have time, again, that's not my job. So it's hard to have a contributor if, you know, someone wants to build a specific thing. So what I'm doing right now is not open source. But all the other things I showed at the beginning of the talk with TensorFlow.js, these ones are all open source. So if we're talking about contributing in terms of opening a PR or something, sure. Most of the time I miss the notifications. But it's just hard with the way that I'm, like, because it's not my job, I just do it so like not that often that it probably would be boring if someone wanted to contribute with me. So, yeah. I mean, I feel like that's the case for most maintainers. But yeah. Thank you for being considerate.

What are the best ways to reduce latency between the gesture detection and action? So if you do it with TensorFlow.js, it's not really something that you control. It's like 60 frames per second with the key points. For the generative AI, for that, I don't know. Because while it was really a bit slow today, it usually is not that slow. So again, I don't know if it has to do with maybe a lot of people, like, slamming that model right now. You know, sometimes ChatGPT is like, oh, we're a bit slow because everybody is using us or something. At least the free version. So I don't know if today it was the connection or if it just, like, struggled somehow. I think this is one of the parts of generative AI where I'm not quite sure what's going on. So I assume or I hope that as models get better, or maybe if I paid for a version that's more performant, it would be better. Because at the moment, if I do that and it takes a few seconds, you know, you don't apply it to something that's time sensitive. I think that we're still kind of in the early age. And I would love to know in a few years if it's, like, really as fast as, let's say, the things you get with TensorFlow.js.

Future Interface Speed and Acoustic Recognition

Short description:

Discussing the speed of future advancements in interfaces, real-world use cases with acoustic activity recognition, and personalized AI learning from sounds and activities.

If it was really... Yeah, if you were getting your labels and everything a lot faster, then you could really build cool interfaces. So, yeah. Very nice.

Let's do one last question. That starts with a compliment. Really nice presentation. Really lovely presentation. Are there any real-world use cases that you would like to see? Yes. I mean, because I've been... So my two inspirations were the one that I showed, so Room E, where... And there was another paper that didn't necessarily use, like, video or gestures, but it was based on sound. It was a project by Apple called Listen Learner, where it was a custom device that was listening to activities around you. And it was using what's called acoustic activity recognition.

Everything that you do makes a certain sound. Like, when I open this bottle, it broke the thing and I recognized the sound. Or when you open the door of your fridge, you recognize that sound. Or when your coffee pot is done, you recognize that sound. And you can... In that project, the person was just in their apartment, like, cutting things on a chopping board or whatever. And the AI was clustering by itself different sounds. And if it had not heard that sound before, it was asking the user, like, oh, what is that sound? And then you just say it once, you label it once, like, I'm cutting carrots or whatever.

Personalized AI Assistance and Future Systems

Short description:

Discussing the future of customized systems based on acoustic and activity recognition, learning user behaviors without explicit training, and the desire for AI to proactively assist in daily tasks.

And it would then cluster by itself similar sounds. And it means that then you could have a system that knows what you're doing without having to train it with millions of samples. And it would be very customized, because it's based on what's happening around you. And if you mix that acoustic recognition or activity recognition with also the recognition of movements, then you would have potentially home systems that really learn how you behave and they would maybe know in advance what you're trying to do.

It would learn from your habits of, like, what time certain sounds kind of happen and what the next sound usually is or the previous sound and it would learn your gestures without you having to say anything. And I think I'm, like, I want to live in a world where that happens. I feel like even now I was thinking with the AI that we have now, we don't have something as simple as, like, I even want something that looks at my calendar and it sees that I have an interview next week and I want the AI by itself to know, like, do you want me to help you with interview questions?

Or, like, I can see that you have something coming up next week. And it doesn't even do that. Or maybe I'm not, like, using a tool that does that, but it's like I want the AI to learn about me without me having to, like, say anything. Maybe at some point we'll get there, but I feel like it's definitely not there. The more I experimented with this, the more I realized we're so far from AGI. Like if I do this and it thinks it's my left hand and then it doesn't even turn the light on but it tells me I turned the light on, it's like, no, you didn't.

- GPT fundamentals- Pitfalls of LLMs- Prompt engineering best practices and techniques- Using the playground effectively- Installing and configuring the OpenAI SDK- Approaches to working with the API and prompt management- Implementing the API to build an AI powered customer facing application- Fine tuning and embeddings- Emerging best practice on LLMOps