Video Summary and Transcription
MediaPipe is a cross-platform framework that helps build perception pipelines using machine learning models. It offers ready-to-use solutions for various applications, such as selfie segmentation, face mesh, object detection, hand tracking, and more. MediaPipe can be integrated with React using NPM modules provided by the MediaPipe team. The demonstration showcases the implementation of face mesh and selfie segmentation solutions. MediaPipe enables the creation of amazing applications without needing to understand the underlying computer vision or machine learning processes.
1. Introduction to MediaPipe and Machine Learning
Shivay Lamba, a Google Summer of Code mentor at MediaPipe, speaks at React Advanced about using MediaPipe to create cross-platform machine learning applications with React. Machine learning is literally everywhere today, and it's important to use it in web applications as well. MediaPipe is Google's open source cross-platform framework that helps build perception pipelines using machine learning models. It can process audio, video, image-based data, and sensor data, and includes features like end-to-end acceleration.
Hello, everyone. I'm Shivay Lamba. I'm currently a Google Summer of Code mentor at MediaPipe, and I'm so excited to be speaking at React Advanced on the topic of using MediaPipe to create cross-platform machine learning applications with React.
So a lot of this talk is going to be centering around machine learning, MediaPipe, and how you can integrate, basically, MediaPipe with React to create really amazing applications.
So without wasting any further time, let's get started.
The first thing, of course, I mean, today, machine learning is literally everywhere. You look at any kind of an application, you'll see machine learning being used there. Whether it's education, healthcare, fitness, or mining, for the sake of it. You'll find the application of machine learning today in each and every industry that is known to humankind.
So that makes machine learning so much more important to also be used in web applications as well. And today, as more and more web applications are getting into the market, we are seeing a lot more of the machine learning use cases within web applications as well.
And let's actually look at a few examples. For instance, over here we can see face detection happening on an Android device. Then you can see the hands getting detected in this iPhone XR image. Then you can see the Nest Cam that everyone knows as a security camera. Then you can see some of these web effects, where this lady has facial effects applied to her face in the browser. Or you can also see the Raspberry Pi and other such microchip-based devices that run on the edge.
And what is the thing in common in all of these? That's the question. The thing that is common in all of these is MediaPipe.
So what exactly is MediaPipe? MediaPipe is essentially Google's open source cross-platform framework that helps you to build different kinds of perception pipelines. What that means is that we are able to build or use multiple machine learning models and combine them in a single end-to-end pipeline to, let's say, build something. And we'll also look at some of the common use cases very soon.
And it has previously been used widely in a lot of the research-based products at Google. But now it has been made open source upstream, so everyone can actually use it. It can be used to process any kind of audio, video, or image-based data, and also sensor data. And it helps primarily with two things: one is dataset preparation for different kinds of machine learning pipelines, and the other is building end-to-end machine learning pipelines. Some of the features included within MediaPipe are end-to-end acceleration, because everything is actually happening on-device.
2. MediaPipe Solutions and Real-World Examples
MediaPipe is a cross-platform-based framework that offers ready-to-use solutions for various applications. Some of the solutions include selfie segmentation, face mesh with over 400 facial landmarks, hair segmentation, object detection and tracking, facial detection, hand tracking, human pose detection and tracking, holistic tracking, and 3D object detection. These end-to-end solutions are popular and have real-world applications in AR, movie filters, Google Lens, and augmented faces. A live perception example demonstrates hand tracking using landmarks to denote the edges of the hand.
Then secondly, you just have to build it once, and it can be used across different platforms, including Python, JavaScript, Android, and iOS. So you just have to build it once and you can use it on different types of platforms. That is why we are calling it a cross-platform framework.
And then these are ready-to-use solutions. You just have to import them and integrate them into your code, and they are very easy to use. And the best part about it is that it is open source. So all the different solutions and all the different codebases can be found in the MediaPipe repository under Google's organization on GitHub.
Now, looking at some of the most commonly used solutions, some of the most well-known ones include the selfie segmentation solution, which is also actually being used in Google Meet, where you can see the different kinds of backgrounds that you can apply, or the blurring effect. What it does is use a segmentation mask to detect only the humans in the scene, so it is able to extract only the information needed for the humans. Then we have Face Mesh, which basically has more than 400 facial landmarks that you can place, and you can make a lot of different interesting applications using this, for example, AR filters or makeup. Then we have hair segmentation that allows you to segment out only the hair. Then we have standard computer vision algorithms like object detection and tracking that you can use to detect specific objects. Then we have facial detection, and we also have hand tracking that can track your hands, which you could use for things like hand-based gestures to control, let's say, your web application. Then we have the entire human pose detection and tracking that you could use to create some kind of a fitness application or a dance application that can actually track you. Then we have holistic tracking that tracks your entire body: your face, your hands, your entire pose. So it's a combination of basically the human pose, hand tracking, and the face mesh. Then we have some more advanced object detection, like the 3D object detection that can help you to detect bigger objects like a chair, shoes, or a table. And then we have a lot more other kinds of solutions that you can go ahead and look at. And these are all end-to-end solutions that you can directly implement. That is why MediaPipe solutions are so popular.
And just to look at some of the real-world examples where it's actually being used: we just spoke about the face mesh solution that you can see over here, taking place in the AR lipstick try-on that is there on YouTube. Then we have the AR-based movie filters that can be used directly in YouTube. Then we have some Google Lens surfaces where you can see augmented reality taking place. And you can also see it being used not only in these augmented reality experiences but also in other kinds of inferences, like Google Lens translation, which also uses MediaPipe pipelines under the hood. And you can see augmented faces, which again is based on the face mesh. So let's look at a very quick live perception example of how it actually takes place. For this, we're going to be looking at hand tracking. Essentially, what we want to do is take an image or a video of your hand and put these landmarks on it. What are landmarks? Basically, landmarks are these dots that you see, which you can superimpose on your hand; they sort of denote all the different joints and edges of your hand. So this is what the example is going to look like. And how would that simple perception pipeline look? Essentially, first you'll take your video input.
3. Integrating MediaPipe with React
In MediaPipe, frames from videos are broken down into tensors, which are high-dimensional numerical arrays that contain the information the machine learning model works with. These tensors undergo geometric transformations and machine learning inference to create landmarks that are rendered on top of the image. MediaPipe uses graphs and calculators to represent the perception pipeline, with calculators serving as the main brain behind the solution. Documentation and visualization tools are available to explore and understand the calculator graph. MediaPipe can be integrated with React using NPM modules provided by the MediaPipe team, such as face mesh, face detection, hand tracking, 3D object detection, pose estimation, and selfie segmentation. Real-world examples and code snippets are available for reference. The integration involves importing React, a webcam component, and the desired MediaPipe solution, such as selfie segmentation, and using the webcam and canvas elements to process and render the landmarks.
Then basically you'll be getting the frames from your video, and you'll be breaking down each frame into a size that is usable by the tensors, because internally MediaPipe uses TFLite, that is, TensorFlow Lite. So you're working with tensors. These are high-dimensional numerical arrays that contain the information the machine learning model works with. So basically you'll be doing a geometric transformation of your frame into a size that can be used by the tensors. These images will get transformed into the mathematical format of the tensors, and then you'll run the machine learning inference on those. You'll then do some high-level decoding of the tensors, and that will result in the creation of the landmarks; you'll render those landmarks on top of the image and you'll get that output. So essentially what will happen is that if you have your hand and you superimpose those landmarks on top of it, you'll finally get the result that you see, which is basically the hand tracking.

So this way we can build these kinds of pipelines. And what's happening behind the scenes, or under the hood, is that we have the concept of graphs and calculators. If you are aware of the graph data structure, of how a graph has edges and vertices, a MediaPipe graph works in a similar manner: whenever you're creating any kind of a perception pipeline, or a MediaPipe pipeline, it basically consists of a graph with nodes and edges, where the nodes specifically denote the calculators. Essentially, the calculators are the C++ configuration files that define what exact kind of transformation takes place; you could think of the calculator as the main brain behind the solution that you're implementing. So these are the nodes, and the data which comes into a node, is processed, and comes out of the node, along with all of those connections via the edges, is what represents the entire MediaPipe graph. That includes the edges, and then what the input port at the calculator is and what the output port is. The input is what comes into the calculator, and once the calculations and transformations have been done, the output is what comes out. So essentially that is how you can think of the entire perception pipeline: using different kinds of calculators together to form, let's say, one particular solution, and all of that is represented through this MediaPipe graph. That's essentially the backend structure of a MediaPipe solution.

Now, you can also look at the docs to get to know more about these calculator graphs by going to docs.mediapipe.dev, or you can actually visualize different types of perception pipelines. The one that we used was a very simple one, where we were just using it to detect the landmarks on your hand, but if you have much more complex pipelines, you can go ahead and use viz.mediapipe.dev to visualize them and look at some of the pipelines that are on offer on that particular site.
And now coming to the essential part, what this talk is really all about, and that is how you can integrate MediaPipe with React. There are a lot of NPM modules that are shared by the MediaPipe team at Google, and some of these include face mesh, face detection, hands (that is, the hand tracking), holistic (which combines the face mesh, hands, and pose), Objectron (that is the 3D object detection), pose, and selfie segmentation, which we just covered and which is basically how the Zoom or Google Meet backgrounds work. For all of these, you will find the relevant NPM packages, and you can refer to this particular slide. You can also look at the real-world examples that have been provided by the MediaPipe team; these are available on CodePen, so you can refer to any of them to see how they have been implemented. But what we are going to be doing is implementing this directly in React.

So here is a brief example of how it's supposed to work. In the first piece of code that you can see at the top, we have imported React, and we have also imported the webcam, because the input stream that we are going to be using comes from the webcam. Then we have imported one of the solutions over here as an example, and that is the selfie segmentation solution, which we have imported from the MediaPipe selfie segmentation NPM module. We have also imported the MediaPipe camera utils, which are used to fetch the frames from the camera. We do also have some other utils that help you to actually draw the landmarks, which we will cover in a bit. After that, you can see the code where we have used the actual MediaPipe selfie segmentation, and again, the best part about this is that you're not supposed to be writing 100, 200, 300 lines of machine learning code. That's the benefit of using MediaPipe solutions: everything is packed into this code, and important machine learning tasks like object detection and object tracking, which usually run into 200 or 300 lines of code, can be done in less than 20 to 30 lines. Over here we've simply created our function for the selfie segmentation where we are using the webcam as a reference, and on top of that a canvas as a reference, because the webcam is sort of the base: you get your frames from the webcam and then you use the canvas element on top of it to render the results. And over here you can see that we are just using the CDN to get the MediaPipe selfie segmentation solution.
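To make that concrete, here is a minimal sketch of what such a selfie segmentation component could look like in React. It follows the NPM packages mentioned above (@mediapipe/selfie_segmentation, @mediapipe/camera_utils, react-webcam); the component and file names, the CDN path, and the option values are illustrative assumptions rather than the exact code shown on the slide.

```jsx
// SelfieSegmentationDemo.jsx -- hypothetical file/component name
import React, { useRef, useEffect } from "react";
import Webcam from "react-webcam";
import { SelfieSegmentation } from "@mediapipe/selfie_segmentation";
import { Camera } from "@mediapipe/camera_utils";

export default function SelfieSegmentationDemo() {
  const webcamRef = useRef(null); // base layer: the raw camera frames
  const canvasRef = useRef(null); // overlay: where results are rendered

  useEffect(() => {
    // Basic callback: draw the processed frame onto the canvas.
    // (The virtual-background version of this callback is sketched later.)
    const onResults = (results) => {
      const canvas = canvasRef.current;
      const ctx = canvas.getContext("2d");
      ctx.clearRect(0, 0, canvas.width, canvas.height);
      ctx.drawImage(results.image, 0, 0, canvas.width, canvas.height);
    };

    // locateFile loads the model/wasm assets from the CDN instead of bundling them.
    const selfieSegmentation = new SelfieSegmentation({
      locateFile: (file) =>
        `https://cdn.jsdelivr.net/npm/@mediapipe/selfie_segmentation/${file}`,
    });
    selfieSegmentation.setOptions({ modelSelection: 1 });
    selfieSegmentation.onResults(onResults);

    // camera_utils pulls frames from the webcam's <video> element
    // and feeds each one to the solution.
    const videoElement = webcamRef.current && webcamRef.current.video;
    if (videoElement) {
      const camera = new Camera(videoElement, {
        onFrame: async () => {
          await selfieSegmentation.send({ image: videoElement });
        },
        width: 640,
        height: 480,
      });
      camera.start();
    }
  }, []);

  return (
    <div>
      <Webcam ref={webcamRef} style={{ position: "absolute" }} />
      <canvas ref={canvasRef} width={640} height={480} style={{ position: "absolute" }} />
    </div>
  );
}
```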
4. Demonstration of Face Mesh Solution
It's now time for the demonstration, so we'll move quickly into the code. I have implemented a very simple React application. It includes examples of the hands, face mesh, holistic, and selfie segmentation solutions. I'll be demonstrating the face mesh solution. We use the new camera object to get the reference of the webcam and send each frame to our face mesh, where it can render the landmarks on top of the actual frame.
And then we are rendering the solutions. We're rendering the results on top of whatever is being detected. But yeah, so far it's all been discussion, and now it's demonstration time, so we'll move quickly into the code. If everyone is excited, I'm more than happy to now share the demonstration.
So let me go back to my VS Code. Over here, basically, I have implemented a very simple React application. It's a simple Create React App project that you can set up very easily; you can find it in the Create React App documentation. In my main App.js code, what I've done is that I have integrated four different types of examples. These four examples include the hands, the face mesh, the holistic, and the selfie segmentation solutions. So I'll be quickly showing you the demos of these, and I'm just going to demonstrate how easy it is to integrate such a MediaPipe solution, or machine learning solution, into your app.
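As a rough idea of the structure being described, here is a sketch of what that App.js could look like; the component names and file paths are placeholders, not the exact ones from the demo.

```jsx
// App.js -- sketch; component names and paths are placeholders
import React from "react";
import FaceMeshDemo from "./FaceMeshDemo";
// import HandsDemo from "./HandsDemo";
// import HolisticDemo from "./HolisticDemo";
// import SelfieSegmentationDemo from "./SelfieSegmentationDemo";

export default function App() {
  // Only one solution component is rendered at a time; the others stay
  // commented out and are swapped in when demoing that solution.
  return (
    <div className="App">
      <FaceMeshDemo />
      {/* <SelfieSegmentationDemo /> */}
    </div>
  );
}
```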
So within my App function, I have for now commented out all the other solutions, and the first one that I'll demonstrate is the face mesh. I have imported the components for each one of these, and currently I'm just returning the face mesh component. If I go very quickly to my face mesh component, over here as we walk through the code, you can see that I've imported React, and I have imported some of the landmark connections. Basically, whenever we are talking about, let's say, the face, we want the right eye, the left eye, the eyebrows, the lips, the nose, and so on. These are the ones that we have imported specifically from the face mesh package. Then we have created our function, and we have created the FaceMesh object that will be used to render the face mesh on top of our webcam. Over here, we have again pointed it at the CDN, and we are using faceMesh.onResults to render the results that we will see. Then we start off by getting the camera. We use the new Camera object to get the reference of the webcam, and with that we have created an async function, because the machine learning model itself can take some amount of time to load, so we wait for the landmarks to actually load. That is why we send to the face mesh the webcam reference, that is, the current frame that we're using. So once the camera loads and the frames start coming in, we send each frame to our face mesh, where it can actually render the landmarks. Then, in the const onResults function, we have taken our video input, and on top of it we are rendering the canvas element using the canvas context, and we are now going to render the facial landmarks on top of the actual frame that we're seeing.
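Here is a hedged sketch of that face mesh setup, assuming the same component skeleton (webcamRef, canvasRef, and an onResults callback) as the selfie segmentation sketch above; the option values are illustrative, while the FaceMesh and Camera calls follow the public @mediapipe/face_mesh and @mediapipe/camera_utils APIs.

```jsx
// FaceMeshDemo.jsx (fragment) -- assumes the same webcamRef/canvasRef skeleton
// and an onResults callback as in the earlier selfie segmentation sketch.
import { FaceMesh } from "@mediapipe/face_mesh";
import { Camera } from "@mediapipe/camera_utils";

// Inside the component (e.g. in a useEffect after the webcam has mounted):
const faceMesh = new FaceMesh({
  // locateFile loads the model files from the CDN
  locateFile: (file) => `https://cdn.jsdelivr.net/npm/@mediapipe/face_mesh/${file}`,
});
faceMesh.setOptions({
  maxNumFaces: 1,
  minDetectionConfidence: 0.5,
  minTrackingConfidence: 0.5,
});
faceMesh.onResults(onResults); // onResults draws the landmarks (sketched below)

// The Camera wraps the webcam's <video> element. onFrame is async because the
// model takes time to process, so each frame is awaited before the next is sent.
const camera = new Camera(webcamRef.current.video, {
  onFrame: async () => {
    await faceMesh.send({ image: webcamRef.current.video });
  },
  width: 640,
  height: 480,
});
camera.start();
```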
5. Demonstration of Face Mesh and Selfie Segmentation
In the demonstration, we render different landmarks and return the webcam and canvas. We can see the face mesh on top of the demo, and it tracks facial landmarks as we move. We can also try the selfie segmentation, which provides a custom background by colorizing everything except the human body. You can explore other solutions provided by the React code.
So that is what you see over here, using drawConnectors, which is a util provided by MediaPipe. Very quickly, what we're doing is rendering all the different landmarks, and we are finally returning our webcam, and also the canvas that is going to be put on top of the webcam.
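For reference, here is a sketch of what that drawing callback and the returned JSX could look like, using drawConnectors from @mediapipe/drawing_utils and the landmark-connection constants exported by @mediapipe/face_mesh; the colors and line widths are arbitrary, and canvasRef, webcamRef, and the Webcam component are the same as in the component skeleton above.

```jsx
import { drawConnectors } from "@mediapipe/drawing_utils";
import {
  FACEMESH_TESSELATION,
  FACEMESH_RIGHT_EYE,
  FACEMESH_LEFT_EYE,
  FACEMESH_FACE_OVAL,
  FACEMESH_LIPS,
} from "@mediapipe/face_mesh";

// Called by faceMesh.onResults with the processed frame and detected landmarks.
const onResults = (results) => {
  const canvas = canvasRef.current;
  const ctx = canvas.getContext("2d");

  ctx.save();
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  // Draw the current video frame first, then the landmarks on top of it.
  ctx.drawImage(results.image, 0, 0, canvas.width, canvas.height);

  if (results.multiFaceLandmarks) {
    for (const landmarks of results.multiFaceLandmarks) {
      drawConnectors(ctx, landmarks, FACEMESH_TESSELATION, { color: "#C0C0C070", lineWidth: 1 });
      drawConnectors(ctx, landmarks, FACEMESH_RIGHT_EYE, { color: "#FF3030" });
      drawConnectors(ctx, landmarks, FACEMESH_LEFT_EYE, { color: "#30FF30" });
      drawConnectors(ctx, landmarks, FACEMESH_FACE_OVAL, { color: "#E0E0E0" });
      drawConnectors(ctx, landmarks, FACEMESH_LIPS, { color: "#E0E0E0" });
    }
  }
  ctx.restore();
};

// The component finally returns the webcam plus the canvas stacked on top of it.
return (
  <div>
    <Webcam ref={webcamRef} style={{ position: "absolute" }} />
    <canvas ref={canvasRef} width={640} height={480} style={{ position: "absolute" }} />
  </div>
);
```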
Then finally, we are exporting this React component to be used in our App.js. So very quickly, jumping into the demonstration: I'll open up an incognito window and go to localhost:3000, where it will open up the webcam. As you can see, hi, everyone, that's the second me, and very soon I should be able to see the face mesh actually land on top of the demo. As you can see, that's the face mesh, boom. Great. And as you can see, as I move around, open my mouth, and close my eyes, you can see how all the facial landmarks follow. So I can close this and very quickly change to another demonstration; let's try the selfie segmentation. For this, what I'll do is comment out my face mesh and uncomment the selfie segmentation, and I'll save it. And while it's loading, you can see the selfie segmentation code over here again. What we're doing here is providing a custom background, and that background is defined over here, where we are using the canvas fillStyle. It will not colorize your human body, but it will colorize everything else. So we're using the fillStyle to, let's say, add a virtual background.
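For the selfie segmentation version, here is a hedged sketch of how that onResults callback could produce the solid-color background: it draws the segmentation mask, fills everything outside the mask with a color, and then draws the camera frame behind it. The specific globalCompositeOperation values are my own choice for the "color everything except the person" effect and may differ from the exact demo code.

```jsx
// onResults for the selfie segmentation component (same canvasRef as before).
const onResults = (results) => {
  const canvas = canvasRef.current;
  const ctx = canvas.getContext("2d");

  ctx.save();
  ctx.clearRect(0, 0, canvas.width, canvas.height);

  // The segmentation mask is opaque where the person is.
  ctx.drawImage(results.segmentationMask, 0, 0, canvas.width, canvas.height);

  // 'source-out' keeps the fill only where the mask is NOT, i.e. the background.
  ctx.globalCompositeOperation = "source-out";
  ctx.fillStyle = "#0000FF"; // the blue virtual background from the demo
  ctx.fillRect(0, 0, canvas.width, canvas.height);

  // 'destination-atop' draws the camera image behind what is already there,
  // so the person shows through while the background stays blue.
  ctx.globalCompositeOperation = "destination-atop";
  ctx.drawImage(results.image, 0, 0, canvas.width, canvas.height);

  ctx.restore();
};
```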
So if I quickly go back to my incognito window, I can go and look at localhost:3000, and very soon, I hope, everything works fine. As you can see, this is my camera and my frame is coming in, and very soon it should load the selfie segmentation model; let's just wait for it. So as you can see, there's a blue background. Essentially, again, how it's working is that it is taking in your body and segmenting the human out of the frame, and it's coloring the rest of the entire background with this blue color, because it is able to segment out your human body and just color the rest of the scene. So similarly, you can try out the various kinds of solutions that are there. Of course, for my demonstration I showed you the face mesh and the selfie segmentation, but you can also try out all the other ones that are shared as NPM modules provided by the MediaPipe team. So that essentially is what I wanted to show with respect to the demonstration.
6. Conclusion and Future of MediaPipe
The logic behind the selfie segmentation code is concise, requiring only around 70 to 80 lines of code. MediaPipe enables the creation of amazing applications without needing to understand the underlying computer vision or machine learning processes. It is being widely used in production environments, making it the future of the web. You can connect with me on Twitter and GitHub for any queries or assistance with MediaPipe integration. Thank you for being a part of React Advanced!
Again, it's super quick, what I have just shared with everyone. Even with the selfie segmentation code, the actual logic behind the selfie segmentation that we're writing runs roughly from line 10 to line 36 of the component, so within around 70 to 80 lines of code overall, you're really creating this kind of wonderful application.
And you can just think about what kind of amazing applications you could create with the help of MediaPipe. These are just two of the examples that I have shown you. So the sky's the limit, and the best part is that you don't really need to know what's happening behind the scenes. You don't need to know what kind of computer vision or machine learning is happening. You just have to integrate it. And that is why today we are seeing MediaPipe being used in so many live examples, in production environments, by companies and startups. So it's really the future of the web, and it's easily integrable with React.
That is the future of the web as well. So with that, that brings an end to my presentation. I hope you liked it. You can connect with me on my Twitter or on my GitHub. And if you have any queries with regard to MediaPipe, or to integrating MediaPipe with React, I'll be more than happy to help you out. I hope that everyone has a great React Advanced. I really loved being a part of it, and hopefully next year, whenever it takes place, I'll meet everyone in the real world. So thank you so much. With that, I sign off.