Video Summary and Transcription
This Talk discusses the challenges of video editing in the browser and the limitations of existing tools. It explores image compression techniques, including Fourier transform and Huffman encoding, to reduce file sizes. The video codec and frame decoding process are explained, highlighting the importance of keyframes and delta frames. The performance bottleneck is identified as the codec, and the need for specialized hardware for efficient video editing is emphasized. The Talk concludes with a call to create a simplified API for video editing in the browser and the potential for AI-powered video editing.
1. Introduction to Video Editing in the Browser
Hey, everyone. Today, I want to talk about video editing in the browser. I spent a lot of time doing video editing during the pandemic. However, I realized that the existing tools didn't have the AI advancements I needed. I wanted to remove the green screen and shadows, and cut based on spoken words. On the other hand, I saw exciting developments in JavaScript, such as WebCodecs, TensorFlow.js, and Whisper. This talk will explain why I couldn't fully achieve a good video editor powered by AI. Let's start with thinking about making a video.
Hey, everyone. My name is Christopher Chedeau, also known as Vjeux on the Internet. And I've done a few things for the React community. I co-created React Native, Prettier, Excalidraw, CSS in JS, but today I want to talk about something different. I want to talk about video editing in the browser.
So during the pandemic, I spent a lot of time doing video editing. And I was even thinking maybe I should go like become a YouTuber full-time. But then I realized that with this number of views, I should probably keep my job as a software engineer for a bit longer.
So what does it mean to edit videos? So I used a tool called Final Cut Pro. And I felt that it was built like many, many years ago and didn't have all of the AI advancements that we've seen recently. So for example, I bought a $20 green screen. And I need to pick the green color and the range in order to remove it. And as you can see, there's some shadows behind me in the picture. And it wasn't properly removed. Then in order to cut, I want to know what am I actually saying to know which part I should be cutting. But I only got the sound waves and not the actual words spoken. On the other side, I was looking at the JavaScript, like the browser news, and I saw a lot of super exciting stuff happening. So we can start doing encoding and decoding with WebCodecs. TensorFlow.js lets you remove the background from the video. And then, Whisper is letting you take what I'm saying into actual words. So we had seemingly all of the building blocks in order to be able to do a really good video editor powered by AI, but unfortunately, I wasn't able to get all the way there. And this talk is going to be the story of why.
So usually when I walk into some new project like this, there are some things that I think are true and that I'm going to base everything I'm doing upon. But there were three things in this case that were not true. So the first one is that time only travels forward. The second is that when you encode one frame, you're getting one frame back. And finally that WASM is faster than JavaScript for video decoding. So if you want to know why these are not true, buckle up. We're getting to it. So let's start with thinking about making a video.
2. Video Editing API and Image Compression
Unfortunately, the desired API for video editing in the browser is not possible due to the large file sizes involved. A single image of a thousand by thousand pixels can already be around four megabytes in size. With 60 frames per second, a one-second video would be around 200 megabytes. This is too big for current browsers and computers. However, image compression techniques have been developed to address this issue, which will be discussed in the following minutes.
And unfortunately I cannot be here in person today, so what I decided to do was to bring some of the sunny California to Amsterdam. And for this I put a palm tree in all of the pictures. So in this case, we have React summit in the background and then moving to the foreground and the palm tree fading away. So what would be the API that I would expect to be able to do that? So I initially wanted a load video kind of API. That takes a file path and returns me a list of images. And then I'm going to modify the images, remove the background, like cut and paste and a bunch of stuff. And then like a save video that would take the file path and render. And like a list of images and like actually save the video.
So unfortunately, this API cannot exist. So let's see why. So let's go into one image of this whole video. And not too big, not too small. Like a thousand by thousand image. And how large is it actually to represent this? So it's going to be one thousand by one thousand pixels, about a million of them. And then there's red, green and blue for each one. And so we are at about four megabytes in size. And this is just for one image. Now, if you want 60 fps, for one second, you're going to be at more than 200 megabytes for every single second. So this talk right now is around 20 minutes. So this is going to be big. And this is actually going to be too big for the browser or any computer right now.
And what do we do? So fortunately, a lot of very smart people have worked on this for years. And what they built is a shrinking machine. Well, not exactly. What people have been doing is image compression. And so I'm going to talk for the next few minutes about different types of image compression. And not just because I find it interesting, which I do, but because they actually have a big influence on the actual API used for video encoding. So let's see the main ideas around video encoding. Sorry, about image compression.
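The arithmetic above can be checked with a small sketch. It assumes 4 bytes per pixel, an assumption chosen to match the "about four megabytes" figure in the talk; the function and variable names are hypothetical.

```typescript
// Raw (uncompressed) size estimate. The 4 bytes per pixel default is an
// assumption chosen to match the talk's "about four megabytes" per frame.
function rawFrameBytes(width: number, height: number, bytesPerPixel: number = 4): number {
  return width * height * bytesPerPixel;
}

function rawVideoBytes(width: number, height: number, fps: number, seconds: number): number {
  return rawFrameBytes(width, height) * fps * seconds;
}

const oneFrameBytes = rawFrameBytes(1000, 1000);                  // 4,000,000 bytes
const oneSecondBytes = rawVideoBytes(1000, 1000, 60, 1);          // 240,000,000 bytes
const twentyMinuteBytes = rawVideoBytes(1000, 1000, 60, 20 * 60); // 288,000,000,000 bytes

console.log(oneFrameBytes / 1e6, "MB per frame");
console.log(oneSecondBytes / 1e6, "MB per second");
console.log(twentyMinuteBytes / 1e9, "GB for a 20 minute talk");
```

Under these assumptions one second comes to 240 MB, which the talk rounds to "like 200 megabytes", and a 20 minute video would need roughly 288 GB uncompressed.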
3. Image Compression Techniques
In video frames with only two colors, we can use run-length encoding to reduce the size. However, this technique is not effective for images with varying pixel colors. We will explore other techniques to address this issue.
So if you look at this one frame, one thing that we can see is that there's only two colors being used and there's a lot of white and dark pixels. And so instead of storing seven dark pixels in a row as red, green, blue, red, green, blue, red, green, blue, what we can do is to start by writing one byte, which is the number of pixels, and then the red, green, blue once, and then we can basically repeat it like this. And so this is a technique called run-length encoding. So now this technique in itself is very useful for images with only two colors, but if you take a picture with your camera, you're never going to see two pixels next to each other with the exact same color. And we're going to see next how to help with this, but keep in mind that this is a building block for compressing images that's going to be used in the pipeline.
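A minimal sketch of run-length encoding as described above: one count byte followed by one red, green, blue triple per run. The byte layout and the function names are assumptions made for illustration, not the format of any particular image codec.

```typescript
// Encodes a flat RGB byte array as repeated [count, r, g, b] records.
// A run is capped at 255 pixels so the count always fits in one byte.
function rleEncode(rgb: Uint8Array): Uint8Array {
  const out: number[] = [];
  let i = 0;
  while (i < rgb.length) {
    const r = rgb[i], g = rgb[i + 1], b = rgb[i + 2];
    let count = 1;
    let j = i + 3;
    while (j < rgb.length && count < 255 &&
           rgb[j] === r && rgb[j + 1] === g && rgb[j + 2] === b) {
      count++;
      j += 3;
    }
    out.push(count, r, g, b);
    i = j;
  }
  return Uint8Array.from(out);
}

// Expands [count, r, g, b] records back into a flat RGB byte array.
function rleDecode(encoded: Uint8Array): Uint8Array {
  const out: number[] = [];
  for (let i = 0; i < encoded.length; i += 4) {
    const count = encoded[i];
    for (let n = 0; n < count; n++) {
      out.push(encoded[i + 1], encoded[i + 2], encoded[i + 3]);
    }
  }
  return Uint8Array.from(out);
}

// Seven dark pixels followed by three white pixels: 30 raw bytes.
const rawRow = Uint8Array.from([
  ...new Array<number>(7 * 3).fill(0),
  ...new Array<number>(3 * 3).fill(255),
]);
const encodedRow = rleEncode(rawRow);
console.log(rawRow.length, "bytes raw,", encodedRow.length, "bytes encoded");
```

For a camera picture where neighbouring pixels almost never match exactly, every run has length one and this layout would make the data larger, which is the limitation mentioned above.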
4. Image Compression Techniques Continued
We can use a Fourier transform to decompose the image into sinusoidal functions and reduce the information needed to encode it. Another technique is Huffman encoding, which remaps patterns of 0s and 1s to compress the image. These building blocks, along with others, enable a 10x reduction in image size. However, for videos, we can further improve compression by encoding only the differences between consecutive frames and predicting the next frame.
So now the other strategy I'm going to talk about, you need to have some imagination for this. So in this case, we're going to think of the image not as a series of pixels but as a continuous function. And one of the things we can do with a continuous function is to run a Fourier transform on top of this, that's going to decompose this continuous function into an infinite sum of sinusoidal functions like sin and cos with some variation. So why would you want that? How is it useful? So what you can see in the illustration is the first few sinusoidal functions, they actually end up being very close to the final function and then the more you go down into those sinusoidal functions, the less information that they have.
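As a rough sketch of this decomposition, the code below uses a discrete cosine transform, a close relative of the Fourier transform and the variant JPEG uses, on a one-dimensional row of samples rather than a full image. It then rebuilds the row from only the first few terms. The function names and the sample signal are assumptions made for illustration.

```typescript
// Forward transform (DCT-II): one coefficient per cosine frequency.
function dct(samples: number[]): number[] {
  const N = samples.length;
  const coeffs: number[] = [];
  for (let k = 0; k < N; k++) {
    let sum = 0;
    for (let n = 0; n < N; n++) {
      sum += samples[n] * Math.cos((Math.PI / N) * (n + 0.5) * k);
    }
    coeffs.push(sum);
  }
  return coeffs;
}

// Inverse transform that only uses the first `keep` coefficients.
function inverseDct(coeffs: number[], keep: number = coeffs.length): number[] {
  const N = coeffs.length;
  const used = Math.min(keep, N);
  const samples: number[] = [];
  for (let n = 0; n < N; n++) {
    let sum = used > 0 ? coeffs[0] / 2 : 0;
    for (let k = 1; k < used; k++) {
      sum += coeffs[k] * Math.cos((Math.PI / N) * (n + 0.5) * k);
    }
    samples.push((2 / N) * sum);
  }
  return samples;
}

function maxError(a: number[], b: number[]): number {
  return a.reduce((worst, value, i) => Math.max(worst, Math.abs(value - b[i])), 0);
}

// A smooth 16-sample row built from exactly three cosine terms.
const smoothRow: number[] = [];
for (let n = 0; n < 16; n++) {
  const t = (Math.PI / 16) * (n + 0.5);
  smoothRow.push(10 + 5 * Math.cos(t) + 2 * Math.cos(2 * t));
}

const rowCoeffs = dct(smoothRow);
const fromThreeTerms = inverseDct(rowCoeffs, 3); // 3 of 16 coefficients kept
const fromOneTerm = inverseDct(rowCoeffs, 1);    // only the average survives

console.log("error keeping 3 terms:", maxError(smoothRow, fromThreeTerms));
console.log("error keeping 1 term:", maxError(smoothRow, fromOneTerm));
```

The sample row is constructed so that three terms describe it completely, which is why dropping the other thirteen costs nothing here. A real photograph is not that tidy, so dropping high-frequency terms loses some fine detail.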
In practice, if you just keep the first few and re-encode it back, you're going to get a very close image but you lose a lot of the details that you may not be able to perceive. Now, every single one of those is taking roughly the same amount of bits to encode, so by doing this you're able to reduce the information that you have to encode in order to compress the image. And the third technique I'm going to talk about... You also need to think about the image in a different way, in this case as a series of 0s and 1s. And so one of the things we can start observing is that some patterns are emerging.
So for example, the sequence 0 1 0 1 is there 15 times. Then the sequence 1 1 0 is only there like 7 times and then you keep going and at some point some of them are only going to be there once. And the idea behind the compression is you can do a remapping. You can remap 0 1 0 1 to the bit 0, then you can remap 1 1 0 to the 2 bits 1 0, and then you keep going and at some point, because you mapped a bunch of things to smaller things, some things would need to be mapped to bigger things. But if you look at the entire sequence of 0s and 1s, it's going to compress using this technique, even if you also add the mapping table. So this is called Huffman encoding.
So these are three building blocks in order to compress the image. And what is the result of this? We are able to get a 10x reduction in the size of the image. So, going from like 4 MB, we're at about 400K. And the name of this step is image compression. And the most popular image compression formats out there are JPEG, WebP, PNG. And they use all of these building blocks and a few more in order to compress the image.
So we've made massive progress into getting the image to be smaller, but it's still like 20 MB per second for our video. So this is still too much. So what else can we do? So for this, we can think of our video as a series of images. But now what you can see is that all of the images next to each other are actually very, very close to each other. And so there's probably something we can do about it. So the first idea is we're going to do a diff of the image before and the image after, and encode only the diff using those strategies from before. And so this is working and this is giving better results, but we can do even better. What we can do is to start predicting what the next image is going to be.
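The first idea, storing only the diff between two consecutive images, can be sketched as follows. This is a plain per-pixel difference on tiny grayscale frames; it does not attempt the prediction step, and the frame data and function names are assumptions made for illustration.

```typescript
// Per-pixel difference between two frames of the same size.
// Int16Array is used because a difference can range from -255 to 255.
function diffFrame(previous: Uint8Array, current: Uint8Array): Int16Array {
  if (previous.length !== current.length) {
    throw new Error("frames must have the same size");
  }
  const delta = new Int16Array(current.length);
  for (let i = 0; i < current.length; i++) {
    delta[i] = current[i] - previous[i];
  }
  return delta;
}

// Rebuilds the current frame from the previous frame plus the delta.
function applyDiff(previous: Uint8Array, delta: Int16Array): Uint8Array {
  const current = new Uint8Array(previous.length);
  for (let i = 0; i < previous.length; i++) {
    current[i] = previous[i] + delta[i];
  }
  return current;
}

function countChanged(delta: Int16Array): number {
  let changed = 0;
  for (let i = 0; i < delta.length; i++) {
    if (delta[i] !== 0) changed++;
  }
  return changed;
}

// One bright pixel moves one position to the right between two frames.
const frameA = Uint8Array.from([0, 0, 255, 0, 0, 0, 0, 0]);
const frameB = Uint8Array.from([0, 0, 0, 255, 0, 0, 0, 0]);

const deltaAB = diffFrame(frameA, frameB);
const rebuiltB = applyDiff(frameA, deltaAB);
const changedCount = countChanged(deltaAB);

console.log(changedCount, "of", frameB.length, "pixels changed");
```

Most of the delta is zeros, so the earlier building blocks such as run-length encoding compress it well. Predicting the motion first, as the talk goes on to describe, would shrink the delta further.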
5. Video Codec and Frame Decoding
The video codec reduces the size of the frames drastically. There are two types of frames: keyframes and delta frames. To decode the video, a stateful API is required, and frames need to be sent in a specific order. Bidirectional frames (B-frames) improve compression further by predicting from both earlier and later frames, which means frames are decoded in a different order than they are shown. This introduces two notions of time: presentation time and decoding timestamp. The API should include a load video API and a decoder API.
And in this case, the palm tree goes from the top left to the bottom right. And so you can start predicting where the next image is going to be and then do the delta based on this prediction. And so this step is called a video codec. And the most popular video codecs are H.264 and AVC, which are the same thing but with a different name, like JavaScript and ECMAScript. And there's also AV1, VP8, VP9.
So this video codec is able again to reduce the size drastically. So in this case, this is what our setup looks like. So we now have two types of frames. We have keyframes, so in this case, the first frame, which is using something like JPEG to compress it, and then we have delta frames. So in this case, like every one in this picture.
And now, in order to decode the video, it's no longer, oh, just give me one image and I can do it. Now you need to start with the keyframe. And then, in order to decode the second one, you need to have decoded the keyframe, do the prediction of where it's going to be next, and then apply the delta in order to decode it. So now we are seeing that we need a stateful API that is called in a specific order.
But this is only one part of the picture, because the people doing video encoding and compression wanted to do even better. One thing that they realized is that you can do this optimization going forward. But you can also do it backwards. So you can start from the end, do the prediction, the encoding, and then look at which direction gives the most savings, and take the one that is actually going to be the smallest overall. And so this is where the notion of bidirectional frames or B-frames comes in. So in this case, the frame number 5 is a B-frame, which means in order to decode it, you need to decode the number 4 and the number 6. And also to decode the number 6, you need the 7, and for 7 you need 8 and for 8 you need 9, and same in the other way.
And so now what you're seeing is in order to decode the video sequence in order, you need to send all of the frames in a different order. And this is where we have two notions of time. So we have the presentation time, which is the one that you expect to see in the duration of the movie, and then the second one is the decoding timestamp. And so this is the timestamp at which you need to send the frames in order to be decoded in the right order. And so this is where our first truth, that time only goes forward, turns out not to be true.
So now that we've seen the first broken assumption, let's go back to the API, actually the real API. So in this case, we need to have some kind of load video API to give us all the frames. And then we want a decoder API.
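The gap between presentation order and decoding order can be sketched by listing each frame's dependencies and visiting dependencies first. The nine-frame layout follows the example above, with the extra assumption that frame 9 is a keyframe; the `FrameInfo` shape and `decodeOrder` function are hypothetical and not part of any browser API.

```typescript
interface FrameInfo {
  pts: number;          // presentation position (1 = first frame shown)
  dependsOn: number[];  // presentation positions that must be decoded first
}

// Returns the presentation positions in an order where every frame comes
// after all of the frames it depends on (a depth-first walk).
function decodeOrder(frameList: FrameInfo[]): number[] {
  const byPts = new Map<number, FrameInfo>();
  for (const info of frameList) byPts.set(info.pts, info);

  const visited = new Set<number>();
  const order: number[] = [];

  const visit = (pts: number): void => {
    if (visited.has(pts)) return;
    visited.add(pts);
    const info = byPts.get(pts);
    if (!info) throw new Error("missing frame " + pts);
    for (const dep of info.dependsOn) visit(dep);
    order.push(pts);
  };

  const inPresentationOrder = frameList.slice().sort((a, b) => a.pts - b.pts);
  for (const info of inPresentationOrder) visit(info.pts);
  return order;
}

const exampleFrames: FrameInfo[] = [
  { pts: 1, dependsOn: [] },     // keyframe
  { pts: 2, dependsOn: [1] },
  { pts: 3, dependsOn: [2] },
  { pts: 4, dependsOn: [3] },
  { pts: 5, dependsOn: [4, 6] }, // B-frame: needs a frame on each side
  { pts: 6, dependsOn: [7] },
  { pts: 7, dependsOn: [8] },
  { pts: 8, dependsOn: [9] },
  { pts: 9, dependsOn: [] },     // assumed to be a keyframe
];

const exampleDecodeOrder = decodeOrder(exampleFrames);
console.log(exampleDecodeOrder.join(", "));
```

With this layout the frames are shown as 1 through 9, but frame 5 is the last one handed to the decoder. The position in that second sequence is what the decoding timestamp records.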
6. Video Decoding and Performance
The WebCodecs API provides a video decoder with options, including a callback on decoded frames. Decoding one frame does not always return one frame: the callback may fire zero times or several times for a single call. The load video API requires storing metadata, such as frame time, types, timestamps, and dependencies. Video containers like mp4, mov, avi, and mkv hold the frames for the codec. The performance bottleneck is the codec, which handles image compression, decompression, predictions, and encodings.
And so in this case, WebCodecs, so the browser, is giving us a video decoder with a bunch of options, including one which is a callback on decoded frame. And so now you need to do decoder.decode, send it the first frame, and then it's going to process, and at some point it's going to give us a callback with the first image. And then we do it with number two, number three, number four, and we're getting them in order. But now, what happens for our B-frames? So now what we need to do is to send the frame number nine, and then the frame number eight, and then the frames seven and six. But our callback is not going to be called. It's as if nothing happens, but in practice, a bunch of things are happening behind the scenes. And only when we send in the frame number five, now it's going to do the whole chain again and going to call our callback with all the frames in the right order for us to be able to use.
And so this is where the truth number two is a lie. So if you're decoding one frame, you're not getting one frame back. So in practice, you may get zero or you may get five based on how the encoding has been done. And so this is very mind bending because with all of the APIs I can think of, even the asynchronous ones, when you call something it's going to give it to you back after some time, but it's never like you're getting one or ten or zero, and in a very unpredictable way.
So now that we've seen another truth, let's go even deeper. So let's think about this load video API that I talked about. So how would you encode all of this information? So now there's a lot of metadata that we need. So we need to have like, hey, how much time is a frame? What is the list of frames? What are their types? What are their timestamps? What are the dependencies between them? So a lot of metadata that needs to be stored. And so if we were to do it today, you would probably implement it in JSON. But all of those file formats were written like 20 years ago.
And so they're all in binary, but the idea is the same. So what is this step called? So what is this thing called? This is called a video container. And so in practice, the four most known ones are mp4, mov, avi, and mkv. And they all use different encodings and different ways to represent this, but they all have very similar information, which is about being a container for then calling the codec. And this step of reading these file formats and then sending all of the frames in the right order to the codec is called demuxing. So now it's called demuxing.
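To make the last two ideas concrete, here is a sketch of container metadata written the JSON-like way mentioned above, a demux step reduced to sorting by decoding timestamp, and a toy decoder whose callback fires zero or several times per call. Everything here is hypothetical: `SampleInfo`, `demux` and `ToyDecoder` are illustrations, and `ToyDecoder` only imitates the one-in, zero-or-many-out behaviour. It is not the WebCodecs `VideoDecoder` and decodes no pixels.

```typescript
// Hypothetical per-frame metadata; real containers store similar
// information in binary.
interface SampleInfo {
  pts: number;            // presentation timestamp (order on screen)
  dts: number;            // decoding timestamp (order sent to the codec)
  type: "key" | "delta";
}

// Demuxing reduced to its ordering role: hand samples over by dts.
function demux(samples: SampleInfo[]): SampleInfo[] {
  return samples.slice().sort((a, b) => a.dts - b.dts);
}

// Toy stand-in for a decoder. It emits frames in presentation order and
// holds back anything that cannot be shown yet.
class ToyDecoder {
  private received = new Set<number>();
  private nextPts = 1;
  private output: (pts: number) => void;

  constructor(output: (pts: number) => void) {
    this.output = output;
  }

  decode(sample: SampleInfo): void {
    this.received.add(sample.pts);
    while (this.received.has(this.nextPts)) {
      this.output(this.nextPts);
      this.nextPts++;
    }
  }
}

// Stored in presentation order, as a player would show them.
const containerSamples: SampleInfo[] = [
  { pts: 1, dts: 1, type: "key" },
  { pts: 2, dts: 2, type: "delta" },
  { pts: 3, dts: 3, type: "delta" },
  { pts: 4, dts: 4, type: "delta" },
  { pts: 5, dts: 9, type: "delta" }, // B-frame, sent to the decoder last
  { pts: 6, dts: 8, type: "delta" },
  { pts: 7, dts: 7, type: "delta" },
  { pts: 8, dts: 6, type: "delta" },
  { pts: 9, dts: 5, type: "key" },
];

const emittedPts: number[] = [];
const outputsPerCall: number[] = [];
const toyDecoder = new ToyDecoder((pts) => { emittedPts.push(pts); });

for (const sample of demux(containerSamples)) {
  const before = emittedPts.length;
  toyDecoder.decode(sample);
  outputsPerCall.push(emittedPts.length - before);
}

console.log("outputs per decode call:", outputsPerCall.join(", "));
console.log("frames in output order:", emittedPts.join(", "));
```

In this sketch the four calls for frames 9, 8, 7 and 6 produce nothing, and the single call for frame 5 produces five frames at once, which is the behaviour the talk describes.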
So let's talk a bit about performance. So what takes time in this whole thing? So in practice, the codec is the part that takes the most time. And to refresh your memory, the codec is doing all of the image compression, decompression, all of the predictions, all of the deltas, all of these encodings. And one of the ways to think about it is just to look at the size of things. So the binary data for each frame is in the tens to hundreds of kilobytes. But then the actual metadata for each frame is like tens of characters. And so you can see a very big difference.
7. Video Editing Challenges and Call to Action
Video editing in the browser is complex and time-consuming, requiring specialized hardware for performance. WebCodecs allows for the use of JavaScript to leverage hardware-accelerated functionalities. However, using WebAssembly (Wasm) for file reading and decoding poses challenges due to memory copying. Wasm is not faster than raw JavaScript for video decoding, but it can be useful for code reuse. While a fully functional video editor with AI capabilities is not yet available, it is possible to decode and re-encode video files in the browser with good performance. The call to action is to create a jQuery-like API for video editing in the browser, simplifying the process and enabling AI-powered video editing.
And this is so time consuming and complicated and needs to have so much performance that now there are specialized hardware units next to the CPU or the GPU that do all the operations I mentioned, like the Fast Fourier Transform, the Huffman encoding, and all those kinds of things in the hardware. And the reason why it's in the hardware is because just doing it on the CPU normally, even with the most hand-optimized C++ code, was not fast enough. And so now if you want to use Wasm, you would not only have something not as fast because it's running on the CPU, but also some overhead for Wasm. And so doing it that way is going to be slower. And this is where WebCodecs is very exciting: we're now able to use a JavaScript API, send all of this binary data, and the WebCodecs API is going to be using all of those hardware-accelerated functionalities. So this is exciting.
Now, this is only one part of the equation. The second part is we need to read this file, this binary file, doing this demuxing. Could we use Wasm for it? So again, the story is a bit more complicated. So the way Wasm works is it creates a new memory heap, like a memory space. And in order to call into it and to have it do its job, you need to copy all of the information to this new space, then it does its work and then it copies the result and gives it back to you. And here we're talking about hundreds of kilobytes, even megabytes of information, and doing a copy for not a lot of decoding work, and then copying it back and then sending it to the WebCodecs API. And so doing this copy actually nullifies any of the wins you may have from using Wasm, which is faster. So this is where our third truth becomes a lie: Wasm is not faster than using raw JavaScript for video decoding. Now one caveat I'm going to add is that in practice the demuxing is not part of the WebCodecs API. It's not part of the browser. So you need to do it in userland.
And there are so many C++ libraries for demuxing that have been written over the years. And so for code reuse it's actually a legit way to use Wasm for this. But it's actually not for performance reasons.
So now that we've debunked these three myths, where are we at? And so in practice I wish I ended the talk with like, hey, you can use this video editor with all of its functionalities. I'm not quite there yet. So what I've been able to do is to decode an entire video file and re-encode it without doing anything. And in the browser. The good news is that, one, it's actually possible and it works. And the second one is it's actually fast. So because we're using the hardware-accelerated features, it's as fast as using Final Cut Pro, but in the browser. So the perf is there. The capability is there. But the issue is that actually doing it takes hundreds of lines of in-the-weeds code that is very hard to debug and requires understanding all the things I've talked to you about so far.
And so this is where my call to action to every one of you is, we need to have a jQuery of video editing in the browser. We need to clean up the API and package it in a way that makes a video editor with AI possible. So, are you going to be the one to build it? This is my call to action.