Video Summary and Transcription
The talk delves into the intricacies of handling large volumes of robot data. The first step involves data extraction by joining observations based on the nearest timestamp. Feature engineering is crucial for analyzing what happens before a failure. Due to the large volume of data, Kubernetes and Pachyderm are used for data versioning and management. Kubernetes clusters are surprisingly affordable, making scalable data collection and processing feasible. AI at the robot fleet level unlocks new opportunities. The 'future' package in R simplifies parallelizing code, ensuring efficient data handling. InOrbit has accumulated over 3.8 million hours of robot data in the last 12 months.
1. Introduction to Robot Data Analysis
Hello. My name is Frans van Dunné, Chief Data Officer at ixpantia. Together with Florian Pistoni, the CEO of InOrbit, we address the challenges of managing and analyzing the increasing volume of data gathered from robots. InOrbit has accumulated over 3.8 million hours of robot data in the last 12 months alone. One of the main challenges is the data extraction process, where we need to join observations based on the nearest timestamp and perform feature engineering. We analyze what happens before a failure by looking back a certain time period for each observation.
Hello. Thank you very much for your interest in this talk. My name is Frans van Dunné. I'm the Chief Data Officer at ixpantia, and we help organisations boost their data analytics capacity.
This talk was prepared together with Florian Pistoni. Florian is the co-founder and CEO of InOrbit, a cloud robot management platform that helps companies take robots from development into day-to-day operations. With the increase in robot usage, especially during COVID, robots are working alongside humans and autonomously, resulting in a significant increase in the volume of data gathered from robots. InOrbit has accumulated over 3.8 million hours of robot data in the last 12 months alone, and they continue to grow rapidly, adding a year's worth of data every day.
One of the main challenges we encountered is that InOrbit offers its services to any fleet, and we have no control over how the data is gathered and sent to the central service. In one of the proof-of-concepts (POCs) we conducted, we faced the issue of many robots sending millions of files, with each file containing data from multiple sources. These sources, such as the robot's own software or the on-robot agent, had their own timestamps and were not directly related. The first step we took was data extraction, which proved to be more complex than expected. We needed to join observations based on the nearest timestamp and perform feature engineering on top of that.
Nearest-time joining involved finding an interval in which we could join the different signals about mission status, localization, and speed into a single observation. Once we had one observation per time unit, we focused on feature engineering. We wanted to analyze what happened before a failure, looking back a certain time period for each observation.
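As a sketch of what a nearest-timestamp join can look like in R (the table and column names here are invented for illustration, not taken from the talk), data.table's rolling join with `roll = "nearest"` snaps each signal to the closest observation time:

```r
library(data.table)

# Two hypothetical signal streams with unaligned timestamps
loc <- data.table(ts = c(1.00, 2.05, 3.10), x = c(0.1, 0.4, 0.9))
spd <- data.table(ts = c(0.98, 2.10, 3.05), speed = c(0.5, 0.7, 0.6))
setkey(loc, ts)
setkey(spd, ts)

# For each localization row, pull the speed reading with the nearest timestamp,
# producing one joined observation per localization time
joined <- spd[loc, roll = "nearest"]
print(joined)  # speed values matched to loc rows: 0.5, 0.7, 0.6
```

The same pattern extends to any number of signal streams: pick one stream as the time base and roll every other stream onto it.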
2. Data Extraction and Feature Engineering
The first step was data extraction, which involved joining observations by nearest timestamp. We then performed feature engineering and analyzed what happened before a failure. Due to the large volume of data, we decided to use Kubernetes and Pachyderm, an open-source product that offers a versioned data pipeline. This allows for easier data management and automatic updates in the pipeline.
They were not directly related. So the first step that we needed to do was data extraction. And this was a little more complex than we expected, especially because we needed to join observations by nearest timestamp. I'll expand on that in a moment.
And then do the feature engineering on top of that. So what I mean by nearest-time joining: we have different signals about mission status, localization, and speed, and we need to find an interval in which we can join each and every one of those signals into a single observation. We worked out how that could be done, and then we needed to start the feature engineering. Once we have one line, one observation per time unit, we wanted to look back at what happened before a failure. Say there's a failure right here; if we go back, say, 42 seconds, then we need to do that for each and every observation. Doing that while accounting for all the cases where we couldn't include a datum, for instance when there was another failure within the 42-second time frame, was absolutely possible. But then we were faced with an enormous volume, where our local computer simply said: no, this is not going to be possible. So we immediately thought about farming this out to Kubernetes.
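The look-back step described above can be sketched in base R. The data, the failure times, and the helper names here are all invented; only the 42-second window comes from the talk. The sketch summarises the window before each failure and discards windows that contain another failure:

```r
set.seed(1)
# One observation per time unit (hypothetical data)
obs <- data.frame(ts = 1:100, speed = runif(100))
failures <- c(50, 80)  # hypothetical failure timestamps
window <- 42           # seconds to look back, as in the talk

# For each failure, collect the preceding window of observations,
# dropping any window that overlaps an earlier failure
feature_rows <- lapply(failures, function(f) {
  win <- obs[obs$ts >= f - window & obs$ts < f, ]
  if (any(win$ts %in% setdiff(failures, f))) return(NULL)  # tainted window
  data.frame(failure_ts = f, mean_speed = mean(win$speed), n_obs = nrow(win))
})
features <- do.call(rbind, feature_rows)
# The failure at t = 80 is dropped: the failure at t = 50 falls inside its window
```

Doing this per observation across millions of files is what made the job too big for a single machine.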
We set up a bucket for the incoming data. We packaged it into one-day archives of gzipped JSON to make it a little more workable and to be able to transport the data with more ease. We then farmed out the full data extraction to Kubernetes, wrote the intermediate result to a second bucket, farmed out the feature engineering, and had the result ready for analysis.
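Reading gzipped JSON back into R is straightforward with the jsonlite package; `gzfile()` handles decompression transparently. The file contents here are a made-up stand-in for one day's packaged data:

```r
library(jsonlite)

# Write a tiny gzipped NDJSON file as a stand-in for one day's package
tmp <- tempfile(fileext = ".json.gz")
con <- gzfile(tmp, "w")
writeLines(c('{"ts": 1, "speed": 0.5}',
             '{"ts": 2, "speed": 0.7}'), con)
close(con)

# Stream the records back in as a data frame
robot_data <- jsonlite::stream_in(gzfile(tmp), verbose = FALSE)
nrow(robot_data)  # 2
```

Newline-delimited JSON (one record per line) streams well, which matters when a single day's archive holds millions of records.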
What we found is that it is much easier to get the help of something called Pachyderm. Pachyderm is a product from a company of the same name; they have an open-source version, which is what we use. And what we have there is not a bucket but a filing cabinet: a repository where we version the data coming in and version the data going out. Doing this kind of data pipeline with versioning means that if there is a change at any point in the pipeline, the rest of the pipeline will respond and update the data automatically. That prepares us to have all the heavy lifting ready once we bring this into production. Just a quick look at what this looks like: we create pipelines defined by configuration files, and the key thing here is that we can connect the data in the Pachyderm repository (PFS is the Pachyderm file system) to what we're running in our R script.
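A minimal Pachyderm pipeline spec of the kind described looks roughly like the following. The repo name, image, and script path are invented for illustration; the shape (a `pipeline` name, a `pfs` input with a glob, and a `transform` whose output goes to `/pfs/out`) follows Pachyderm's pipeline specification:

```json
{
  "pipeline": { "name": "feature-engineering" },
  "input": {
    "pfs": { "repo": "extracted-data", "glob": "/*" }
  },
  "transform": {
    "image": "example-org/r-features:latest",
    "cmd": ["Rscript", "/scripts/features.R", "/pfs/extracted-data", "/pfs/out"]
  }
}
```

Whenever a new datum lands in the `extracted-data` repo, Pachyderm re-runs only the affected work and versions the output, which is what gives the automatic downstream updates mentioned above.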
3. Parallelization and Scalability
Our parallelized R scripts were not enough on their own, so we farmed out the connection and data preparation. Monitoring the data on a granular level was great. Parallelizing R code is easy with the 'future' package. Working with Kubernetes made the transition to a massively parallel pipeline easier. Large clusters are surprisingly affordable, allowing for scalable data collection and processing. Using AI at the robot fleet level unlocks new opportunities. Visit ixpantia.com or inorbit.ai for more information.
Our R scripts were already parallelized, but parallelizing alone was not enough. We were now able to farm this out, making that connection and setting up whatever data preparation we were doing next. And for each datum (the Pachyderm term for a unit of data in the pipeline) we can see whether it was processed successfully or whether it failed. This is a screen from Patrick Santamaria, who did most of this work.
So being able to monitor on that level what is happening with your data was, in practice, absolutely great. Parallelizing R code is easy: there is the future package, and packages built on it, and with them it's very easy to parallelize. Going from parallel R code to a massively parallel pipeline is doable; for us it was much easier to do that through Kubernetes.
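A minimal sketch of that kind of parallelization, using future.apply (one of the packages built on future); the file names and the per-file function are placeholders, not from the talk:

```r
library(future.apply)  # parallel lapply variants built on the future package

plan(multisession, workers = 4)  # four background R sessions

# Hypothetical per-file extraction step; real work would parse and join signals
process_file <- function(path) nchar(path)

files <- c("day1.json.gz", "day2.json.gz", "day3.json.gz")
results <- future_lapply(files, process_file)

plan(sequential)  # shut the workers down
```

The appeal of future is that the same code runs unchanged whether `plan()` points at local cores or at a much larger backend, which eases the move to a cluster.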
The other thing was that large clusters are surprisingly affordable. We worked on an 80-CPU, 520-gigabyte cluster for under $10 an hour, which was not something we had experienced before, and we now use this more and more in our work. For the team at InOrbit this also had huge implications, because as fleets grow, collecting and processing data becomes critical. They already understand how to scale the full platform; now they know that scaling up the analysis doesn't need to be hard either. And using AI at the robot fleet level unlocks many new opportunities that they are working hard on to continue to offer to their customers.
Thank you very much. I hope that this was useful. If you want any further information, please visit our websites, ixpantia.com or inorbit.ai. We both have blogs running and we like to write about this stuff, so we hope to be in touch sometime in the future.