Processing Robot Data at Scale with R and Kubernetes


Most people would agree that R is a popular language for data analysis. Perhaps less well known is that R has good support for parallel execution on a single machine through packages like future. In this presentation we will talk about our experience scaling R processes up even further, running R in parallel in Docker containers using Kubernetes. Robots generate massive amounts of sensor and other data; extracting the right information and insights from it requires significantly more processing than can be tackled in a single execution environment. Faced with a preprocessing job of several hundred GB of compressed JSON Lines files, we used Pachyderm to write data pipelines that run the data prep in parallel, using multicore containers on a Kubernetes cluster.
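As a rough illustration of the single-machine parallelism the future package provides (a minimal sketch, not code from the talk; the data/raw directory and the process_file() helper are hypothetical placeholders):

```r
# Minimal sketch of single-machine parallelism with the future ecosystem.
# The "data/raw" directory and process_file() are hypothetical placeholders.
library(future)
library(future.apply)

plan(multisession, workers = 4)   # four background R sessions

files <- list.files("data/raw", pattern = "\\.jsonl\\.gz$", full.names = TRUE)

process_file <- function(path) {
  # parse one compressed JSON Lines file; here we just count its records
  length(readLines(gzfile(path)))
}

# future_lapply() spreads the files over the workers transparently
record_counts <- future_lapply(files, process_file)
```

The same pattern carries over to the cluster setting described in the talk, where each container runs its own multicore R process.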


By the end of the talk we will have dispelled the myth that R cannot be used in production at scale. Even if you do not use R, you will have seen a use case for scaling up analysis, regardless of your language of choice.

This talk was presented at ML conf EU 2020; check out the latest edition of this tech conference.

FAQ

Frans van Dunné is the Chief Data Officer at Expantia, a company focused on enhancing the data analytics capacity of other organizations.

Inorbit is a cloud robot management platform co-founded by Florian Pistoni; it focuses on developing robot operations further so that robots can work autonomously and alongside humans.

Inorbit has accumulated over 3.8 million hours of robot data in the last 12 months.

Expantia faced challenges in data extraction due to the complex nature of the data, which involved multiple sources with their own timestamps. They needed to join observations by nearest timestamp and perform feature engineering.

Expantia used Kubernetes for data extraction and leveraged Pachyderm for data versioning and management, which helped automate updates in the data pipeline.

Expantia enhanced their data processing by setting up a massively parallel pipeline using Pachyderm and parallelizing R code to handle large data volumes efficiently.

Expantia used a large cluster with 80 CPUs and 520 gigabytes of memory for under $10 an hour, which proved to be surprisingly affordable and effective for their data analysis needs.

Using AI at the robot fleet level allows Inorbit to unlock new opportunities and enhance the capabilities of their platform as their fleets grow, ensuring efficient data collection and processing.

Frans van Dunné
8 min
02 Jul, 2021


Video Summary and Transcription

The talk delves into the intricacies of handling large volumes of robot data. The first step involves data extraction by joining observations based on the nearest timestamp. Feature engineering is crucial for analyzing what happens before a failure. Due to the large volume of data, Kubernetes and Pachyderm are used for data versioning and management. Kubernetes clusters are surprisingly affordable, making scalable data collection and processing feasible. AI at the robot fleet level unlocks new opportunities. The 'future' package in R simplifies parallelizing code, ensuring efficient data handling. Inorbit has accumulated over 3.8 million hours of robot data in the last 12 months.

1. Introduction to Robot Data Analysis

Short description:

Hello. My name is Frans van Dunné, Chief Data Officer at Expantia. Together with Florian Pistoni, the CEO of Inorbit, we address the challenges of managing and analyzing the increasing volume of data gathered from robots. Inorbit has accumulated over 3.8 million hours of robot data in the last 12 months alone. One of the main challenges is the data extraction process, where we need to join observations based on the nearest timestamp and perform feature engineering. We analyze what happens before a failure by looking back a certain time period for each observation.

Hello. Thank you very much for your interest in this talk. My name is Frans van Dunné. I'm the Chief Data Officer at Expantia, and we help organisations to boost their data analytics capacity.

This talk was prepared together with Florian Pistoni. Florian is the co-founder and CEO of Inorbit, a cloud robot management platform, where they are taking robot operations and developing them further. With the increase in robot usage, especially during COVID, robots are working alongside humans and autonomously, resulting in a significant increase in the volume of data gathered from robots. Inorbit has accumulated over 3.8 million hours of robot data in the last 12 months alone, and they continue to grow rapidly, adding a year's worth of data every day.

One of the main challenges we encountered is that Inorbit offers its services to any fleet, and we have no control over how the data is gathered and sent to the central service. In one of the proof-of-concepts (POCs) we conducted, we faced the issue of many robots sending millions of files, with each file containing data from multiple sources. These sources, such as robot or in-agent operations, had their own timestamps and were not directly related. The first step we took was data extraction, which proved to be more complex than expected. We needed to join observations based on the nearest timestamp and perform feature engineering on top of that.

Nearest-time joining involved finding an interval where we could join the different signals, about mission, localization and speed, to create a single observation. Once we had one observation per time unit, we focused on feature engineering. We wanted to analyze what happened before a failure, looking back a certain time period for each observation.
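The talk does not show the joining code itself; one common way to do a nearest-timestamp join in R is a data.table rolling join, sketched here with invented column names and values:

```r
# Sketch of a nearest-timestamp join with data.table (columns are invented).
library(data.table)

localization <- data.table(
  ts = as.POSIXct("2020-10-05 10:00:00") + c(0.2, 1.1, 2.0),
  x  = c(1.20, 1.25, 1.31)
)
speed <- data.table(
  ts    = as.POSIXct("2020-10-05 10:00:00") + c(0.4, 1.0, 2.3),
  speed = c(0.50, 0.70, 0.65)
)

# For every localization reading, attach the speed reading whose
# timestamp is closest to it, giving one combined observation per row
joined <- speed[localization, on = .(ts), roll = "nearest"]
```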

2. Data Extraction and Feature Engineering

Short description:

The first step was data extraction, which involved joining observations by nearest timestamp. We then performed feature engineering and analyzed what happened before a failure. Due to the large volume of data, we decided to use Kubernetes and Pachyderm, an open-source product that offers a versioned data pipeline. This allows for easier data management and automatic updates in the pipeline.

They were not directly related. So the first step that we needed to do was data extraction. And this was a little bit more complex than we expected, especially because we needed to join observations by nearest timestamp. I will highlight that a little bit one slide further on.

And then do the feature engineering on top of that. What I mean by nearest-time joining is that we have different signals, about mission, localization and speed, and we need to find an interval where we can join each and every one of those signals to have a single observation. We worked out how that could be done, and then we needed to start the feature engineering. So once we have one line, one observation per time unit, we wanted to look back. We wanted to look at what happened before a failure. There's a failure right here, and if we go back, say, 42 seconds, then we need to do that for each and every observation. Doing that, while taking into account all the cases where we couldn't include a datum (for instance, when there was a failure within the 42-second time frame), was absolutely possible. But then we were faced with an enormous volume where our local computer simply said: no, this is not going to be possible. So we immediately thought about farming this out to Kubernetes.
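As an illustration of that look-back step (a sketch only, with invented column names and data, not the production code), a data.table non-equi join can collect everything in the 42 seconds before each failure and drop the windows that contain another failure:

```r
# Sketch of the 42-second look-back features (invented columns and data).
library(data.table)

obs <- data.table(
  ts      = as.POSIXct("2020-10-05 10:00:00") + 0:299,  # one row per second
  speed   = runif(300),
  failure = FALSE
)
obs[c(120, 250), failure := TRUE]

failures <- obs[failure == TRUE, .(fail_ts = ts, win_start = ts - 42)]

# For each failure, aggregate the observations from the preceding 42 seconds
windows <- obs[failures,
               on = .(ts >= win_start, ts < fail_ts),
               .(fail_ts       = i.fail_ts,
                 mean_speed    = mean(speed),
                 other_failure = any(failure)),
               by = .EACHI]

# Keep only the windows without another failure inside them
features <- windows[other_failure == FALSE]
```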

We set up a bucket with the data that was going in. We packaged it in one-day zips, as gzipped JSON Lines files, to make it a little bit more workable and to be able to transport the data with more ease. We then farmed out the full data extraction to Kubernetes, set up a second bucket with the intermediate result, farmed out the feature engineering, and had the result ready for analysis.
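A rough sketch of what reading one of those day files can look like in R, assuming gzipped JSON Lines input (the file paths are illustrative only):

```r
# Sketch: read one day's gzipped JSON Lines file and write an intermediate
# result for the feature-engineering stage. Paths are illustrative only.
library(jsonlite)

# stream_in() parses JSON Lines record by record instead of loading the
# whole file as one JSON document
day <- stream_in(gzfile("raw/robot-data-2020-10-05.jsonl.gz"), verbose = FALSE)

write.csv(day,
          gzfile("intermediate/robot-data-2020-10-05.csv.gz"),
          row.names = FALSE)
```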

What we found is that it is much easier to get the help of something called Packyderm. So Packyderm is a product, it's a company. They have an open source version of this, which is what we use. And what we have there is not a bucket. What we have is a filing cabinet. We have a repository where we conversion the data that is coming in and version the data that is coming out. Doing this kind of data pipeline with versioning means if there is one change at any point in the pipeline, the rest of the pipeline will respond and update the data automatically. So that prepares us to do all the, to have all the heavy lifting ready once we bring this into production. Just a quick look at what this look like. We create pipelines that are very similar to your configuration files and the key thing here is that we can connect the data in the repository, in the Pachyderm repository, PFS is the Pachyderm file system, to what we're running in our R script.