Latest from MIT Tech Review – The robot race is fueling a fight for training data

Since ChatGPT was released, we’re interacting with AI tools more directly—and regularly—than ever before.

But interacting with robots, by way of contrast, is still a rarity for most. If you don’t undergo complex surgery or work in logistics, the most advanced robot you encounter in your daily life might still be a vacuum cleaner (if you’re feeling young, the first Roomba was released 22 years ago).

But that’s on the cusp of changing. Roboticists believe that by using new AI techniques, they will achieve something the field has pined after for decades: more capable robots that can move freely through unfamiliar environments and tackle challenges they’ve never seen before.

“It’s like being strapped to the front of a rocket,” says Russ Tedrake, vice president of robotics research at the Toyota Research Institute, says of the field’s pace right now. Tedrake says he has seen plenty of hype cycles rise and fall, but none like this one. “I’ve been in the field for 20-some years. This is different,” he says.

But something is slowing that rocket down: lack of access to the types of data used to train robots so they can interact more smoothly with the physical world. It’s far harder to come by than the data used to train the most advanced AI models like GPT—mostly text, images, and videos scraped off the internet. Simulation programs can help robots learn how to interact with places and objects, but the results still tend to fall prey to what’s known as the “sim-to-real gap,” or failures that arise when robots move from the simulation to the real world.

For now, we still need access to physical, real-world data to train robots. That data is relatively scarce and tends to require a lot more time, effort, and expensive equipment to collect. That scarcity is one of the main things currently holding progress in robotics back.

As a result, leading companies and labs are in fierce competition to find new and better ways to gather the data they need. It’s led them down strange paths, like using robotic arms to flip pancakes for hours on end, watching thousands of hours of graphic surgery videos pulled from YouTube, or deploying researchers to numerous Airbnbs in order to film every nook and cranny. Along the way, they’re running into the same sorts of privacy, ethics, and copyright issues as their counterparts in the world of chatbots.

The new need for data

For decades, robots were trained on specific tasks, like picking up a tennis ball or doing a somersault. While humans learn about the physical world through observation and trial and error, many robots were learning through equations and code. This method was slow, but even worse, it meant that robots couldn’t transfer skills from one task to a new one.

But now, AI advances are fast-tracking a shift that had already begun: letting robots teach themselves through data. Just as a language model can learn from a library’s worth of novels, robot models can be shown a few hundred demonstrations of a person washing ketchup off a plate using robotic grippers, for example, and then imitate the task without being taught explicitly what ketchup looks like or how to turn on the faucet. This approach is bringing faster progress and machines with much more general capabilities.

Now every leading company and lab is trying to enable robots to reason their way through new tasks using AI. Whether they succeed will hinge on whether researchers can find enough diverse types of data to fine-tune models for robots, as well as novel ways to use reinforcement learning to let them know when they’re right and when they’re wrong.

“A lot of people are scrambling to figure out what’s the next big data source,” says Pras Velagapudi, chief technology officer of Agility Robotics, which makes a humanoid robot that operates in warehouses for customers including Amazon. The answers to Velagapudi’s question will help define what tomorrow’s machines will excel at, and what roles they may fill in our homes and workplaces.

Prime training data

To understand how roboticists are shopping for data, picture a butcher shop. There are prime, expensive cuts ready to be cooked. There are the humble, everyday staples. And then there’s the case of trimmings and off-cuts lurking in the back, requiring a creative chef to make them into something delicious. They’re all usable, but they’re not all equal.

For a taste of what prime data looks like for robots, consider the methods adopted by the Toyota Research Institute (TRI). Amid a sprawling laboratory in Cambridge, Massachusetts, equipped with robotic arms, computers, and a random assortment of everyday objects like dustpans and egg whisks, researchers teach robots new tasks through teleoperation, creating what’s called demonstration data. A human might use a robotic arm to flip a pancake 300 times in an afternoon, for example.

The model processes that data overnight, and then often the robot can perform the task autonomously the next morning, TRI says. Since the demonstrations show many iterations of the same task, teleoperation creates rich, precisely labeled data that helps robots perform well in new tasks.

The trouble is, creating such data takes ages, and it’s also limited by the number of expensive robots you can afford. To create quality training data more cheaply and efficiently, Shuran Song, head of the Robotics and Embodied AI Lab at Stanford University, designed a device that can more nimbly be used with your hands, and built at a fraction of the cost. Essentially a lightweight plastic gripper, it can collect data while you use it for everyday activities like cracking an egg or setting the table. The data can then be used to train robots to mimic those tasks. Using simpler devices like this could fast-track the data collection process.

Open-source efforts

Roboticists have recently alighted upon another method for getting more teleoperation data: sharing what they’ve collected with each other, thus saving them the laborious process of creating data sets alone.

The Distributed Robot Interaction Dataset (DROID), published last month, was created by researchers at 13 institutions, including companies like Google DeepMind and top universities like Stanford and Carnegie Mellon. It contains 350 hours of data generated by humans doing tasks ranging from closing a waffle maker to cleaning up a desk. Since the data was collected using hardware that’s common in the robotics world, researchers can use it to create AI models and then test those models on equipment they already have.

The effort builds on the success of the Open X-Embodiment Collaboration, a similar project from Google DeepMind that aggregated data on 527 skills, collected from a variety of different types of hardware. The data set helped build Google DeepMind’s RT-X model, which can turn text instructions (for example, “Move the apple to the left of the soda can”) into physical movements.

Robotics models built on open-source data like this can be impressive, says Lerrel Pinto, a researcher who runs the General-purpose Robotics and AI Lab at New York University. But they can’t perform across a wide enough range of use cases to compete with proprietary models built by leading private companies. What is available via open source is simply not enough for labs to successfully build models at a scale that would produce the gold standard: robots that have general capabilities and can receive instructions through text, image, and video.

“The biggest limitation is the data,” he says. Only wealthy companies have enough.

These companies’ data advantage is only getting more thoroughly cemented over time. In their pursuit of more training data, private robotics companies with large customer bases have a not-so-secret weapon: their robots themselves are perpetual data-collecting machines.

Covariant, a robotics company founded in 2017 by OpenAI researchers, deploys robots trained to identify and pick items in warehouses for companies like Crate & Barrel and Bonprix. These machines constantly collect footage, which is then sent back to Covariant. Every time the robot fails to pick up a bottle of shampoo, for example, it becomes a data point to learn from, and the model improves its shampoo-picking abilities for next time. The result is a massive, proprietary data set collected by the company’s own machines.

This data set is part of why earlier this year Covariant was able to release a powerful foundation model, as AI models capable of a variety of uses are known. Customers can now communicate with its commercial robots much as you’d converse with a chatbot: you can ask questions, show photos, and instruct it to take a video of itself moving an item from one crate to another. These customer interactions with the model, which is called RFM-1, then produce even more data to help it improve.

Peter Chen, cofounder and CEO of Covariant, says exposing the robots to a number of different objects and environments is crucial to the model’s success. “We have robots handling apparel, pharmaceuticals, cosmetics, and fresh groceries,” he says. “It’s one of the unique strengths behind our data set.” Up next will be bringing its fleet into more sectors and even having the AI model power different types of robots, like humanoids, Chen says.

Learning from video

The scarcity of high-quality teleoperation and real-world data has led some roboticists to propose bypassing that collection method altogether. What if robots could just learn from videos of people?

Such video data is easier to produce, but unlike teleoperation data, it lacks “kinematic” data points, which plot the exact movements of a robotic arm as it moves through space.

Researchers from the University of Washington and Nvidia have created a workaround, building a mobile app that lets people train robots using augmented reality. Users take videos of themselves completing simple tasks with their hands, like picking up a mug, and the AR program can translate the results into waypoints for the robotics software to learn from.

Meta AI is pursuing a similar collection method on a larger scale through its Ego4D project, a data set of more than 3,700 hours of video taken by people around the world doing everything from laying bricks to playing basketball to kneading bread dough. The data set is broken down by task and contains thousands of annotations, which detail what’s happening in each scene, like when a weed has been removed from a garden or a piece of wood is fully sanded.

Learning from video data means that robots can encounter a much wider variety of tasks than they could if they relied solely on human teleoperation (imagine folding croissant dough with robot arms). That’s important, because just as powerful language models need complex and diverse data to learn, roboticists can create their own powerful models only if they expose robots to thousands of tasks.

To that end, some researchers are trying to wring useful insights from a vast source of abundant but low-quality data: YouTube. With thousands of hours of video uploaded every minute, there is no shortage of available content. The trouble is that most of it is pretty useless for a robot. That’s because it’s not labeled with the types of information robots need, like annotations or kinematic data.

SARAH ROGERS/MITTR | GETTY

“You can say [to a robot], Oh, this is a person playing Frisbee with their dog,” says Chen, of Covariant, imagining a typical video that might be found on YouTube. “But it’s very difficult for you to say, Well, when this person throws a Frisbee, this is the acceleration and the rotation and that’s why it flies this way.”

Nonetheless, a few attempts have proved promising. When he was a postdoc at Stanford, AI researcher Emmett Goodman looked into how AI could be brought into the operating room to make surgeries safer and more predictable. Lack of data quickly became a roadblock. In laparoscopic surgeries, surgeons often use robotic arms to manipulate surgical tools inserted through very small incisions in the body. Those robotic arms have cameras capturing footage that can help train models, once personally identifying information has been removed from the data. In more traditional open surgeries, on the other hand, surgeons use their hands instead of robotic arms. That produces much less data to build AI models with.

“That is the main barrier to why open-surgery AI is the slowest to develop,” he says. “How do you actually collect that data?”

To tackle that problem, Goodman trained an AI model on thousands of hours of open-surgery videos, taken by doctors with handheld or overhead cameras, that his team gathered from YouTube (with identifiable information removed). His model, as described in a paper in the medical journal JAMA in December 2023, could then identify segments of the operations from the videos. This laid the groundwork for creating useful training data, though Goodman admits that the barriers to doing so at scale, like patient privacy and informed consent, have not been overcome.

Uncharted legal waters

Chances are that wherever roboticists turn for their new troves of training data, they’ll at some point have to wrestle with some major legal battles.

The makers of large language models are already having to navigate questions of credit and copyright. A lawsuit filed by the New York Times alleges that ChatGPT copies the expressive style of its stories when generating text. The chief technical officer of OpenAI recently made headlines when she said the company’s video generation tool Sora was trained on publicly available data, sparking a critique from YouTube’s CEO, who said that if Sora learned from YouTube videos, it would be a violation of the platform’s terms of service.

“It is an area where there’s a substantial amount of legal uncertainty,” says Frank Pasquale, a professor at Cornell Law School. If robotics companies want to join other AI companies in using copyrighted works in their training sets, it’s unclear whether that’s allowed under the fair-use doctrine, which permits copyrighted material to be used without permission in a narrow set of circumstances. An example often cited by tech companies and those sympathetic to their view is the 2015 case of Google Books, in which courts found that Google did not violate copyright laws in making a searchable database of millions of books. That legal precedent may tilt the scales slightly in tech companies’ favor, Pasquale says.

It’s far too soon to tell whether legal challenges will slow down the robotics rocket ship, since AI-related cases are sprawling and still undecided. But it’s safe to say that roboticists scouring YouTube or other internet video sources for training data will be wading in fairly uncharted waters.

The next era

Not every roboticist feels that data is the missing link for the next breakthrough. Some argue that if we build a good enough virtual world for robots to learn in, maybe we don’t need training data from the real world at all. Why go through the effort of training a pancake-flipping robot in a real kitchen, for example, if it could learn through a digital simulation of a Waffle House instead?

Roboticists have long used simulator programs, which digitally replicate the environments that robots navigate through, often down to details like the texture of the floorboards or the shadows cast by overhead lights. But as powerful as they are, roboticists using these programs to train machines have always had to work around that sim-to-real gap.

Now the gap might be shrinking. Advanced image generation techniques and faster processing are allowing simulations to look more like the real world. Nvidia, which leveraged its experience in video game graphics to build the leading robotics simulator, called Isaac Sim, announced last month that leading humanoid robotics companies like Figure and Agility are using its program to build foundation models. These companies build virtual replicas of their robots in the simulator and then unleash them to explore a range of new environments and tasks.

Deepu Talla, vice president of robotics and edge computing at Nvidia, doesn’t hold back in predicting that this way of training will nearly replace the act of training robots in the real world. It’s simply far cheaper, he says.

“It’s going to be a million to one, if not more, in terms of how much stuff is going to be done in simulation,” he says. “Because we can afford to do it.”

But if models can solve some of the “cognitive” problems, like learning new tasks, there are a host of challenges to realizing that success in an effective and safe physical form, says Aaron Saunders, chief technology officer of Boston Dynamics. We’re a long way from building hardware that can sense different types of materials, scrub and clean, or apply a gentle amount of force.

“There’s still a massive piece of the equation around how we’re going to program robots to actually act on all that information to interact with that world,” he says.

If we solved that problem, what would the robotic future look like? We could see nimble robots that help people with physical disabilities move through their homes, autonomous drones that clean up pollution or hazardous waste, or surgical robots that make microscopic incisions, leading to operations with a reduced risk of complications. For all these optimistic visions, though, more controversial ones are already brewing. The use of AI by militaries worldwide is on the rise, and the emergence of autonomous weapons raises troubling questions.

The labs and companies poised to lead in the race for data include, at the moment, the humanoid-robot startups beloved by investors (Figure AI was recently boosted by a $675 million funding round), commercial companies with sizable fleets of robots collecting data, and drone companies buoyed by significant military investment. Meanwhile, smaller academic labs are doing more with less to create data sets that rival those available to Big Tech.

But what’s clear to everyone I speak with is that we’re at the very beginning of the robot data race. Since the correct way forward is far from obvious, all roboticists worth their salt are pursuing any and all methods to see what sticks.

There “isn’t really a consensus” in the field, says Benjamin Burchfiel, a senior research scientist in robotics at TRI. “And that’s a healthy place to be.”