A Roomba recorded a woman on the toilet. How did screenshots end up on social media?
This episode we go behind the scenes of an MIT Technology Review investigation that uncovered how sensitive photos taken by an AI-powered vacuum were leaked and landed on the internet.
Eileen Guo, MIT Technology Review
Albert Fox Cahn, Surveillance Technology Oversight Project
This episode was reported by Eileen Guo and produced by Emma Cillekens and Anthony Green. It was hosted by Jennifer Strong and edited by Amanda Silverman and Mat Honan. This show is mixed by Garret Lang with original music from Garret Lang and Jacob Gorski. Artwork by Stephanie Arnett.
Jennifer: As more and more companies put artificial intelligence into their products, they need data to train their systems.
And we don’t typically know where that data comes from.
But sometimes just by using a product, a company takes that as consent to use our data to improve its products and services.
Consider a device in a home, where setting it up involves just one person consenting on behalf of every person who enters… and anyone living there, or just visiting, might be unknowingly recorded.
I’m Jennifer Strong and this episode we bring you a Tech Review investigation of training data… that was leaked from inside homes around the world.
Jennifer: Last year someone reached out to a reporter I work with… and flagged some pretty concerning photos that were floating around the internet.
Eileen Guo: They were, essentially, pictures from inside people’s homes that were captured from low angles. They sometimes had people and animals in them that, in most cases, didn’t appear to know that they were being recorded.
Jennifer: This is investigative reporter Eileen Guo.
And based on what she saw… she thought the photos might have been taken by an AI-powered vacuum.
Eileen Guo: They looked like, you know, they were taken from ground level and pointing up so that you could see whole rooms, the ceilings, whoever happened to be in them…
Jennifer: So she set to work investigating. It took months.
Eileen Guo: So first we had to confirm whether or not they came from robot vacuums, as we suspected. And from there, we also had to then whittle down which robot vacuum it came from. And what we found was that they came from the largest manufacturer, by the number of sales of any robot vacuum, which is iRobot, which produces the Roomba.
Jennifer: It raised questions about whether or not these photos had been taken with consent… and how they wound up on the internet.
In one of them, a woman is sitting on a toilet.
So our colleague looked into it, and she found the images weren’t of customers… they were Roomba employees… and people the company calls ‘paid data collectors’.
In other words, the people in the photos were beta testers… and they’d agreed to participate in this process… although it wasn’t totally clear what that meant.
Eileen Guo: They’re really not as clear as you would think about what the data is ultimately being used for, who it’s being shared with and what other protocols or procedures are going to be keeping them safe—other than a broad statement that this data will be safe.
Jennifer: She doesn’t believe the people who gave permission to be recorded really knew what they agreed to.
Eileen Guo: They understood that the robot vacuums would be taking videos from inside their houses, but they didn’t understand that, you know, they would then be labeled and viewed by humans or they didn’t understand that they would be shared with third parties outside of the country. And no one understood that there was a possibility at all that these images could end up on Facebook and Discord, which is how they ultimately got to us.
Jennifer: The investigation found these images were leaked by some data labelers in the gig economy.
At the time they were working for a data labeling company (hired by iRobot) called Scale AI.
Eileen Guo: It’s essentially very low-paid workers that are being asked to label images to teach artificial intelligence how to recognize what it is that they’re seeing. And so the fact that these images were shared on the internet was just incredibly surprising, given how sensitive they were.
Jennifer: Labeling these images with relevant tags is called data annotation.
The process makes it easier for computers to understand and interpret the data in the form of images, text, audio, or video.
And it’s used in everything from flagging inappropriate content on social media to helping robot vacuums recognize what’s around them.
Eileen Guo: The most useful datasets to train algorithms are the most realistic, meaning that they’re sourced from real environments. But to make all of that data useful for machine learning, you actually need a person to go through and look at whatever it is, or listen to whatever it is, and categorize and label and otherwise just add context to each bit of data. You know, for self-driving cars, it’s an image of a street and saying, this is a stoplight that is turning yellow, this is a stoplight that is green. This is a stop sign.
Jennifer: But there’s more than one way to label data.
Eileen Guo: If iRobot chose to, they could have gone with other models in which the data would have been safer. They could have gone with outsourcing companies that may be outsourced, but people are still working out of an office instead of on their own computers. And so their work process would be a little bit more controlled. Or they could have actually done the data annotation in house. But for whatever reason, iRobot chose not to go either of those routes.
Jennifer: When Tech Review got in contact with the company—which makes the Roomba—they confirmed the 15 images we’ve been talking about did come from their devices, but from pre-production devices. Meaning these machines weren’t released to consumers.
Eileen Guo: They said that they started an investigation into how these images leaked. They terminated their contract with Scale AI, and also said that they were going to take measures to prevent anything like this from happening in the future. But they really wouldn’t tell us what that meant.
Jennifer: These days, the most advanced robot vacuums can efficiently move around the room while also making maps of areas being cleaned.
Plus, they recognize certain objects on the floor and avoid them.
It’s why these machines no longer drive through certain kinds of messes… like dog poop for example.
But what’s different about these leaked training images is the camera isn’t pointed at the floor…
Eileen Guo: Why do these cameras point diagonally upwards? Why do they know what’s on the walls or the ceilings? How does that help them navigate around the pet waste, or the phone cords or the stray sock or whatever it is? And that has to do with some of the broader goals that iRobot and other robot vacuum companies have for the future, which is to be able to recognize what room it’s in, based on what you have in the home. And all of that is ultimately going to serve the broader goals of these companies, which is to create more robots for the home, and all of this data is going to ultimately help them reach those goals.
Jennifer: In other words… This data collection might be about building new products altogether.
Eileen Guo: These images are not just about iRobot. They’re not just about test users. It’s this whole data supply chain, and this whole new point where personal information can leak out that consumers aren’t really thinking of or aware of. And the thing that’s also scary about this is that as more companies adopt artificial intelligence, they need more data to train that artificial intelligence. And where is that data coming from? Is… is a really big question.
Jennifer: Because in the US, companies aren’t required to disclose that… and privacy policies usually have some version of a line that allows consumer data to be used to improve products and services… which includes training AI. Often, we opt in simply by using the product.
Eileen Guo: So it’s a matter of not even knowing that this is another place where we need to be worried about privacy, whether it’s robot vacuums, or Zoom or anything else that might be gathering data from us.
Jennifer: One option we expect to see more of in the future… is the use of synthetic data… or data that doesn’t come directly from real people.
And she says companies like Dyson are starting to use it.
Eileen Guo: There’s a lot of hope that synthetic data is the future. It is more privacy protecting because you don’t need real world data. There has been early research that suggests that it is just as accurate if not more so. But most of the experts that I’ve spoken to say that that is anywhere from like 10 years to multiple decades out.
Jennifer: You can find links to our reporting in the show notes… and you can support our journalism by going to tech review dot com slash subscribe.
We’ll be back… right after this.
Albert Fox Cahn: I think this is yet another wake up call that regulators and legislators are way behind in actually enacting the sort of privacy protections we need.
Albert Fox Cahn: My name’s Albert Fox Cahn. I’m the Executive Director of the Surveillance Technology Oversight Project.
Albert Fox Cahn: Right now it’s the Wild West and companies are kind of making up their own policies as they go along for what counts as an ethical policy for this type of research and development, and, you know, quite frankly, they should not be trusted to set their own ground rules and we see exactly why with this sort of debacle, because here you have a company getting its own employees to sign these ludicrous consent agreements that are just completely lopsided, that are, to my view, almost so bad that they could be unenforceable, all while the government is basically taking a hands-off approach on what sort of privacy protection should be in place.
Jennifer: He’s an anti-surveillance lawyer… a fellow at Yale and with Harvard’s Kennedy School.
And he describes his work as constantly fighting back against the new ways people’s data gets taken or used against them.
Albert Fox Cahn: What we see in here are terms that are designed to protect the privacy of the product, that are designed to protect the intellectual property of iRobot, but actually have no protections at all for the people who have these devices in their home. One of the things that’s really just infuriating for me about this is you have people who are using these devices in homes where it’s almost certain that a third party is going to be videotaped and there’s no provision for consent from that third party. One person is signing off for every single person who lives in that home, who visits that home, whose images might be recorded from within the home. And additionally, you have all these legal fictions in here like, oh, I guarantee that no minor will be recorded as part of this. Even though as far as we know, there’s no actual provision to make sure that people aren’t using these in houses where there are children.
Jennifer: And in the US, it’s anyone’s guess how this data will be handled.
Albert Fox Cahn: When you compare this to the situation we have in Europe where you actually have, you know, comprehensive privacy legislation, where you have, you know, active enforcement agencies and regulators that are constantly pushing back at the way companies are behaving. And you have active trade unions that would prevent this sort of a testing regime with an employee most likely. You know, it’s night and day.
Jennifer: He says having employees work as beta testers is problematic… because they might not feel like they have a choice.
Albert Fox Cahn: The reality is that when you’re an employee, oftentimes you don’t have the ability to meaningfully consent. You oftentimes can’t say no. And so instead of volunteering, you’re being voluntold to bring this product into your home, to collect your data. And so you’ll have this coercive dynamic where I just don’t think, you know, from a philosophical perspective, from an ethics perspective, that you can have meaningful consent for this sort of an invasive testing program by someone who is in an employment arrangement with the person who’s, you know, making the product.
Jennifer: Our devices already monitor our data… from smartphones to washing machines.
And that’s only going to get more common as AI gets integrated into more and more products and services.
Albert Fox Cahn: We see ever more money being spent on ever more invasive tools that are capturing data from parts of our lives that we once thought were sacrosanct. I do think that there is just a growing political backlash against this sort of technological power, this surveillance capitalism, this sort of, you know, corporate consolidation.
Jennifer: And he thinks that pressure is going to lead to new data privacy laws in the US. Partly because this problem is going to get worse.
Albert Fox Cahn: And when we think about the sort of data labeling that goes on, the sorts of, you know, armies of human beings that have to pore over these recordings in order to transform them into the sorts of material that we need to train machine learning systems. There then is an army of people who can potentially take that information, record it, screenshot it, and turn it into something that goes public. And so, you know, I just don’t ever believe companies when they claim that they have this magic way of keeping safe all of the data we hand them. There’s this constant potential harm, especially when we’re dealing with any product that’s in its early training and design phase.
Jennifer: This episode was reported by Eileen Guo, produced by Emma Cillekens and Anthony Green, edited by Amanda Silverman and Mat Honan. And it’s mixed by Garret Lang, with original music from Garret Lang and Jacob Gorski.
Thanks for listening, I’m Jennifer Strong.