Since the beginning of the generative AI boom, there has been a fight over how large AI models are trained. In one camp sit tech companies such as OpenAI that have claimed it is “impossible” to train AI without hoovering the internet of copyrighted data. And in the other camp are artists who argue that AI companies have taken their intellectual property without consent and compensation.
Adobe is pretty unusual in that it sides with the latter group, with an approach that stands out as an example of how generative AI products can be built without scraping copyrighted data from the internet. Adobe released its image-generating model Firefly, which is integrated into its popular photo editing tool Photoshop, one year ago.
In an exclusive interview with MIT Technology Review, Adobe’s AI leaders are adamant this is the only way forward. At stake is not just the livelihood of creators, they say, but our whole information ecosystem. What they have learned shows that building responsible tech doesn’t have to come at the cost of doing business.
“We worry that the industry, Silicon Valley in particular, does not pause to ask the ‘how’ or the ‘why.’ Just because you can build something doesn’t mean you should build it without consideration of the impact that you’re creating,” says David Wadhwani, senior vice president of Adobe’s digital media business.
Those questions guided the creation of Firefly. When the generative image boom kicked off in 2022, there was a major backlash against AI from creative communities. Many people were using generative AI models as derivative content machines to create images in the style of another artist, sparking a legal fight over copyright and fair use. The latest generative AI technology has also made it much easier to create deepfakes and misinformation.
It soon became clear that to offer creators proper credit and businesses legal certainty, the company could not build its models by scraping data from the web, Wadhwani says.
Adobe wants to reap the benefits of generative AI while still “recognizing that these are built on the back of human labor. And we have to figure out how to fairly compensate people for that labor now and in the future,” says Ely Greenfield, Adobe’s chief technology officer for digital media.
To scrape or not to scrape
The scraping of online data, commonplace in AI, has recently become highly controversial. AI companies such as OpenAI, Stability.AI, Meta, and Google are facing numerous lawsuits over AI training data. Tech companies argue that publicly available data is fair game. Writers and artists disagree and are pushing for a license-based model, where creators would get compensated for having their work included in training datasets.
Adobe trained Firefly on content that had an explicit license allowing AI training, which means the bulk of the training data comes from Adobe’s library of stock photos, says Greenfield. The company offers creators extra compensation when material is used to train AI models, he adds.
This is in contrast to the status quo in AI today, where tech companies scrape the web indiscriminately and have a limited understanding of what the training data includes. Because of these practices, the AI datasets inevitably include copyrighted content and personal data, and research has uncovered toxic content, such as child sexual abuse material.
Scraping the internet gives tech companies a cheap way to get lots of AI training data, and traditionally, having more data has allowed developers to build more powerful models. Limiting Firefly to licensed data for training was a risky bet, says Greenfield.
“To be honest, when we started with Firefly with our image model, we didn’t know whether or not we would be able to satisfy customer needs without scraping the web,” says Greenfield.
“And we found we could, which was great.”
Human content moderators also review the training data to weed out objectionable or harmful content, known intellectual property, and images of known people, and the company has licenses for everything its products train on.
Adobe’s strategy has been to integrate generative AI tools into its existing products, says Greenfield. In Photoshop, for example, Firefly users can fill in areas of an image using text commands. This allows them much more control over the creative process, and it aids their creativity.
Still, more work needs to be done. The company wants to make Firefly even faster. Currently it takes around 10 seconds for the company’s content moderation algorithms to check the outputs of the model, for example, Greenfield says. Adobe is also trying to figure out how some business customers could generate copyrighted content, such as Marvel characters or Mickey Mouse. Adobe has teamed up with companies such as IBM, Mattel, NVIDIA and NASCAR, which allows these companies to use the tool with their intellectual property. It is also working on audio, lip synching tools and 3D generation.
Garbage in, garbage out
The decision to not scrape the internet also gives Adobe an edge in content moderation. Generative AI is notoriously difficult to control, and developers themselves don’t know why the models generate the images and texts they do. Generative AI models have put out questionable and toxic content in numerous cases.
That all comes down to what it has been trained on, Greenfield says. He says Adobe’s model has never seen a picture of Joe Biden or Donald Trump, for example, and it cannot be coaxed into generating political misinformation. The AI model’s training data has no news content or famous people. It has not been trained on any copyrighted material, such as images of Mickey Mouse.
“It just doesn’t understand what that concept is,” says Greenfield.
Adobe also applies automated content moderation at the point of creation to check that Firefly’s creations are safe for professional use. The model is prohibited from creating news stories or violent images. Some names of artists are also blocked. Firefly-generated content comes with labels that indicate it has been created using AI and that show the image’s edit history.
During a critical election year, the need to know who made a piece of content, and how, is especially important. Adobe has been a vocal advocate for labels on AI content that tell where it originated, and with whom.
The company started the Content Authenticity Initiative, an association promoting the use of labels that tell you whether content is AI-generated or not, along with the New York Times and Twitter (now X). The initiative now has over 2,500 members. Adobe is also helping to develop C2PA, an industry-standard label that shows where a piece of content has come from and how it was created.
“We’re long overdue [for] a better education in media literacy and tools that support people’s ability to validate any content that claims to represent reality,” Greenfield says.
Adobe’s approach highlights the need for AI companies to be thinking deeply about content moderation, says Claire Leibowicz, head of AI and media integrity at the nonprofit Partnership on AI.
Adobe’s approach toward generative AI serves those societal goals by fighting misinformation as well as business goals, such as preserving creator autonomy and attribution, adds Leibowicz.
“The business mission of Adobe is not to prevent misinformation, per se,” she says. “It’s to empower creators. And isn’t this a really elegant confluence of mission and tactics, to be able to kill two birds with one stone?”
Wadhwani agrees. The company says Firefly-powered features are among its most popular, and 90% of Firefly’s web app users are entirely new customers to Adobe.
“I think our approach has definitely been good for business,” Wadhwani says.