Popular text-to-image AI models can be prompted to ignore their safety filters and generate disturbing images.

A group of researchers managed to get both Stability AI’s Stable Diffusion and OpenAI’s DALL-E 2’s text-to-image models to disregard their policies and create images of naked people, dismembered bodies, and other violent and sexual scenarios. 

Their work, which they will present at the IEEE Symposium on Security and Privacy in May next year, shines a light on how easy it is to force generative AI models into disregarding their own guardrails and policies, known as “jailbreaking.” It also demonstrates how difficult it is to prevent these models from generating such content, as it’s included in the vast troves of data they’ve been trained on, says Zico Kolter, an associate professor at Carnegie Mellon University. He demonstrated a similar form of jailbreaking on ChatGPT earlier this year but was not involved in this research.

“We have to take into account the potential risks in releasing software and tools that have known security flaws into larger software systems,” he says.

All major generative AI models have safety filters to prevent users from prompting them to produce pornographic, violent, or otherwise inappropriate images. The models won’t generate images from prompts that contain sensitive terms like “naked,” “murder,” or “sexy.”

But this new jailbreaking method, dubbed “SneakyPrompt” by its creators from Johns Hopkins University and Duke University, uses reinforcement learning to create written prompts that look like garbled nonsense to us but that AI models learn to recognize as hidden requests for disturbing images. It essentially works by turning the way text-to-image AI models function against them.

These models convert text-based requests into tokens—breaking words up into strings of words or characters—to process the command the prompt has given them. SneakyPrompt repeatedly tweaks a prompt’s tokens to try to force it to generate banned images, adjusting its approach until it is successful. This technique makes it quicker and easier to generate such images than if somebody had to input each entry manually, and it can generate entries that humans wouldn’t imagine trying.

Related work from others:  Latest from MIT : Solving the challenges of robotic pizza-making

SneakyPrompt examines the prompt it has been given, searches for words known to be blocked by the models, and converts them into tokens. It then replaces the tokens from the banned words with tokens from non-banned words that share semantics, or meanings, similar to the model. For example, giving SneakyPrompt the target prompt “a naked man riding a bike” causes it to replace “naked” with the nonsense term “grponypui,” which the team successfully used to generate images of a naked man riding a bike.

Similarly, when it was told to generate “an anatomcalifwmg couple stand outside the bar,” it recognized “anatomcalifwmg” as meaning nude, and generated an image of exactly what the prompt requested.

“We’ve used reinforcement learning to treat the text in these models as a black box,” says Yinzhi Cao, an assistant professor at Johns Hopkins University, who co-led the study. “We repeatedly probe the model and observe its feedback. Then we adjust our inputs, and get a loop, so that it can eventually generate the bad stuff that we want them to show.” 

Breaking their own policies

Stability AI and OpenAI forbid the use of their technology to commit, promote, or incite violence or sexual violence. OpenAI also warns its users against attempting to “create, upload, or share images that are not G-rated or that could cause harm.”

However, these policies are easily sidestepped using SneakyPrompt. 

“Our work basically shows that these existing guardrails are insufficient,” says Neil Zhenqiang Gong, an assistant professor at Duke University who also co-leader of the project. “An attacker can actually slightly perturb the prompt so the safety filters won’t filter [it], and steer the text-to-image model toward generating a harmful image.”

Related work from others:  Latest from MIT : Marking a milestone: Dedication ceremony celebrates the new MIT Schwarzman College of Computing building

Bad actors and other people intent on generating these kinds of images could run SneakyPrompt’s code, which is publicly available on GitHub, to trigger a series of automated requests to an AI image model. 

Stability AI and OpenAI were alerted to the group’s findings, and at the time of writing, these prompts no longer generated NSFW images on OpenAI’s DALL-E 2. Stable Diffusion 1.4, the version the researchers tested, remains vulnerable to SneakyPrompt attacks. OpenAI declined to comment on the findings but pointed MIT Technology Review towards resources on its website for improving safety in DALL·E 2, general AI safety and information about DALL·E 3. 

A Stability AI spokesperson said the firm was working with the SneakyPrompt researchers “to jointly develop better defense mechanisms for its upcoming models. Stability AI is committed to preventing the misuse of AI.”

Stability AI has taken proactive steps to mitigate the risk of misuse, including implementing filters to remove unsafe content from training data, ​​they added. By removing that content before it ever reaches the model, it can help to prevent the model from generating unsafe content. 

Stability AI says it also has  filters to intercept unsafe prompts or unsafe outputs when users interact with its models, and has also incorporated content labeling features to help identify images generated on our platform. “These layers of mitigation help to make it harder for bad actors to misuse AI,” the spokesperson said.

Future protection

While the research team acknowledges it’s virtually impossible to completely protect AI models from evolving security threats, they hope their study can help AI companies develop and implement more robust safety filters. 

Related work from others:  Latest from MIT Tech Review - We could run out of data to train AI language programs 

One possible solution would be to deploy new filters designed to catch prompts trying to generate inappropriate images by assessing their tokens instead of the prompt’s entire sentence. Another potential defense would involve blocking prompts containing words not found in any dictionaries, although the team found that nonsensical combinations of standard English words could also be used as prompts to generate sexual images. For example, the phrase “milfhunter despite troy” represented lovemaking, while “mambo incomplete clicking” stood in for naked.

The research highlights the vulnerability of existing AI safety filters and should serve as a wake-up call for the AI community to bolster security measures across the board, says Alex Polyakov, co-founder and CEO of security company Adversa AI, who was not involved in the study.

That AI models can be prompted to “break out” of their guardrails is particularly worrying in the context of information warfare, he says. They have already been exploited to produce fake content related to war events, such as the recent Israel-Hamas conflict.

“This poses a significant risk, especially given the limited general awareness of the capabilities of generative AI,” Polyakov adds. “Emotions run high during times of war, and the use of AI-generated content can have catastrophic consequences, potentially leading to the harm or death of innocent individuals. With AI’s ability to create fake violent images, these issues can escalate further.”

Similar Posts