Hidden patterns buried in AI-generated texts could help identify them as such, allowing us to tell whether the words we’re reading are written by a human or not.
These “watermarks” are invisible to the human eye but let computers detect that the text probably comes from an AI system. If embedded in large language models, they could help prevent some of the problems that these models have already caused.
For example, since OpenAI’s chatbot ChatGPT was launched in November, students have already started using it to cheat by having it write essays for them. News website CNET has used ChatGPT to write articles, only to have to issue corrections amid accusations of plagiarism. But there is a promising way to spot AI text: embedding hidden patterns into these systems before they’re released, so that the text they generate can later be identified.
In studies, these watermarks have already been shown to identify AI-generated text with near certainty. One, developed by a team at the University of Maryland, was able to spot text created by Meta’s open-source language model, OPT-6.7B, using a detection algorithm the team built. The work is described in a paper that has yet to be peer reviewed, and the code will be available for free around February 15.
AI language models work by predicting and generating one word at a time. After each word, the watermarking algorithm randomly divides the language model’s vocabulary into words on a “greenlist” and a “redlist,” and then prompts the language model to choose words on the greenlist.
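To make the greenlist idea concrete, here is a minimal sketch in Python. It is a toy illustration, not the Maryland team's code. It rests on several assumptions that are not in the reporting above: the split is seeded by a hash of the previous word, half the vocabulary goes on the greenlist, and the model is "prompted" toward green words by adding a fixed bonus to their scores. The names `green_list`, `pick_next_word`, `fraction` and `bias` are hypothetical.

```python
import hashlib
import math
import random


def green_list(prev_word, vocab, fraction=0.5):
    """Pseudo-randomly pick a 'green' subset of vocab.

    Assumption: the split is seeded by a hash of the previous word, so
    anyone who knows the rule can rebuild the same list later.
    """
    digest = hashlib.sha256(prev_word.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    shuffled = sorted(vocab)  # fixed starting order before shuffling
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * fraction)
    return set(shuffled[:cutoff])


def pick_next_word(prev_word, scores, bias=2.0, rng=None):
    """Sample the next word after nudging green words upward.

    `scores` maps each candidate word to the model's raw score for it.
    Assumption: the nudge is a constant `bias` added to green scores.
    """
    rng = rng or random.Random()
    green = green_list(prev_word, scores)
    words = sorted(scores)
    weights = [math.exp(scores[w] + (bias if w in green else 0.0)) for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


# Toy vocabulary with equal raw scores, so only the watermark bias matters.
toy_scores = {"flower": 0.0, "orchid": 0.0, "sunset": 0.0, "view": 0.0}
print(sorted(green_list("beautiful", toy_scores)))
print(pick_next_word("beautiful", toy_scores, rng=random.Random(0)))
```

Because the bias only tilts the odds rather than banning red words, a scheme like this can still produce a red word when the model strongly prefers it.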
The more greenlisted words in a passage, the more likely it is that the text was generated by a machine. Text written by a person tends to contain a more random mix of words. For example, for the word that follows “beautiful,” the watermarking algorithm could classify “flower” as green and “orchid” as red. The AI model with the watermarking algorithm would then be more likely to use the word “flower” than “orchid,” explains Tom Goldstein, an assistant professor at the University of Maryland, who was involved in the research.
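Detection can be sketched the same way: count how many words land on the greenlist and ask how surprising that count would be if words were chosen without regard to the lists. The Python below is a toy illustration under stated assumptions, not the team's detector. It assumes the greenlist is rebuilt from a hash of the previous word, that half the vocabulary is green, and that surprise is measured with a standard z-score. The names `green_list` and `green_z_score` are hypothetical.

```python
import hashlib
import math
import random


def green_list(prev_word, vocab, fraction=0.5):
    """Rebuild the (assumed) greenlist that followed `prev_word`."""
    digest = hashlib.sha256(prev_word.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    shuffled = sorted(vocab)
    rng.shuffle(shuffled)
    return set(shuffled[: int(len(shuffled) * fraction)])


def green_z_score(words, vocab, fraction=0.5):
    """How far the green-word count sits above what chance would give.

    Under the no-watermark assumption each word is green with probability
    `fraction`, so the count has mean n * fraction and standard deviation
    sqrt(n * fraction * (1 - fraction)).
    """
    pairs = list(zip(words, words[1:]))
    if not pairs:
        raise ValueError("need at least two words")
    hits = sum(1 for prev, word in pairs if word in green_list(prev, vocab, fraction))
    n = len(pairs)
    return (hits - fraction * n) / math.sqrt(n * fraction * (1 - fraction))


# Toy demo: a passage built only from green words versus a shuffled one.
vocab = [f"w{i}" for i in range(100)]
marked = ["w0"]
for _ in range(20):
    marked.append(min(green_list(marked[-1], vocab)))
unmarked = random.Random(1).sample(vocab, 21)
print(green_z_score(marked, vocab), green_z_score(unmarked, vocab))
```

A large positive score suggests the watermark is present, while a score near zero is what unwatermarked writing would be expected to produce. Longer passages give the test more evidence to work with.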
ChatGPT is one of a new breed of large language models that generate fluent text that reads as if a human could have written it. These AI models regurgitate facts confidently, but are notorious for spewing falsehoods and biases. To the untrained eye, it is almost impossible to detect whether a passage was written by an AI model or a human. The breathtaking speed of AI development means that new, more powerful models quickly make our existing synthetic text detection toolkit less effective. AI developers are in a constant race to build new safety tools that can keep up with the latest generation of AI models.
“Right now, it’s the Wild West,” says John Kirchenbauer, a researcher at the University of Maryland, who was involved in the watermarking work. He hopes watermarking tools might give AI-detection efforts the edge. The tool his team has developed could be adjusted to work with any AI language model that predicts the next word, he says.
The findings are both promising and timely, says Irene Solaiman, policy director at AI startup Hugging Face, who worked on studying AI output detection in her previous role as an AI researcher at OpenAI, but was not involved in this research.
“As models are being deployed at scale, more people outside the AI community, likely without computer science training, will need to access detection methods,” says Solaiman.
There are limitations to this new method, however. Watermarking only works if it is embedded in the large language model by its creators right from the beginning. Although OpenAI is reportedly working on methods to detect AI-generated text, including watermarks, the company remains highly secretive. It doesn’t tend to give external parties much information about how ChatGPT works or was trained, much less access to tinker with it. OpenAI didn’t immediately respond to our request for comment.
It’s also unclear how well the watermark will work with models other than Meta’s, such as ChatGPT, Solaiman says. The AI model the watermark was tested on is also smaller than popular models like ChatGPT.
The researchers say that options for fighting back against watermarking methods are limited. “You’d have to change about half the words in a passage of text before the watermark could be removed,” says Goldstein. However, more testing is needed to explore different ways advanced attackers might try to remove the watermark.
“It’s dangerous to underestimate high schoolers so I won’t do that, but generally the average person will likely be unable to tamper with this kind of watermark,” says Solaiman.