This story first appeared in China Report, MIT Technology Review’s newsletter about technology in China. Sign up to receive it in your inbox every Tuesday.

Last week’s release of GPT-4o, a new AI “omnimodel” that you can interact with using voice, text, or video, was supposed to be a big moment for OpenAI. But just days later, it feels as if the company is in big trouble. From the resignation of most of its safety team to Scarlett Johansson’s accusation that it replicated her voice for the model without her consent, it’s now in damage-control mode.

Add to that another thing OpenAI fumbled with GPT-4o: the data it used to train its tokenizer—a tool that helps the model parse and process text more efficiently—is polluted by Chinese spam websites. As a result, the model’s Chinese token library is full of phrases related to pornography and gambling. This could worsen some problems that are common with AI models: hallucinations, poor performance, and misuse. 

I wrote about it on Friday after several researchers and AI industry insiders flagged the problem. They took a look at GPT-4o’s public token library, which has been significantly updated with the new model to improve support of non-English languages, and saw that more than 90 of the 100 longest Chinese tokens in the model are from spam websites. These are phrases like “_free Japanese porn video to watch,” “Beijing race car betting,” and “China welfare lottery every day.”
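For readers who want to repeat this kind of check, here is a minimal sketch in Python of how one might pull the longest Chinese tokens out of a tokenizer vocabulary. It is not the researchers' actual script. The function names are hypothetical, the "Chinese" test is a simplification (it only checks the main CJK Unified Ideographs block), and the optional loader assumes the open-source `tiktoken` package is installed and that its `o200k_base` encoding is the one GPT-4o uses.

```python
from typing import Iterable, List


def is_chinese(text: str) -> bool:
    """True if the text is non-empty and every character is a CJK ideograph.

    Simplification: only the main CJK Unified Ideographs block is checked.
    """
    return bool(text) and all("\u4e00" <= ch <= "\u9fff" for ch in text)


def longest_chinese_tokens(vocab: Iterable[bytes], top_n: int = 100) -> List[str]:
    """Return up to top_n of the longest purely Chinese tokens, longest first."""
    chinese = []
    for raw in vocab:
        try:
            text = raw.decode("utf-8").strip()  # many tokens carry a leading space
        except UnicodeDecodeError:
            continue  # token is a fragment of a multi-byte character; skip it
        if is_chinese(text):
            chinese.append(text)
    return sorted(chinese, key=len, reverse=True)[:top_n]


def load_gpt4o_vocab() -> List[bytes]:
    """Optional loader. Assumes tiktoken is installed and can fetch o200k_base."""
    import tiktoken  # imported here so the functions above work without it

    return tiktoken.get_encoding("o200k_base").token_byte_values()


if __name__ == "__main__":
    # Tiny hand-made vocabulary for illustration; not real GPT-4o data.
    sample = [
        "中国".encode("utf-8"),
        b"hello",
        "中华人民共和国".encode("utf-8"),
        b"\xe4\xb8",  # incomplete UTF-8 sequence
        " 每天".encode("utf-8"),
    ]
    for token in longest_chinese_tokens(sample, top_n=3):
        print(len(token), token)
```

Passing the result of `load_gpt4o_vocab()` to `longest_chinese_tokens` would produce a list that a Chinese reader could then inspect by hand, which is essentially the check described above.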

Anyone who reads Chinese could spot the problem with this list of tokens right away. Some such phrases inevitably slip into training data sets because of how popular adult content is online, but for them to account for more than 90% of the longest Chinese tokens in the model? That’s alarming.

“It’s an embarrassing thing to see as a Chinese person. Is that just how the quality of the [Chinese] data is? Is it because of insufficient data cleaning or is the language just like that?” says Zhengyang Geng, a PhD student in computer science at Carnegie Mellon University. 

It could be tempting to draw a conclusion about a language or a culture from the tokens OpenAI chose for GPT-4o. After all, these are selected as commonly seen and significant phrases from the respective languages. There’s an interesting blog post by a Hong Kong–based researcher named Henry Luo, who queried the longest GPT-4o tokens in various languages and found that they seem to have different themes. While the tokens in Russian reflect language about the government and public institutions, the tokens in Japanese include a lot of different ways to say “thank you.”

But rather than reflecting the differences between cultures or countries, I think this explains more about what kind of training data is readily available online, and the websites OpenAI crawled to feed into GPT-4o.

After I published the story, Victor Shih, a political science professor at the University of California, San Diego, commented on it on X: “When you try not [to] train on Chinese state media content, this is what you get.”

It’s half a joke, and half a serious point about the two biggest problems in training large language models to speak Chinese: the readily available data online reflects either the “official,” sanctioned way of talking about China or the omnipresent spam content that drowns out real conversations.

In fact, among the few long Chinese tokens in GPT-4o that aren’t either pornography or gambling nonsense, two are “socialism with Chinese characteristics” and “People’s Republic of China.” The presence of these phrases suggests that a significant part of the training data actually is from Chinese state media writings, where formal, long expressions are extremely common.

OpenAI has historically been very tight-lipped about the data it uses to train its models, and it probably will never tell us how much of its Chinese training database is state media and how much is spam. (OpenAI didn’t respond to MIT Technology Review’s detailed questions sent on Friday.)

But it is not the only company struggling with this problem. People inside China who work in its AI industry agree there’s a lack of quality Chinese text data sets for training LLMs. One reason is that the Chinese internet used to be, and largely remains, divided up by big companies like Tencent and ByteDance. They own most of the social platforms and aren’t going to share their data with competitors or third parties to train LLMs. 

In fact, this is also why search engines, including Google, kinda suck when it comes to searching in Chinese. Since WeChat content can only be searched on WeChat, and content on Douyin (the Chinese TikTok) can only be searched on Douyin, this data is not accessible to a third-party search engine, let alone an LLM. But these are the platforms where actual human conversations are happening, instead of some spam website that keeps trying to draw you into online gambling.

The lack of quality training data is a much bigger problem than the failure to filter out the porn and general nonsense in GPT-4o’s token-training data. If there isn’t an existing data set, AI companies have to put in significant work to identify, source, and curate their own data sets and filter out inappropriate or biased content. 

It doesn’t seem OpenAI did that, which in fairness makes some sense, given that people in China can’t use its AI models anyway. 

Still, there are many people living outside China who want to use AI services in Chinese. And they deserve a product that works properly as much as speakers of any other language do. 

How can we solve the problem of the lack of good Chinese LLM training data? Tell me your idea at zeyi@technologyreview.com.

Now read the rest of China Report

Catch up with China

1. China launched an anti-dumping investigation into imports of polyoxymethylene copolymer—a widely used plastic in electronics and cars—from the US, the EU, Taiwan, and Japan. It’s widely seen as a response to the new US tariff announced on Chinese EVs. (BBC)

Meanwhile, Latin American countries, including Mexico, Chile, and Brazil, have increased tariffs on steel imported from China, testing China’s relationship with the region. (Bloomberg $)

2. China’s solar-industry boom is incentivizing farmers to install solar panels and make some extra cash by selling the electricity they generate. (Associated Press)

3. Hedging against the potential devaluation of the RMB, Chinese buyers are pushing the price of gold to all-time highs. (Financial Times $)

4. The Shanghai government set up a pilot project that allows data to be transferred out of China without going through the much-dreaded security assessments, a move that has been sought by companies like Tesla. (Reuters $)

5. China’s central bank fined seven businesses—including a KFC and branches of state-owned corporations—for rejecting cash payments. The popularization of mobile payment has been a good thing, but the dwindling support for cash is also making life harder for people like the elderly and foreign tourists. (Business Insider $)

6. Alibaba and Baidu are waging an LLM price war in China to attract more users. (Bloomberg $)

7. The Chinese government has sanctioned Mike Gallagher, a former Republican congressman who chaired the Select Committee on China and remains a fierce critic of Beijing. (NBC News)

Lost in translation

China’s National Health Commission is exploring the relaxation of stringent rules around human genetic data to boost the biotech industry, according to the Chinese publication Caixin. A regulation enacted in 1998 required any research that involves the use of this data to clear an approval process. And there’s even more scrutiny if the research involves foreign institutions. 

In the early years of human genetic research, the regulation helped prevent the nonconsensual collection of DNA. But as the use of genetic data becomes increasingly important in discovering new treatments, the industry has been complaining about the bureaucracy, which can add an extra two to four months to research projects. Now the government is holding discussions on how to revise the regulation, potentially lifting the approval process for smaller-scale research and more foreign entities, as part of a bid to accelerate the growth of biotech research in China.

One more thing

Did you know that the Beijing Capital International Airport has been employing birds of prey to chase away other birds since 2019? This month, the second generation of Beijing’s birdy employees started their work driving away the migratory birds that could endanger aircraft. The airport even has different kinds of raptors—Eurasian hobbies, Eurasian goshawks, and Eurasian sparrowhawks—to deal with the different bird species that migrate to Beijing at different times.
