Latest from MIT Tech Review – We could run out of data to train AI language programs

Large language models are one of the hottest areas of AI research right now, with companies racing to release programs like GPT-3 that can write impressively coherent articles and even computer code. But there’s a problem looming on the horizon, according to a team of AI forecasters: we might run out of data to train them on.

Language models are trained using texts from sources like Wikipedia, news articles, scientific papers, and books. In recent years, the trend has been to train these models on more and more data in the hope that it’ll make them more accurate and versatile.

The trouble is, the types of data typically used for training language models may be used up in the near future, as early as 2026, according to a paper by researchers from Epoch, an AI research and forecasting organization. As researchers build more powerful models with greater capabilities, they have to find ever more texts to train them on. Large language model researchers are increasingly concerned that they are going to run out of this sort of data, says Teven Le Scao, a researcher at AI company Hugging Face, who was not involved in Epoch’s work.

The issue stems partly from the fact that language AI researchers filter the data they use to train models into two categories: high quality and low quality. The line between the two categories can be fuzzy, says Pablo Villalobos, a staff researcher at Epoch and the lead author of the paper, but text from the former is viewed as better-written and is often produced by professional writers.

Low-quality data consists of texts like social media posts or comments on websites like 4chan, and it greatly outnumbers data considered to be high quality. Researchers typically train models only on data that falls into the high-quality category because that is the type of language they want the models to reproduce. This approach has produced some impressive results for large language models such as GPT-3.
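To make the two-bucket idea concrete, here is a minimal sketch of sorting documents by where they came from. It is an illustration only, not the filtering method any lab actually uses: the `Document` type, the source label strings, and the `split_by_quality` helper are all hypothetical, and real pipelines rely on much fuzzier signals than a source tag.

```python
from dataclasses import dataclass

# Hypothetical source labels, based on the examples named in the article.
HIGH_QUALITY_SOURCES = {"wikipedia", "news", "scientific_paper", "book"}


@dataclass
class Document:
    source: str  # where the text came from (hypothetical label)
    text: str


def split_by_quality(docs):
    """Split documents into (high, low) lists using only the source label."""
    high, low = [], []
    for doc in docs:
        if doc.source in HIGH_QUALITY_SOURCES:
            high.append(doc)
        else:
            low.append(doc)
    return high, low


corpus = [
    Document("wikipedia", "An encyclopedia entry."),
    Document("social_media", "a short post"),
    Document("book", "A chapter from a novel."),
    Document("forum_comment", "a reply in a thread"),
]

high, low = split_by_quality(corpus)
print(len(high), "high-quality documents;", len(low), "low-quality documents")
```

In this toy setup only the `high` list would be passed on to training, which is why the supply of high-quality text is the part that runs short first.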

One way to overcome these data constraints would be to reassess what’s defined as “low” and “high” quality, according to Swabha Swayamdipta, a University of Southern California machine learning professor who specializes in dataset quality. If data shortages push AI researchers to incorporate more diverse datasets into the training process, it would be a “net positive” for language models, Swayamdipta says.

Researchers may also find ways to extend the life of data used for training language models. Currently, large language models are trained on the same data just once, due to performance and cost constraints. But it may be possible to train a model several times using the same data, says Swayamdipta.
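The difference between one pass and several passes over the same data can be sketched in a few lines. This is a toy illustration under stated assumptions: `training_stream` is a hypothetical helper, the three-item `dataset` stands in for a real corpus, and no actual model is trained.

```python
def training_stream(dataset, epochs=1):
    """Yield (epoch, example) pairs, making `epochs` full passes over the data."""
    for epoch in range(epochs):
        for example in dataset:
            yield epoch, example


# Stand-in for a fixed pool of training text (hypothetical).
dataset = ["doc a", "doc b", "doc c"]

single_pass = list(training_stream(dataset))            # each example seen once
four_passes = list(training_stream(dataset, epochs=4))  # each example seen four times

print(len(single_pass), "training examples from one pass")
print(len(four_passes), "training examples from four passes")
```

Repeating the data increases the number of training examples a model sees without requiring any new text, at the price of the extra compute the article mentions.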

Some researchers believe bigger may not mean better when it comes to language models anyway. Percy Liang, a computer science professor at Stanford University, says there’s evidence that making models more efficient, rather than simply larger, may improve their ability. “We’ve seen how smaller models that are trained on higher-quality data can outperform larger models trained on lower-quality data,” he explains.
