Latest from MIT Tech Review – We could run out of data to train AI language programs

Large language models are one of the hottest areas of AI research right now, with companies racing to release programs like GPT-3 that can write impressively coherent articles and even computer code. But there’s a problem looming on the horizon, according to a team of AI forecasters: we might run out of data to train them on.

Language models are trained using texts from sources like Wikipedia, news articles, scientific papers, and books. In recent years, the trend has been to train these models on more and more data in the hope that it’ll make them more accurate and versatile.

The trouble is, the types of data typically used for training language models may be exhausted in the near future, as early as 2026, according to a paper by researchers from Epoch, an AI research and forecasting organization. The problem arises because, as researchers build more powerful models with greater capabilities, they have to find ever more texts to train them on. Large language model researchers are increasingly concerned that they are going to run out of this sort of data, says Teven Le Scao, a researcher at AI company Hugging Face, who was not involved in Epoch’s work.

The issue stems partly from the fact that language AI researchers filter the data they use to train models into two categories: high quality and low quality. The line between the two categories can be fuzzy, says Pablo Villalobos, a staff researcher at Epoch and the lead author of the paper, but text from the former is viewed as better-written and is often produced by professional writers.

Data in the low-quality category consists of texts like social media posts or comments on websites like 4chan, and greatly outnumbers data considered to be high quality. Researchers typically train models only on data that falls into the high-quality category because that is the type of language they want the models to reproduce. This approach has produced some impressive results for large language models such as GPT-3.
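To make the idea of sorting text into two quality bins concrete, here is a minimal Python sketch. It is an illustration only: the `quality_score` heuristic, the `split_by_quality` helper, and the 0.8 threshold are hypothetical and are not the filters used by Epoch or by any lab mentioned in this article. Real pipelines rely on far richer signals.

```python
PUNCTUATION = ".,;:!?\"'()"

def quality_score(text: str) -> float:
    """Toy heuristic (an assumption for illustration): the share of
    whitespace-separated tokens that are ordinary alphabetic words
    once surrounding punctuation is stripped."""
    tokens = text.split()
    if not tokens:
        return 0.0
    wordlike = 0
    for token in tokens:
        core = token.strip(PUNCTUATION)
        # Count the token only if something alphabetic remains.
        if len(core) > 1 and core.isalpha():
            wordlike += 1
    return wordlike / len(tokens)

def split_by_quality(docs, threshold=0.8):
    """Partition documents into (high, low) lists by the toy score.
    The threshold is arbitrary and chosen only for this example."""
    high, low = [], []
    for doc in docs:
        if quality_score(doc) >= threshold:
            high.append(doc)
        else:
            low.append(doc)
    return high, low

docs = [
    "The committee published its annual report on Tuesday.",
    "lol!!! 2222 >>> ??? xD 10/10",
]
high, low = split_by_quality(docs)
```

With these two sample strings, the news-style sentence lands in `high` and the comment-style string lands in `low`. Moving the threshold changes where the fuzzy line between the two categories falls, which is the kind of reassessment discussed next.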

One way to overcome these data constraints would be to reassess what’s defined as “low” and “high” quality, according to Swabha Swayamdipta, a University of Southern California machine learning professor who specializes in dataset quality. If data shortages push AI researchers to incorporate more diverse datasets into the training process, it would be a “net positive” for language models, Swayamdipta says.

Researchers may also find ways to extend the life of data used for training language models. Currently, large language models are trained on the same data just once, due to performance and cost constraints. But it may be possible to train a model several times using the same data, says Swayamdipta.
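What training "several times using the same data" means mechanically can be shown with a toy example. The sketch below is an assumption-laden illustration, not how large language models are trained: it fits a single weight to four made-up points with plain gradient steps, and the `train` function, learning rate, and data are all hypothetical. Each full pass over the data is one epoch.

```python
def train(data, epochs, learning_rate=0.01):
    """Fit y = w * x by stochastic gradient steps on squared error.
    One epoch is one full pass over the same data set."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            error = y - w * x
            # Gradient step for the squared error on this one example.
            w += learning_rate * error * x
    return w

# Made-up data generated from y = 2x, so the ideal weight is 2.0.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

w_one_pass = train(data, epochs=1)
w_five_passes = train(data, epochs=5)
```

In this toy setting, `w_five_passes` ends up closer to 2.0 than `w_one_pass`, because the extra passes keep extracting signal from the same four points. Whether repeated passes help at the scale of large language models is the open question raised above.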

Some researchers believe bigger may not mean better when it comes to language models anyway. Percy Liang, a computer science professor at Stanford University, says there’s evidence that making models more efficient, rather than simply making them larger, may improve their ability. “We’ve seen how smaller models that are trained on higher-quality data can outperform larger models trained on lower-quality data,” he explains.

Artificial Intelligence
