O’Reilly Media – Scaling False Peaks
Humans are notoriously poor at judging distances. There’s a tendency to underestimate, whether it’s the distance along a straight road with a clear run to the horizon or the distance across a valley. When ascending toward a summit, estimation is further confounded by false summits. What you thought was your goal and end point turns out to be a lower peak or simply a contour that, from lower down, looked like a peak. You thought you made it–or were at least close–but there’s still a long way to go.
The story of AI is a story of punctuated progress, but it is also the story of (many) false summits.
In the 1950s, machine translation of Russian into English was considered to be no more complex than dictionary lookups and templated phrases. Natural language processing has come a very long way since then, having burnt through a good few paradigms to get to something we can use on a daily basis. In the 1960s, Marvin Minsky and Seymour Papert proposed the Summer Vision Project for undergraduates: connect a TV camera to a computer and identify objects in the field of view. Computer vision is now something that is commodified for specific tasks, but it continues to be a work in progress and, worldwide, has taken more than a few summers (and AI winters) and many more than a few undergrads.
We can find many more examples across many more decades that reflect naiveté and optimism and–if we are honest–no small amount of ignorance and hubris. The two general lessons to be learned here are not that machine translation involves more than lookups and that computer vision involves more than edge detection, but that when we are confronted by complex problems in unfamiliar domains, we should be cautious of anything that looks simple at first sight, and that when we have successful solutions to a specific sliver of a complex domain, we should not assume those solutions are generalizable. This kind of humility is likely to deliver more meaningful progress and a more measured understanding of such progress. It is also likely to reduce the number of pundits in the future who mock past predictions and ambitions, along with the recurring irony of machine-learning experts who seem unable to learn from the past trends in their own field.
All of which brings us to DeepMind’s Gato and the claim that the summit of artificial general intelligence (AGI) is within reach. The hard work has been done and reaching AGI is now a simple matter of scaling. At best, this is a false summit on the right path; at worst, it’s a local maximum far from AGI, which lies along a very different route in a different range of architectures and thinking.
DeepMind’s Gato is an AI model that can be taught to carry out many different kinds of tasks based on a single transformer neural network. The 604 tasks Gato was trained on vary from playing Atari video games to chat, from navigating simulated 3D environments to following instructions, from captioning images to real-time, real-world robotics. The achievement of note is that it’s underpinned by a single model trained across all tasks rather than different models for different tasks and modalities. Learning how to ace Space Invaders does not interfere with or displace the ability to carry out a chat conversation.
Gato was intended to “test the hypothesis that training an agent which is generally capable on a large number of tasks is possible; and that this general agent can be adapted with little extra data to succeed at an even larger number of tasks.” In this, it succeeded. But how far can this success be generalized in terms of loftier ambitions? The tweet that provoked a wave of responses (this one included) came from DeepMind’s research director, Nando de Freitas: “It’s all about scale now! The game is over!”
The game in question is the quest for AGI, which is closer to what science fiction and the general public think of as AI than the narrower but applied, task-oriented, statistical approaches that constitute commercial machine learning (ML) in practice.
The claim is that AGI is now simply a matter of improving performance, both in hardware and software, and making models bigger, using more data and more kinds of data across more modes. Sure, there’s research work to be done, but now it’s all about turning the dials up to 11 and beyond and, voilà, we’ll have scaled the north face of AGI to plant a flag on the summit.
It’s easy to get breathless at altitude.
When we look at other systems and scales, it’s easy to be drawn to superficial similarities in the small and project them into the large. For example, if we look at water swirling down a plughole and then out into the cosmos at spiral galaxies, we see a similar structure. But these spirals are more closely bound in our desire to see connection than they are in physics. In looking at scaling specific AI to AGI, it’s easy to focus on tasks as the basic unit of intelligence and ability. What we know of intelligence and learning systems in nature, however, suggests the relationships between tasks, intelligence, systems, and adaptation are more complex and more subtle. Simply scaling up one dimension of ability may simply scale up one dimension of ability without triggering emergent generalization.
If we look closely at software, society, physics or life, we see that scaling is usually accompanied by fundamental shifts in organizing principle and process. Each scaling of an existing approach is successful up to a point, beyond which a different approach is needed. You can run a small business using office tools, such as spreadsheets, and a social media page. Reaching Amazon-scale is not a matter of bigger spreadsheets and more pages. Large systems have radically different architectures and properties to either the smaller systems they are built from or the simpler systems that came before them.
It may be that artificial general intelligence is a far more significant challenge than taking task-based models and increasing data, speed, and number of tasks. We typically underappreciate how complex such systems are. We divide and simplify, make progress as a result, only to discover, as we push on, that the simplification was just that; a new model, paradigm, architecture, or schedule is needed to make further progress. Rinse and repeat. Put another way, just because you got to basecamp, what makes you think you can make the summit using the same approach? And what if you can’t see the summit? If you don’t know what you’re aiming for, it’s difficult to plot a course to it.
Instead of assuming the answer, we need to ask: How do we define AGI? Is AGI simply task-based AI for N tasks and a sufficiently large value of N? And, even if the answer to that question is yes, is the path to AGI necessarily task-centric? How much of AGI is performance? How much of AGI is big/bigger/biggest data?
When we look at life and existing learning systems, we learn that scale matters, but not in the sense suggested by a simple multiplier. It may well be that the trick to cracking AGI is to be found in scaling–but down rather than up.
Doing more with less looks to be more important than doing more with more. For example, the GPT-3 language model is based on a network of 175 billion parameters. The first version of DALL-E, the prompt-based image generator, used a 12-billion-parameter version of GPT-3; the second, improved version used only 3.5 billion parameters. And then there’s Gato, which achieves its multitask, multimodal abilities with only 1.2 billion parameters.
These reductions hint at the direction, but it’s not clear that Gato’s, GPT-3’s or any other contemporary architecture is necessarily the right vehicle to reach the destination. For example, how many training examples does it take to learn something? For biological systems, the answer is, in general, not many; for machine learning, the answer is, in general, very many. GPT-3, for example, developed its language model based on 45TB of text. Over a lifetime, a human reads and hears of the order of a billion words; a child is exposed to ten million or so before starting to talk. Mosquitoes can learn to avoid a particular pesticide after a single non-lethal exposure. When you learn a new game–whether video, sport, board or card–you generally only need to be told the rules and then play, perhaps with a game or two for practice and rule clarification, to make a reasonable go of it. Mastery, of course, takes far more practice and dedication, but general intelligence is not about mastery.
And when we look at the hardware and its needs, consider that while the brain is one of the most power-hungry organs of the human body, it still has a modest power consumption of around 12 watts. Over a lifetime, the brain will consume up to 10 MWh; training the GPT-3 language model took an estimated 1 GWh.
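The energy comparison above can be checked with some back-of-the-envelope arithmetic. The sketch below is only illustrative: the 12-watt and 1 GWh figures are the ones quoted in the text, while the 80-year lifespan and the constant power draw are assumptions made here for the sake of the calculation.

```python
# Rough check of the energy figures quoted above.
# Assumptions (not from the article): an 80-year lifespan and constant power draw.

BRAIN_POWER_WATTS = 12           # figure quoted in the text
LIFESPAN_YEARS = 80              # assumed for illustration
HOURS_PER_YEAR = 365.25 * 24     # average year, allowing for leap years
GPT3_TRAINING_WH = 1e9           # 1 GWh, the estimate quoted in the text


def lifetime_energy_wh(power_watts: float, years: float) -> float:
    """Energy in watt-hours used by a constant load over the given number of years."""
    return power_watts * years * HOURS_PER_YEAR


brain_wh = lifetime_energy_wh(BRAIN_POWER_WATTS, LIFESPAN_YEARS)
brain_mwh = brain_wh / 1e6
lifetimes_per_training_run = GPT3_TRAINING_WH / brain_wh

print(f"Brain over {LIFESPAN_YEARS} years: {brain_mwh:.1f} MWh")
print(f"GPT-3 training estimate: about {lifetimes_per_training_run:.0f} such lifetimes")
```

Under these assumptions the brain uses roughly 8.4 MWh over a lifetime, which is consistent with "up to 10 MWh", and the training estimate comes to something like 119 lifetimes of brain activity. A different assumed lifespan shifts the numbers but not the order of magnitude.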
When we talk about scaling, the game is only just beginning.
While hardware and data matter, the architectures and processes that support general intelligence may be necessarily quite different to the architectures and processes that underpin current ML systems. Throwing faster hardware and all the world’s data at the problem is likely to see diminishing returns, although that may well let us scale a false summit from which we can see the real one.