DeepMind’s new model, Gato, has sparked a debate on whether artificial general intelligence (AGI) is nearer: almost at hand, just a matter of scale. Gato is a model that can solve multiple unrelated problems: it can play a large number of different games, label images, chat, operate a robot, and more. Not so many years ago, one problem with AI was that its systems were only good at one thing. After IBM’s Deep Blue defeated Garry Kasparov in chess, it was easy to say “But the ability to play chess isn’t really what we mean by intelligence.” A model that plays chess can’t also play space wars. That’s obviously no longer true; we can now have models capable of doing many different things: 600 of them, in fact, and future models will no doubt do more.

So, are we on the verge of artificial general intelligence, as Nando de Freitas (research director at DeepMind) claims? That the only problem left is scale? I don’t think so. It seems inappropriate to be talking about AGI when we don’t really have a good definition of “intelligence.” If we had AGI, how would we know it? We have a lot of vague notions about the Turing test, but in the final analysis, Turing wasn’t offering a definition of machine intelligence; he was probing the question of what human intelligence means.

Consciousness and intelligence seem to require some sort of agency. An AI can’t choose what it wants to learn, nor can it say “I don’t want to play Go, I’d rather play Chess.” Now that we have computers that can do both, can they “want” to play one game or the other? One reason we know our children (and, for that matter, our pets) are intelligent and not just automatons is that they’re capable of disobeying. A child can refuse to do homework; a dog can refuse to sit. And that refusal is as important to intelligence as the ability to solve differential equations, or to play chess. Indeed, the path towards artificial intelligence is as much about teaching us what intelligence isn’t (as Turing knew) as it is about building an AGI.

Even if we accept that Gato is a huge step on the path towards AGI, and that scaling is the only problem that’s left, it is more than a bit problematic to think that scaling is a problem that’s easily solved. We don’t know how much power it took to train Gato, but GPT-3 required about 1.3 gigawatt-hours: roughly 1/1000th the energy it takes to run the Large Hadron Collider for a year. Granted, Gato is much smaller than GPT-3, though it doesn’t work as well; Gato’s performance is generally inferior to that of single-function models. And granted, a lot can be done to optimize training (and DeepMind has done a lot of work on models that require less energy). But Gato has just over 600 capabilities, focusing on natural language processing, image classification, and game playing. These are only a few of many tasks an AGI will need to perform. How many tasks would a machine need to be able to perform to qualify as a “general intelligence”? Thousands? Millions? Can those tasks even be enumerated? At some point, the project of training an artificial general intelligence sounds like something from Douglas Adams’ novel The Hitchhiker’s Guide to the Galaxy, in which the Earth is a computer designed by an AI called Deep Thought to answer the question “What is the question to which 42 is the answer?”

Building bigger and bigger models in the hope of somehow achieving general intelligence may be an interesting research project, but AI may already have achieved a level of performance that suggests specialized training on top of existing foundation models will reap far more short-term benefits. A foundation model trained to recognize images can be trained further to be part of a self-driving car, or to create generative art. A foundation model like GPT-3, trained to understand and generate human language, can be trained more deeply to write computer code.
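To make that concrete, here’s a minimal sketch of what “training further” usually looks like in practice: continued training (fine-tuning) of a pretrained causal language model on a domain-specific corpus, using the Hugging Face transformers library. The base model (gpt2, a small stand-in), the file domain_corpus.txt, and the hyperparameters are all illustrative assumptions, not details of GPT-3, Copilot, or any particular product.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# A small, openly available model stands in for a large foundation model.
base_model = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical specialization corpus: annotated code, clinical notes, etc.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="specialized-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # a comparatively cheap pass that reuses the expensive general pretraining
```

The specifics matter less than the shape of the process: the broad, expensive pretraining happens once, and specialization is a much smaller, targeted step on top of it.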

Yann LeCun posted a Twitter thread about general intelligence (consolidated on Facebook) stating some “simple facts.” First, LeCun says that there is no such thing as “general intelligence.” He also says that “human-level AI” is a useful goal, acknowledging that human intelligence itself is something less than the type of general intelligence sought for AI. All humans are specialized to some extent. I’m human; I’m arguably intelligent; I can play Chess and Go, but not Xiangqi (often called Chinese Chess) or Golf. I could presumably learn to play other games, but I don’t have to learn them all. I can also play the piano, but not the violin. I can speak a few languages. Some humans can speak dozens, but none of them speak every language.

There’s an important point about expertise hidden in here: we expect our AGIs to be “experts” (to beat top-level Chess and Go players), but as a human, I’m only fair at chess and poor at Go. Does human intelligence require expertise? (Hint: re-read Turing’s original paper about the Imitation Game, and check the computer’s answers.) And if so, what kind of expertise? Humans are capable of broad but limited expertise in many areas, combined with deep expertise in a small number of areas. So this argument is really about terminology: could Gato be a step towards human-level intelligence (limited expertise for a large number of tasks), but not general intelligence?

LeCun agrees that we are missing some “fundamental concepts,” and we don’t yet know what those fundamental concepts are. In short, we can’t adequately define intelligence. More specifically, though, he mentions that “a few others believe that symbol-based manipulation is necessary.” That’s an allusion to the debate (sometimes carried out on Twitter) between LeCun and Gary Marcus, who has argued many times that combining deep learning with symbolic reasoning is the only way for AI to progress. (In his response to the Gato announcement, Marcus labels this school of thought “Alt-intelligence.”) That’s an important point: impressive as models like GPT-3 and GLaM are, they make a lot of mistakes. Sometimes those are simple mistakes of fact, such as when GPT-3 wrote an article about the United Methodist Church that got a number of basic facts wrong. Sometimes the mistakes reveal a horrifying (or hilarious; they’re often the same) lack of what we call “common sense.” Would you sell your children for refusing to do their homework? (To give GPT-3 credit, it points out that selling your children is illegal in most countries, and that there are better forms of discipline.)

It’s not clear, at least to me, that these problems can be solved by “scale.” How much more text would you need to know that humans don’t, normally, sell their children? I can imagine “selling children” showing up in sarcastic or frustrated remarks by parents, along with texts discussing slavery. I suspect there are few texts out there that actually state that selling your children is a bad idea. Likewise, how much more text would you need to know that Methodist general conferences take place every four years, not annually? The general conference in question generated some press coverage, but not a lot; it’s reasonable to assume that GPT-3 had most of the facts that were available. What additional data would a large language model need to avoid making these mistakes? Minutes from prior conferences, documents about Methodist rules and procedures, and a few other things. As modern datasets go, it’s probably not very large: a few gigabytes, at most. But then the question becomes “How many specialized datasets would we need to train a general intelligence so that it’s accurate on any conceivable topic?” Is the answer a million? A billion? What are all the things we might want to know about? Even if any single dataset is relatively small, we’ll soon find ourselves building the successor to Douglas Adams’ Deep Thought.

Scale isn’t going to help. But in that problem lies, I think, a solution. If I were to build an artificial therapist bot, would I want a general language model? Or would I want a language model that has broad knowledge but has received special training to give it deep expertise in psychotherapy? Similarly, if I want a system that writes news articles about religious institutions, do I want a fully general intelligence? Or would it be preferable to train a general model with data specific to religious institutions? The latter seems preferable, and it’s certainly more similar to real-world human intelligence, which is broad, but with areas of deep specialization. Building such an intelligence is a problem we’re already on the road to solving, by using large “foundation models” with additional training to customize them for special purposes. GitHub’s Copilot is one such model; O’Reilly Answers is another.

If a “general AI” is no more than “a model that can do lots of different things,” do we really need it, or is it just an academic curiosity? What’s clear is that we need better models for specific tasks. If the way forward is to build specialized models on top of foundation models, and if this process generalizes from language models like GPT-3 and O’Reilly Answers to other models for different kinds of tasks, then we have a different set of questions to answer. First, rather than trying to build a general intelligence by making an even bigger model, we should ask whether we can build a good foundation model that’s smaller, cheaper, and more easily distributed, perhaps as open source. Google has done some excellent work at reducing power consumption, though that consumption remains huge, and Facebook has released its OPT model with an open license. Does a foundation model actually require anything more than the ability to parse and create sentences that are grammatically correct and stylistically reasonable? Second, we need to know how to specialize these models effectively. We can obviously do that now, but I suspect that training these subsidiary models can be optimized. These specialized models might also incorporate symbolic manipulation, as Marcus suggests; for two of our examples, psychotherapy and religious institutions, symbolic manipulation would probably be essential. If we’re going to build an AI-driven therapy bot, I’d rather have a bot that does that one thing well than a general bot whose mistakes are much subtler than telling patients to commit suicide, and therefore harder to catch. I’d rather have a bot that can collaborate intelligently with humans than one that needs to be watched constantly to ensure that it doesn’t make any egregious mistakes.
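As a sketch of what “optimized” specialization might look like, consider parameter-efficient fine-tuning, in which a small set of adapter weights is trained while the open foundation model itself stays frozen. The example below uses the peft library’s LoRA implementation on Facebook’s OPT; the model size and adapter settings are illustrative assumptions, not recommendations, and this is only one of several techniques being explored.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

# A smaller, openly released foundation model; its weights stay frozen.
model_name = "facebook/opt-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Low-rank adapters: only a tiny fraction of parameters will be trained.
lora = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.05)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the full model

# From here, `model` drops into an ordinary fine-tuning loop (as in the earlier sketch);
# the resulting adapter is a few megabytes that ride on top of one shared base model.
```

Many specialized models sharing one base in this way also speaks to the distribution question: the expensive artifact is trained and shipped once, and each specialization is cheap to train, store, and swap.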

We need the ability to combine models that perform different tasks, and we need the ability to interrogate those models about the results. For example, I can see the value of a chess model that included (or was integrated with) a language model that would enable it to answer questions like “What is the significance of Black’s 13th move in the 4th game of Fischer vs. Spassky?” Or “You’ve suggested Qc5, but what are the alternatives, and why didn’t you choose them?” Answering those questions doesn’t require a model with 600 different abilities. It requires two abilities: chess and language. Moreover, it requires the ability to explain why the AI rejected certain alternatives in its decision-making process. As far as I know, little has been done on this latter question, though the ability to expose other alternatives could be important in applications like medical diagnosis. “What solutions did you reject, and why did you reject them?” seems like important information we should be able to get from an AI, whether or not it’s “general.”
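The chess half of that pairing can already be interrogated; what’s missing is the language layer that turns the engine’s output into an explanation. As a rough sketch (not how any existing product works), a conventional engine can be asked for several candidate moves rather than a single oracle answer. This assumes the python-chess library and a locally installed Stockfish binary; the path below is a placeholder.

```python
import chess
import chess.engine

# Placeholder path to a local Stockfish binary.
engine = chess.engine.SimpleEngine.popen_uci("/usr/local/bin/stockfish")

board = chess.Board()  # or any position you want explained
# multipv asks for the top three lines instead of a single "best move" verdict.
infos = engine.analyse(board, chess.engine.Limit(depth=18), multipv=3)

for rank, info in enumerate(infos, start=1):
    move = info["pv"][0]           # first move of this principal variation
    score = info["score"].white()  # evaluation from White's point of view
    print(f"Candidate {rank}: {board.san(move)}, eval {score}")

engine.quit()
```

Those candidate lines and evaluations are exactly the raw material a language model would need in order to discuss “why not the alternatives?”; generating an explanation that is faithful to the engine’s reasoning, rather than merely plausible, is the part that remains unsolved.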

An AI that can answer those questions seems more relevant than an AI that can simply do a lot of different things.

Optimizing the specialization process is crucial because we’ve turned a technology question into an economic question. How many specialized models, like Copilot or O’Reilly Answers, can the world support? We’re no longer talking about a massive AGI that takes terawatt-hours to train, but about specialized training for a huge number of smaller models. A psychotherapy bot might be able to pay for itself, even though it would need the ability to retrain itself on current events: for example, to deal with patients who are anxious about the invasion of Ukraine. (There is ongoing research on models that can incorporate new information as needed.) It’s not clear that a specialized bot for producing news articles about religious institutions would be economically viable. That’s the third question we need to answer about the future of AI: what kinds of economic models will work? Since AI models are essentially cobbling together answers from other sources that have their own licenses and business models, how will our future agents compensate the sources from which their content is derived? How should these models deal with issues like attribution and license compliance?
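One widely studied way to “incorporate new information as needed,” without retraining at all, is retrieval augmentation: index current documents, retrieve the most relevant ones at query time, and place them in the model’s context. Below is a minimal sketch, assuming the sentence-transformers library and a small in-memory list of recent documents; the document texts, model choice, and prompt format are all hypothetical.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Recent documents the base model has never seen (hypothetical examples).
documents = [
    "Summary of this week's news most likely to be on patients' minds...",
    "Guidance published after the model's training data was collected...",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = encoder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

context = "\n".join(retrieve("What current events might be worrying patients?"))
prompt = f"Context:\n{context}\n\nUsing the context above, respond to the patient:\n..."
# `prompt` would then be sent to the specialized language model.
```

A side benefit is that retrieval makes provenance explicit: the system knows which sources an answer drew on, which is at least a starting point for the attribution and licensing questions above.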

Finally, projects like Gato don’t help us understand how AI systems should collaborate with humans. Rather than just building bigger models, researchers and entrepreneurs need to be exploring different kinds of interaction between humans and AI. That question is out of scope for Gato, but it is something we need to address regardless of whether the future of artificial intelligence is general or narrow but deep. Most of our current AI systems are oracles: you give them a prompt, they produce an output. Correct or incorrect, you get what you get; take it or leave it. Oracle interactions don’t take advantage of human expertise, and risk wasting human time on “obvious” answers, where the human says “I already know that; I don’t need an AI to tell me.”

There are some exceptions to the oracle model. Copilot places its suggestion in your code editor, and changes you make can be fed back into the engine to improve future suggestions. Midjourney, a platform for AI-generated art that is currently in closed beta, also incorporates a feedback loop.

In the next few years, we will inevitably rely more and more on machine learning and artificial intelligence. If that reliance is going to be productive, we will need a lot from AI: richer interactions between humans and machines, a better understanding of how to train specialized models, the ability to distinguish between correlations and facts, and that’s only a start. Products like Copilot and O’Reilly Answers give a glimpse of what’s possible, but they’re only the first steps. AI has made dramatic progress in the last decade, but we won’t get the products we want and need merely by scaling. We need to learn to think differently.
