Will future deep-learning models, with more parameters and trained on more examples, avoid these blunders, so that pointing them out is just moving the goalposts? Either way, what’s at stake?
It depends very much on the question. There’s the cognitive science question of whether humans think and speak the way GPT-3 and other deep-learning neural network models do. And there’s the engineering question of whether the way to develop better, humanlike AI is to upscale deep learning models (as opposed to incorporating different mechanisms, like a knowledge database and propositional reasoning).
The questions are, to be sure, related: If a model is incapable of duplicating a human feat like language understanding, it can’t be a good theory of how the human mind works. Conversely, if a model flubs some task that humans can ace, perhaps it’s because it’s missing some mechanism that powers the human mind. Still, they’re not the same question: As with airplanes and other machines, an artificial system can duplicate or exceed a natural one but work in a different way.
Apropos the scientific question, I don’t see the Marcus-Davis challenges as benchmarks or long bets that they have to rest their case on. I see them as scientific probing of an empirical hypothesis, namely whether the human language capacity works like GPT-3. Its failures of common sense are one form of evidence that the answer is “no,” but there are others—for example, that it needs to be trained on half a trillion words, or about 10,000 years of continuous speech, whereas human children get pretty good after 3 years. Conversely, it needs no social and perceptual context to make sense of its training set, whereas children do (hearing children of deaf parents don’t learn spoken language from radio and TV). Another diagnostic is that baby-talk is very different from the output of a partially trained GPT. Also, humans can generalize their language skill to express their intentions across a wide range of social and environmental contexts, whereas GPT-3 is fundamentally a text extrapolator (a task, incidentally, which humans aren’t particularly good at). There are surely other empirical probes, limited only by scientific imagination, and it doesn’t make sense in science to set up a single benchmark for an empirical question once and for all. As we learn more about a phenomenon, and as new theories compete to explain it, we need to develop more sensitive instruments and more clever empirical tests. That’s what I see Marcus and Davis as doing.
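(A quick sanity check on that comparison, under my own assumption of a continuous speech rate of roughly 100 words per minute, which is not a figure from the letter:

$$\frac{5\times 10^{11}\ \text{words}}{100\ \tfrac{\text{words}}{\text{min}} \times 60 \times 24 \times 365\ \tfrac{\text{min}}{\text{yr}}} \;=\; \frac{5\times 10^{11}}{5.3\times 10^{7}} \;\approx\; 9{,}500\ \text{years},$$

which is indeed on the order of the 10,000 years cited above.)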
Regarding the second, engineering question of whether scaling up deep-learning models will “get us to Artificial General Intelligence”: I think the question is probably ill-conceived, because I think the concept of “general intelligence” is meaningless. (I’m not referring to the psychometric variable g, also called “general intelligence,” namely the principal component of correlated variation across IQ subtests. This is a variable that aggregates many contributors to the brain’s efficiency, such as cortical thickness and neural transmission speed, but it is not a mechanism, just as “horsepower” is a meaningful variable but doesn’t explain how cars move.) I find most characterizations of AGI to be either circular (such as “smarter than humans in every way,” begging the question of what “smarter” means) or mystical—a kind of omniscient, omnipotent, and clairvoyant power to solve any problem. No logician has ever outlined a normative model of what general intelligence would consist of, and even Turing swapped it out for the problem of fooling an observer, which spawned 70 years of unhelpful reminders of how easy it is to fool an observer.
If we do try to define “intelligence” in terms of mechanism rather than magic, it seems to me it would be something like “the ability to use information to attain a goal in an environment.” (“Use information” is shorthand for performing computations that embody laws that govern the world, namely logic, cause and effect, and statistical regularities. “Attain a goal” is shorthand for optimizing the attainment of multiple goals, since different goals trade off.) Specifying the goal is critical to any definition of intelligence: a given strategy in basketball will be intelligent if you’re trying to win a game and stupid if you’re trying to throw it. So is the environment: a given strategy can be smart under NBA rules and stupid under college rules.
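To make the goal-relativity concrete, here is a minimal sketch (my own illustration, not anything from the letter) in which the very same strategy scores as intelligent under one exogenously supplied goal and as stupid under another:

```python
# Toy illustration: "intelligence" is relative to a goal and an environment.
# The policy below is fixed; only the goal it is scored against changes.

def take_open_shot(state: str) -> str:
    """A fixed basketball strategy: always take the open shot."""
    return "shoot"

def evaluate(policy, goal: str) -> int:
    """Score a policy against a goal in a stand-in environment
    where shooting tends to produce points."""
    points = 2 if policy("possession") == "shoot" else 0
    # The goal is supplied from outside the system (Hume and all that):
    return points if goal == "win" else -points

print(evaluate(take_open_shot, goal="win"))    #  2 -> "intelligent"
print(evaluate(take_open_shot, goal="throw"))  # -2 -> "stupid"
```

The same point holds for environments: swap in a different rule set (NBA versus college) and the ranking of strategies can flip again.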
Since a goal itself is neither intelligent nor unintelligent (Hume and all that), but must be exogenously built into a system, and since no physical system has clairvoyance for all the laws of the world it inhabits down to the last butterfly wing-flap, this implies that there are as many intelligences as there are goals and environments. There will be no omnipotent superintelligence or wonder algorithm (or singularity or AGI or existential threat or foom), just better and better gadgets.
In the case of humans, natural selection has built in multiple goals—comfort, pleasure, reputation, curiosity, power, status, the well-being of loved ones—which may trade off, and are sometimes randomized or inverted in game-theoretic paradoxical tactics. Not only does all this make psychology hard, but it makes human intelligence a dubious benchmark for artificial systems. Why would anyone want to emulate human intelligence in an artificial system (any more than a mechanical engineer would want to duplicate a human body, with all its fragility)? Why not build the best possible autonomous vehicle, or language translator, or dishwasher-emptier, or baby-sitter, or protein-folding predictor? And who cares whether the best autonomous vehicle driver would be, out of the box, a good baby-sitter? Only someone who thinks that intelligence is some all-powerful elixir.
Back to GPT-3, DALL-E, LaMDA, and other deep learning models: It seems to me that the question of whether or not they’re taking us closer to “Artificial General Intelligence” (or, heaven help us, “sentience”) is based not on any analysis of what AGI would consist of but on our being gobsmacked by what they can do. But refuting our intuitions about what a massively trained, massively parameterized network is capable of (and I’ll admit that they refuted mine) should not be confused with a path toward omniscience and omnipotence. GPT-3 is unquestionably awesome at its designed-in goal of extrapolating text. But that is not the main goal of human language competence, namely expressing and perceiving intentions. Indeed, the program is not even set up to input or output intentions, since that would require deep thought about how to represent intentions, which went out of style in AI as the big-data/deep-learning hammer turned every problem into a nail. That’s why no one is using GPT-3 to answer their email or write an article or legal brief (except to show how well the program can spoof one).
So is Scott Alexander right that every scaled-up GPT-n will avoid the blunders that Marcus and Davis show in GPT-(n-1)? Perhaps, though I doubt it, for reasons that Marcus and Davis explain well (in particular, that astronomical training sets at best compensate for their being crippled by the lack of a world model). But even if they do, that would show neither that human language competence is a GPT (given the totality of the relevant evidence) nor that GPT-n is approaching Artificial General Intelligence (whatever that is).
(This letter and what follows were originally published in Shtetl-Optimized.)