Ingredients of understanding
Thoughts on how human understanding is different from LLM "understanding"
Consider this natural language prompt:
I removed both wheels of my bicycle. How do I make it stand upright on the floor?[1]
How did you, as a human, answer this question? Most likely you imagined what the bike would look like with both wheels removed. You then imagined balancing this bike and realized that it could be supported on the floor by the fork and the crank case. Or maybe you concluded differently, despite following a similar process.
Where did the knowledge for running this imagination — mental simulation — come from? Did you acquire it from reading about balancing bike frames, or did it come from your sensorimotor experience? For most, this knowledge is acquired through sensorimotor experience in the real world. Even for a person who has never mastered language, their experience with physical objects combined with their knowledge of bike geometry is sufficient to run this mental simulation.
The process a human goes through in processing a natural language prompt is very different from the process a large language model (LLM) goes through. Here are some salient points about how language is understood in humans:
Language understanding involves mental simulations on a world-model.
This world-model cannot be acquired from language alone.
Language is a mechanism to control mental simulations and world models in other humans. Internal monologue is a special case of this.
As shown in the figure below, language is a thin layer that indexes into the sensorimotor simulators that constitute the majority of our world model. A good chunk of this world model can be learned without language, although language can definitely help. Thinking and imagining often require coordination of the linguistic and non-linguistic simulators. Of course, some questions can be answered quickly and correctly purely in the language system, without having to ‘descend’ into the sensorimotor simulators.
Mental simulations are situated in sensorimotor context.
In the bike-balancing example, the prompt was purely in language, and you could run mental simulations with your eyes closed. But in general, mental simulations need to absorb the current sensorimotor context to run the correct simulation.
I have a favorite example for how language simulations are situated, thanks to my colleague Felix Hill. If you hear the sentence “The haystack was important because the cloth ripped”, it might not make sense to you at all. But if you hear that sentence in the context of the picture at the end of this section, it will make immediate sense even though there is no haystack in the picture.
If you were to find yourself in that unfortunate cloth-ripped situation, you would want to be running contextually appropriate mental simulations that combine perception, sensorimotor experience, and conceptual knowledge. If you were lucky enough to have a lake as another option, your acquired-via-language knowledge about crocodiles in that lake might temper your enthusiasm. Thankfully, these kinds of mental simulations are always happening in our brains to help make decisions and drive behavior, with or without language.
A rich world model cannot be acquired from language alone
It is not practical to acquire a rich, human-like world model through language alone because it is not possible to convert all the sensorimotor details into language efficiently[2].
Imagine accidentally dropping an object you were holding. Where you’d look for that object depends on the particulars of the physical context you were in. What was the geometry of the objects around you? Could the object bounce? Could it roll off and fall through a crack? It is impossible to describe the scene in all its detail in language, because an a priori unimportant detail could become crucial in this particular context — maybe the object was light and could be blown away by the wind coming through an open window. It is not possible to convey just the relevant details either, because deciding what is relevant itself requires a contextually appropriate mental simulation. Without the proper context, an LLM’s answer — look for the object on the floor — is too generic and ungrounded.
While it is quite interesting that transformers seem to acquire implicit world models for games like Othello from text alone, that should not be taken as evidence that language-only models can get to human-like rich and dynamic world models.
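For readers curious about how such claims are established, the sketch below shows the rough shape of the probing methodology: a simple classifier is trained to read out board state from a sequence model’s hidden activations. The arrays and names here are my own random stand-ins; in the actual experiments the activations come from a transformer trained on Othello move text, and probe accuracy well above chance is what gets reported as evidence of an implicit world model.

```python
# Sketch of the board-state probing setup (illustrative; data is a random stand-in).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_positions, d_model = 2000, 64

# Hypothetical hidden activations, one vector per move position in the game text.
hidden = rng.normal(size=(n_positions, d_model))
# Hypothetical ground-truth contents of one board square: 0=empty, 1=black, 2=white.
square_state = rng.integers(0, 3, size=n_positions)

# Train a probe on part of the data, test on the rest.
probe = LogisticRegression(max_iter=1000).fit(hidden[:1500], square_state[:1500])
print("probe accuracy:", probe.score(hidden[1500:], square_state[1500:]))
# On random stand-ins this hovers near chance; on real activations, accuracy well
# above chance is interpreted as the model encoding the board internally.
```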
Real-world commonsense is not a language-only problem.
Winograd schemas — language problems that humans typically solve by imagining the physical objects involved — were originally created as tests for similar capabilities in language models. (A classic example: “The trophy doesn’t fit in the suitcase because it is too big. What is too big?”) LLMs now successfully solve many Winograd schemas, and this has led some people to think that the commonsense problem is largely on the path to being solved using language alone. That word-to-word coherence can solve some commonsense queries may have been a surprise to many, but that surprise doesn’t justify the conclusion that word-to-word coherence will solve all of commonsense.
In general, it is painfully laborious to convert physical commonsense scenarios into natural language. Winograd schemas were clever and popular because they were examples of commonsense questions that were easy to pose in language without appearing too contrived. The lesson to be taken away from this is not that LLMs have solved commonsense — the lesson might be that commonsense questions that are easy to pose in language might also be “solved” using language alone.
It is not that new commonsense language queries that defeat language models do not exist — they do. It is just that those queries will look increasingly contrived when expressed in language. This is not because the scenarios themselves are contrived or infrequent — it is just unusual for humans to express such scenarios in language.
Sensorimotor inputs are not language.
In a simplified sense, language is compressed code that indexes into a sensorimotor codebook that is shared between the sender and the receiver. Unlike a typical Shannon-like communication system, human language has the additional complexities of feedback (the receiver can ask questions) and adaptivity (the codebook itself can change based on prior transmissions). Treating sensory input as “language” doesn’t achieve anything, because if that sensory information is a compressible, shared experience among multiple agents, then a shared codebook and a new language will be formed on top of it.
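To make the codebook analogy concrete, here is a toy sketch (my own illustration, not a claim about how language actually works): the sender compresses a high-dimensional “sensory” vector down to a single index into a codebook both parties share, and the receiver reconstructs an approximation from that index. The feedback and adaptivity mentioned above are deliberately left out.

```python
# Toy "shared codebook" communication sketch (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
# 256 shared "concepts", each standing in for a chunk of common sensorimotor experience.
codebook = rng.normal(size=(256, 32))

def encode(sensory_vec: np.ndarray) -> int:
    """Sender: replace a rich experience with the index of the nearest shared concept."""
    return int(np.argmin(np.linalg.norm(codebook - sensory_vec, axis=1)))

def decode(index: int) -> np.ndarray:
    """Receiver: expand the index back into their own copy of that concept."""
    return codebook[index]

experience = rng.normal(size=32)        # what the sender actually experienced
message = encode(experience)            # what travels between agents: just a small index
reconstruction = decode(message)        # the receiver's simulation, not the original experience
print(message, float(np.linalg.norm(experience - reconstruction)))
```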
Minimal machinery for understanding
Here’s a list of what I think is the minimal set of ingredients for the machinery of understanding.
Ability to construct rich sensorimotor world models through observations and interactions, and ability to query them in context-appropriate ways. This world model should be: 1) planning compatible, 2) causally structured, 3) rapidly modifiable, and 4) able to support counterfactual simulations. (See the sketch after this list.)
Ability to modify these world models by thinking.
Ability to seek information based on current models and uncertainty, both for modifying the models and for making decisions.
Ability to generate a hypothesis based on the world model and to test that out in the real world.
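To make the first ingredient’s desiderata slightly more concrete, here is a minimal interface sketch. The class and method names are hypothetical, chosen only to mirror the four properties listed above; the bodies are deliberately left as stubs rather than a proposed implementation.

```python
# Hypothetical world-model interface mirroring the four desiderata (illustrative only).
from typing import Any, Callable, Dict, Sequence


class WorldModel:
    def update(self, observation: Any) -> None:
        """Rapidly modifiable: revise the model from a new observation, or from thinking alone."""
        ...

    def simulate(self, actions: Sequence[Any]) -> Dict[str, Any]:
        """Planning compatible: roll the model forward under a candidate action sequence."""
        ...

    def intervene(self, variable: str, value: Any) -> "WorldModel":
        """Causally structured: set a variable directly and propagate its downstream effects."""
        ...

    def counterfactual(self, intervention: Dict[str, Any],
                       query: Callable[[Dict[str, Any]], Any]) -> Any:
        """Counterfactual simulation: answer 'what would have happened if' queries."""
        ...
```

The remaining ingredients (thinking, information seeking, hypothesis testing) would then be agent-level loops that read from and write to such a model.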
Of course, language amplifies the effects of this core machinery by helping us rapidly acquire knowledge generated by other humans over the years and rapidly share any new knowledge or understanding we develop through the exercise of this machinery. Note that language itself was a product of agents with this machinery interacting with each other. My view is that understanding preceded language[3].
The idea that humans share an understanding machinery doesn't contradict the fact that different people can have different levels of understanding of the same concept. A professional mathematician’s understanding of the concept of a ‘vector’ is richer than that of an average high schooler, purely because the mathematician has applied the understanding machinery to the concept of a vector in a multitude of contexts to build a richer model. The high schooler can still get to the same level of understanding by going through a similar process — but a language-only model would not develop human-like world models no matter how much text it reads.
Tests designed to probe the mastery of a subject in humans assume that this understanding machinery and process is shared. When we describe a child as having understood something, we assume that the child utilized their understanding machinery and went through a process of model-building and simulation. Understanding is both the current state of knowledge and the process one needs to go through to reach and update that knowledge. In humans, both of these are interconnected, and we take this for granted in our interactions with other humans and in our tests of their understanding.
Human-like understanding is worth understanding
Of course, one could argue that an LLM has an understanding that is superior to that of humans and that we therefore should not care about human-like understanding. But that could also be like settling for balloon flight as an alternative to heavier-than-air flight, as I argued in a previous article. If human-like understanding is fundamentally different, it is worth knowing why and how, both as a scientific puzzle and as a challenge for building smarter machines. My hope is that we continue investigating until we really understand what constitutes understanding.
Further Reading
My Twitter thread on the MalayaLLM thought experiment.
Barsalou’s Perceptual Symbol Systems paper.
Our work on cognitive programs: An example of bringing perceptual simulations into abstract concepts.
More about commonsense, general intelligence and the brain: From CAPTCHA to commonsense
Footnotes
[1] In case you were wondering how GPT-4 answers this question.
[2] This is not a question of whether it can be done in theory in the infinite-token limit — it just cannot be done in practice because it is not efficient. Moreover, it will not be done in practice because multi-modal systems will prove to be better than language-only systems.
[3] See also: https://aeon.co/essays/imagination-is-such-an-ancient-ability-it-might-precede-language