Yes, LLMs do not have the human-level capacity to produce and utilize mental models. And it seems likely that if AGI is to be developed, it will have to include such a capacity for mental modelling.
How might such a capacity be installed in AI? A recent paper of mine, published in the journal Biosystems, outlines how mental modelling evolved and how it develops in humans. This understanding can offer insights into how the capacity could be incorporated into AI. The paper, titled "The Evolution and Development of Consciousness: The Subject-Object Emergence Hypothesis", is freely available here: https://www.sciencedirect.com/science/article/pii/S0303264722000752
Thank you for sharing; the evolutionary path is quite interesting to me.
Awesome article! My question: do abstract concepts and reasoning share more structure with the sensorimotor domain or the language domain?
Thank you. My conjecture (based on some neuroscience evidence) is that reasoning infrastructure is shared between the language and sensorimotor domains. Maybe the language domain gets a few additional tweaks on top of the sensorimotor reasoning. Even everyday sensorimotor behavior involves reasoning, some of it unconscious. One distinguishing property of reasoning in the language domain might be that it is always conscious.
I agree with this, although my neuroscience is below elementary. I was reading Alexander Luria's book, where he says that external tools like writing and multiplication tables tie together various functional regions of the brain; language basically knots together various sensorimotor brain regions and shares that with others as well. How transformers can do that across vision and language in reliable ways is unknown; maybe a 100-trillion-parameter GPT will tell us.
Interesting post. I agree with your general framing: people constantly make reference to mental models that are developed in large part through physical experiences, not acquired from language.
However, I believe that many useful mental models of the physical world CAN be constructed from language alone. Assertions that this is impossible typically seem to be based on thinking of the form, "Well, *I* don't see how it can be done, so it must be impossible..." That line of reasoning seems inadequate to me, and I hope you won't fall prey to it!
More compelling to me are experiments with toy models trained on (somewhat contrived) language, in which one can extract the learned mental model from the model parameters and validate it. Here are a few examples:
- Given a list of statements of the form "San Francisco is west of Reno" that give the position of one US city relative to another, a simple model generates a reasonably accurate map, which can be extracted from the model parameters and shown graphically (a minimal sketch of this idea appears after this list). This seems to be exactly the sort of mental model that people use to reason about the physical world.
- Extending this example, if the model is then given statements of the form "Fargo is in North Dakota", it also learns state boundaries. After the city positions and boundaries are learned, the model can quite accurately guess which state contains a city not mentioned in this second-stage training data. Again, this "mental model" map can be extracted and displayed graphically. A human might acquire such a model by walking, driving, or looking at a map, but language alone is actually sufficient.
- Given a list of simple arithmetic assertions, even a toy language model develops algorithms to perform basic arithmetic. In some cases, these algorithms are comprehensible enough to extract and explain. This works even though the model starts with no a priori notion of quantity or addition, which a human might acquire through physical experience, e.g. seeing a group of two sheep join a group of three sheep.
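To make the map example concrete, here is a minimal sketch of the same idea (my own illustration, not the experiments described above: it optimizes 2D city coordinates directly from relational statements, rather than training a language model and reading the map out of its weights):

```python
# Minimal sketch: recover a 2D "mental map" purely from relative-direction
# statements, via gradient descent on a hinge loss over learnable coordinates.
# (Hypothetical toy data; the experiments referenced above instead train a
# language model and extract the map from its parameters.)
import numpy as np

statements = [
    ("San Francisco", "west", "Reno"),
    ("Reno", "west", "Salt Lake City"),
    ("Salt Lake City", "west", "Denver"),
    ("San Francisco", "south", "Portland"),
    ("Denver", "north", "Albuquerque"),
]

cities = sorted({c for a, _, b in statements for c in (a, b)})
idx = {c: i for i, c in enumerate(cities)}
rng = np.random.default_rng(0)
coords = rng.normal(size=(len(cities), 2))  # (x, y) per city: the "model parameters"

# Axis and sign per relation: "a west of b" means x_a + MARGIN < x_b, etc.
RELATIONS = {"west": (0, +1), "east": (0, -1), "south": (1, +1), "north": (1, -1)}
MARGIN, LR = 1.0, 0.05

for _ in range(2000):
    grad = np.zeros_like(coords)
    for a, rel, b in statements:
        i, j = idx[a], idx[b]
        axis, sign = RELATIONS[rel]
        # Hinge loss: only push coordinates when the stated relation is violated.
        if MARGIN - sign * (coords[j, axis] - coords[i, axis]) > 0:
            grad[i, axis] += sign
            grad[j, axis] -= sign
    coords -= LR * grad

# The learned "mental map" can be read straight out of the parameters.
for c in cities:
    print(f"{c:15s} x={coords[idx[c], 0]:+.2f}  y={coords[idx[c], 1]:+.2f}")
```

Run on these few statements, it should recover the west-to-east ordering of the cities and the stated north-south relations, which is all the data constrains.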
In summary, I believe that learning mental models of the physical world from language alone is MUCH HARDER than learning from language together with physical experience, but I've seen no evidence that this is impossible. To the contrary, available evidence proves this is possible in some nontrivial cases, and I expect many more examples will follow.
Thank you for the comment. I don't disagree that many useful world models can be created from language alone. (But I'd still be cautious about reading too much into current model capabilities. I have played around with the spatial/directions examples in pre-trained models, and I don't think they are that robust; I have many test cases showing that they don't have consistent spatial reasoning.)
Also, my comment was about practical impossibility, not theoretical impossibility. The Lempel-Ziv algorithm is a good language model: it can asymptotically approach any other language model's performance, and we can even prove its asymptotic optimality. But in practice the convergence rate is just too slow. If we poked around inside LZ, we might see evidence of world models in it too. And since we have transformers, which are more efficient learners, we won't be attempting to scale up LZ. So while I consider the implicit world models in transformers to be interesting, I don't consider them evidence that all world models can be learned efficiently through language.

I think we'll come up with new multi-modal architectures that are more efficient learners and that act as grounding constraints for language. If approach A is "get everything into language form" and approach B is "multi-modal grounded language model", I think B will eventually overtake A before A can completely solve the problem. And once B overtakes A, we won't keep following approach A. (And of course, "plannability" of the world models is another aspect in itself, which will also hopefully be tackled in new architectures.)
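To make the Lempel-Ziv point concrete, here is a toy sketch of my own (using Python's zlib, whose DEFLATE algorithm is LZ77-based) of how any compressor already defines an implicit next-token scorer: the extra compressed length a continuation adds to the context stands in for its negative log-probability. It works, but only by exploiting literal repeats it has already seen, which is exactly the slow learning rate being described:

```python
# Toy illustration: a compressor as an implicit language model. The extra
# compressed length of a continuation (given the context) plays the role of
# -log p(continuation | context); shorter means "more probable".
import zlib

def extra_bits(context: str, continuation: str) -> int:
    """Increase in compressed length (bits) when continuation follows context."""
    base = len(zlib.compress(context.encode("utf-8"), level=9))
    full = len(zlib.compress((context + continuation).encode("utf-8"), level=9))
    return 8 * (full - base)

context = "the cat sat on the mat. " * 8 + "the cat sat on the "
for candidate in ["mat.", "sofa", "zqxv"]:
    print(f"{candidate!r}: {extra_bits(context, candidate)} extra bits")

# A continuation that reuses the repeated phrase should typically compress
# cheapest, while novel strings cost more. The "world model" here is nothing
# but a dictionary of seen substrings, which is why scaling LZ up is hopeless
# in practice even though it is asymptotically optimal.
```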
Hi, thanks for continuing to share.
The real question is not so much whether an LLM trained on text acquires the same world model as humans, but rather whether the generic prediction capabilities of LLMs can also work with sensory-motor data streams besides text.
Experiments with transformers integrating image data with text look promising in this regard.
To fully answer the question, the challenge would lie in designing/engineering the "correct" encoding of such streams and the (simulated, I guess) body-world-perception-action "experiencing" to feed an LLM (see the toy sketch at the end of this comment).
Even your work on short-sequence world models suggests that when everything (not only words) is encoded (and remembered) as a set of short sequences ("perceptual phrases", if you like), the agent has to learn the "stitching" or proximity relationships among those sequences in order to build a convincing world model.
So I wouldn't yet bet against LLMs' ability to build a world model, provided they are fed the right training data.
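For what it's worth, here is one toy sketch (entirely my own illustration, with made-up channel names) of the kind of encoding step being discussed: uniformly binning each continuous sensorimotor channel into its own slice of a token vocabulary, so that a body-world-perception-action stream becomes an ordinary token sequence a next-token predictor could consume.

```python
# Toy sketch: discretize a continuous sensorimotor stream into tokens.
# Each channel gets its own slice of the vocabulary (channel * N_BINS + bin),
# so one timestep becomes a short "perceptual phrase" of tokens.
import numpy as np

N_BINS = 16
CHANNELS = ["joint_angle", "touch", "motor_cmd"]  # hypothetical channels

def tokenize_step(reading, low=-1.0, high=1.0):
    """Map one timestep of per-channel readings in [low, high] to token ids."""
    clipped = np.clip(reading, low, high)
    bins = np.minimum(((clipped - low) / (high - low) * N_BINS).astype(int), N_BINS - 1)
    return [int(c * N_BINS + b) for c, b in enumerate(bins)]

rng = np.random.default_rng(0)
trajectory = np.tanh(rng.normal(size=(5, len(CHANNELS))))  # fake 5-step trajectory
token_stream = [tok for step in trajectory for tok in tokenize_step(step)]
print(token_stream)  # a flat "sentence" of sensorimotor tokens for a sequence model
```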
What you mean by LLM is a transformer. Whether a multimodal transformer is sufficient is an entirely different question. My current thinking is that it is not.
Great article! Thank you!
You may add as examples hearing a recipe and then following it, or hearing how to reach a distant place and then reaching it; the mental models at the beginning and at the end will be different.
If you add primitives related to each facet of sensorimotor activities to my model (https://ling.auf.net/lingbuzz/007345 and https://alexandernaumenko.substack.com/) it could be a good starting point or a boost for your research. I would love to join!
Thanks for sharing. It looks quite interesting; I'll have a more detailed look later.
I think the recent trend of combining LLMs with external tools, like calculators and simulators, is a step towards building the pyramid you drew, but starting from the top. This is the opposite of the order of evolution, which started with locomotion, then sensation, then planning, and finally communication.
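As a minimal sketch of that tool-use pattern (a hypothetical text protocol with a stubbed model, not any particular framework): the model emits a structured tool call, the runtime executes it, and the result is appended to the context before the model continues.

```python
# Toy sketch of the LLM-plus-calculator loop: messages starting with "CALC:"
# are routed to a small, safe arithmetic evaluator, and the result is fed back.
import ast
import operator

SAFE_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expr: str):
    """Evaluate a basic arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in SAFE_OPS:
            return SAFE_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def run_with_tools(model_step, prompt: str, max_turns: int = 4) -> str:
    """model_step: any callable mapping the transcript so far to the next message."""
    transcript = prompt
    message = ""
    for _ in range(max_turns):
        message = model_step(transcript)
        if message.startswith("CALC:"):
            result = calculator(message[len("CALC:"):].strip())
            transcript += f"\n{message}\nTOOL RESULT: {result}"
        else:
            return message
    return message

# Stub "model" so the sketch runs end to end with no LLM dependency.
def fake_model(transcript: str) -> str:
    return "CALC: 12*7 + 5" if "TOOL RESULT" not in transcript else "The answer is 89."

print(run_with_tools(fake_model, "What is 12*7 + 5?"))  # -> The answer is 89.
```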
Yes, and it will be much bigger than our puny minds can imagine.
All true, no doubt.
However, it seems that all human philosophers' work on AI lags behind the actual state of affairs.
Let me agree again: LLMs are, as the author describes, manipulation of the frames of reference of language only. The sensorimotor stuff is just not there.
But… LLMs are only a fraction, albeit a popular one, of the mind-boggling development of modern AI.
AI is not bound by language; it's a pattern-seeking technology. We have image and video generation, which work with a pixel FoR. We have 3D-modeling AI. We have AI for chemistry and for mathematics. In a way, this tech doesn't care whether we feed it language, visual, audio, or any other patterns. Actually, motorized, multi-PoV devices are coming.
Then, once we have a device that can truly experiment with whatever physical tools we give it, that's really it: it will observe and build FoR in a truly multi-modal fashion, and along each axis it will be far superior to human capacity.
So I think all the analysis that keeps coming up is a description of why this one specific part of the elephant is not "it". Well, yes. We will soon have so many parts…