Hi, thanks for continuing to share.
The real question is not so much whether an LLM trained on text acquires the same world model as humans, but rather whether the generic prediction capabilities of LLMs can also work with sensory-motor data streams besides text.
Experiments with transformers integrating image data with text look promising in this regard.
To fully answer the question, the challenge would lie in designing and engineering the "correct" encoding of such streams, and the (simulated, I guess) body-world-perception-action "experiencing" to feed an LLM.
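To make "encoding such streams" a bit more concrete, here is a minimal toy sketch of one possible scheme (entirely my own hypothetical choices: uniform binning, a normalized value range, a step-boundary token), just to show how a continuous perception-action stream could become a token sequence a standard next-token transformer could predict:

```python
import numpy as np

# Toy sketch: quantize a continuous perception-action stream into discrete
# tokens by uniform binning, so a next-token transformer could consume it.
N_BINS = 256           # hypothetical per-channel vocabulary size
LOW, HIGH = -1.0, 1.0  # assumed normalized range of sensor/actuator values

def encode_step(observation, action):
    """Quantize one timestep of (observation, action) into integer tokens."""
    values = np.concatenate([observation, action])
    clipped = np.clip(values, LOW, HIGH)
    bins = ((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(int)
    return bins.tolist()

def encode_stream(steps):
    """Flatten a sequence of timesteps into one token stream,
    using token id N_BINS as a step-boundary marker."""
    tokens = []
    for obs, act in steps:
        tokens.extend(encode_step(obs, act))
        tokens.append(N_BINS)  # step boundary
    return tokens

# Example: two timesteps of a 3-D observation and a 1-D action
stream = [(np.array([0.1, -0.4, 0.9]), np.array([0.2])),
          (np.array([0.15, -0.35, 0.8]), np.array([-0.1]))]
print(encode_stream(stream))
```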
Even your work on short-sequence world models suggests that when everything (not only words) is encoded (and remembered) as a set of short sequences ("perceptual phrases", if you like), the agent has to learn the sequences' "stitching", or proximity relationships, in order to build a convincing world model.
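By "stitching" I mean something like the following toy illustration (a hypothetical sketch of my own, not a reconstruction of your model): link one short phrase to another when the first phrase's last state lies close to the second phrase's first state, so that short pieces can compose into a larger map.

```python
import numpy as np
from itertools import product

def stitch(phrases, threshold=0.2):
    """Return directed edges (i, j) meaning phrase i can be followed by
    phrase j, based on endpoint proximity in state space."""
    edges = []
    for (i, a), (j, b) in product(enumerate(phrases), repeat=2):
        if i != j and np.linalg.norm(a[-1] - b[0]) < threshold:
            edges.append((i, j))
    return edges

# Three short "perceptual phrases" in a 2-D state space;
# phrase 0 ends roughly where phrase 1 begins.
phrases = [
    np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]),
    np.array([[1.0, 0.1], [1.0, 0.6], [1.0, 1.1]]),
    np.array([[3.0, 3.0], [3.5, 3.0]]),
]
print(stitch(phrases))  # -> [(0, 1)]
```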
So I wouldn't yet rule out the ability of LLMs to build a world model, if they are fed the right training data.
What you mean by LLM is a transformer. Whether a multimodal transformer is sufficient is an entirely different question. My current thinking is that it is not.