
Hi, thanks for continuing to share.

The real question is not so much whether an LLM trained on text acquires the same world model as humans, but rather whether the generic prediction capabilities of LLMs can also work with sensory-motor data streams besides text.

Experiments with transformers integrating image data with text look promising in this regard.

To fully answer the question, the challenge would lie in designing/engineering the "correct" encoding of such streams and the (simulated, I guess) body-world-perception-action "experiencing" to feed an LLM.
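
Just to make the encoding idea concrete, here is a minimal sketch (everything in it, the bin counts, feature dimensions, and function names, is my own hypothetical illustration, not anything from the post): quantize a continuous sensory-motor stream into discrete tokens of the kind a transformer could consume. A real system would more likely use a learned codebook (e.g. VQ-VAE-style quantization) rather than uniform binning.

```python
import numpy as np

def quantize_stream(frames: np.ndarray, n_bins: int = 8) -> np.ndarray:
    """Map each continuous sensory-motor frame to a discrete token id.

    frames: array of shape (T, D) -- T timesteps of D-dimensional
    proprioceptive/visual features (hypothetical input).
    Returns an array of T token ids in [0, n_bins ** D).
    """
    # Crude per-dimension uniform binning.
    lo, hi = frames.min(axis=0), frames.max(axis=0)
    bins = np.clip(((frames - lo) / (hi - lo + 1e-8) * n_bins).astype(int),
                   0, n_bins - 1)
    # Collapse the per-dimension bins into a single token id per frame.
    return np.ravel_multi_index(bins.T, dims=(n_bins,) * frames.shape[1])

# Toy usage: a 3-D sensory-motor stream of 100 timesteps.
stream = np.random.randn(100, 3)
tokens = quantize_stream(stream)  # 8**3 = 512 possible tokens
print(tokens[:10])
```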

Even your work on short-sequence world models suggests that when everything (not only words) is encoded (and remembered) as a set of short sequences ("perceptual phrases", if you like), the agent has to learn the "stitching" or proximity relationships between those sequences in order to build a convincing world model.
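
To spell out what I mean by "stitching", here is a toy sketch (the phrase length and the co-occurrence notion of proximity are my assumptions, not a claim about your model): chop a token stream into short "perceptual phrases" and count which phrases tend to follow one another, which is the kind of adjacency structure a world model would have to capture.

```python
import random
from collections import Counter, defaultdict

def to_phrases(tokens, phrase_len=4):
    """Chop a token stream into consecutive short 'perceptual phrases'."""
    return [tuple(tokens[i:i + phrase_len])
            for i in range(0, len(tokens) - phrase_len + 1, phrase_len)]

def phrase_transitions(phrases):
    """Count which phrase follows which: a crude proximity/'stitching' table."""
    table = defaultdict(Counter)
    for a, b in zip(phrases, phrases[1:]):
        table[a][b] += 1
    return table

# Toy usage: a random token stream standing in for encoded sensory-motor data.
tokens = [random.randint(0, 7) for _ in range(200)]
phrases = to_phrases(tokens)
stitching = phrase_transitions(phrases)
print(stitching[phrases[0]].most_common(1))  # most frequent successor of the first phrase
```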

So I wouldn't yet rule out LLMs' ability to build a world model, provided they are fed the right training data.


What you mean by LLM is a transformer. Whether a multimodal transformer is sufficient is an entirely different question. My current thinking is that it is not.
