Thank you for the comment. I don't disagree that many useful world models can be created from language alone. (But I'd still be cautious about reading too much into current model capabilities. I have played around with the spatial/directions examples in pre-trained models, and I don't think they are that robust; I have many test cases showing that they don't do consistent spatial reasoning.)
Also, my comment was about practical impossibility, not theoretical impossibility. The Lempel-Ziv algorithm is a good language model: it can asymptotically approach any other language model's performance, and we can even prove that asymptotic optimality. But in practice the convergence rate is just too slow. If we poked around inside LZ, we might find evidence of world models there too. And since we have transformers, which are more efficient learners, we won't be attempting to scale up LZ. So while I consider the implicit world models in transformers interesting, I don't consider them evidence that all world models can be learned efficiently through language.

I think we'll come up with new multi-modal architectures that are more efficient learners and act as grounding constraints for language. If approach A is "get everything into language form" and approach B is "multi-modal grounded language model", I think B will eventually overtake A before A can completely solve the problem. And once B overtakes A, we won't follow approach A. (And of course, "plannability" of the world models is another aspect in itself, which hopefully will also get tackled in new architectures.)
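To make the LZ-as-a-language-model point concrete, here is a toy sketch (Python, using the standard-library zlib, whose DEFLATE format is built on LZ77, as a stand-in for Lempel-Ziv; the strings and function names are just for illustration). A compressor that encodes a string x in C(x) bits implicitly assigns it probability roughly 2^(-C(x)), so the incremental compressed length of a continuation behaves like a negative log-likelihood:

```python
import zlib

def compressed_bits(text: str) -> int:
    """Bits used by zlib (DEFLATE, an LZ77 variant) to encode the text."""
    return 8 * len(zlib.compress(text.encode("utf-8"), 9))

def score(context: str, continuation: str) -> int:
    """Extra bits needed to encode the continuation after the context.
    Lower score = the compressor 'predicts' this continuation better."""
    return compressed_bits(context + continuation) - compressed_bits(context)

context = "the cat sat on the mat. the cat sat on the "
for candidate in ["mat.", "xylophone."]:
    print(candidate, score(context, candidate))

# The repeated phrase compresses into a back-reference, so "mat." typically
# costs fewer extra bits than "xylophone.": LZ is doing (very weak)
# next-token prediction. The asymptotic optimality is real, but the amount
# of data needed before these scores become sharp is enormous, which is the
# practical-rate problem described above.
```

Nobody would scale this up, of course; the point is only that "is a language model" and "is an efficient learner" come apart.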