This is a very interesting hypothesis, and it may well be true for living beings. What I disagree with is the claim that an animal-like body is necessary for forming a world model. A simulation could be sufficient, and there is already work on that front. (Also, I would not characterize deep-learning-based AI as trying to form propositional knowledge. In fact, its strong performance partly stems from not dealing with propositional knowledge directly.)
If a body is in fact necessary, PaLM-E could be paving the way toward that as well: https://ai.googleblog.com/2023/03/palm-e-embodied-multimodal...