Like for instance, think chess engines with AI, they can train themselves simply by playing many many games, the "world simulation" with those is the classic chess engine architecture but it uses the positional weights produced by the neural network, so says gemini anyways:
"ai chess engine architecture"
"Modern AI chess engines (e.g., Lc0, Stockfish) use a hybrid architecture combining deep neural networks for positional evaluation with advanced search algorithms like Monte-Carlo Tree Search (MCTS) or alpha-beta pruning. They feature three core components: a neural network (often CNN-based) that analyzes board patterns (matrices) to evaluate positions, a search engine that explores move possibilities, and a Universal Chess Interface (UCI) for communication."
So with no model of the world to play with, I'm thinking the chatbot llms can only go with probabilities or what matches the prompt best in the crazy dimensional thing that goes on inside the neural networks. If it had access to a simple world of cars and car washes, it could run a simulation and rank it appropriately, and also could possibly infer through either simulation or training from those simulations that if you are washing a car, the operation will fail if the car is not present. I really like this car wash trick question lol