I honestly can't say with certainty whether training on videos alone, with whatever tokenization scheme they're using, will ever get perfect dynamics.
However, it is worth noting that transformers can do a pretty good job of learning dynamics with the right (non-video) pipeline: https://arxiv.org/pdf/2605.15305 https://arxiv.org/pdf/2605.09196
My point being that, representationally, it might be possible to learn good dynamics without a radically different approach/arch. There are already models that extract 3D tracking points from videos, so they could possibly be leveraged for learning dynamics (which in itself sets a precedent for end-to-end approaches possibly working too).
* You could instruct your LLM to interact with a simulator to run experiments and infer behaviour
* You could edit the transformer model and inject spatially relevant data rather than text, as is done in the paper above
* You could change the architecture to be more conducive to representing a world state, e.g. LeCun's JEPA world model.
* You could further enhance some of the above by using a differentiable physics engine (e.g. NVIDIA Newton) to calculate losses directly.
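To make the last bullet a bit more concrete, here is a toy sketch of what "calculating losses directly through the physics" means. It is pure Python, with the gradient carried through the integrator by hand. Everything in it (the 1D point mass, the function names, the step size, the learning rate) is made up for illustration; it is not the API of Newton or any other real engine, which would get these gradients from its own autodiff machinery.

```python
# Toy "differentiable simulator": the state and its derivative with respect to
# one parameter (the initial velocity v0) are integrated side by side.
# All names and constants are hypothetical, chosen only for illustration.

G = -9.81    # gravity, m/s^2
DT = 0.01    # step size, s
STEPS = 100  # simulate 1 second


def simulate(v0):
    """Semi-implicit Euler for a 1D point mass.

    Returns (final height, d(final height)/d(v0)).
    """
    y, v = 0.0, v0
    dy, dv = 0.0, 1.0  # sensitivities of y and v with respect to v0
    for _ in range(STEPS):
        v += G * DT    # gravity does not depend on v0, so dv stays unchanged
        y += v * DT
        dy += dv * DT  # chain rule through the position update
    return y, dy


def loss_and_grad(v0, target):
    """Squared error between the simulated final height and a target height."""
    y, dy = simulate(v0)
    err = y - target
    return err * err, 2.0 * err * dy


def fit_initial_velocity(target, v0=0.0, lr=0.4, iters=50):
    """Plain gradient descent on v0, using the gradient through the simulator."""
    for _ in range(iters):
        _, grad = loss_and_grad(v0, target)
        v0 -= lr * grad
    return v0


if __name__ == "__main__":
    v0 = fit_initial_velocity(target=2.0)
    print("fitted v0:", v0, "final height:", simulate(v0)[0])
```

In a learned model, the parameter being optimised would be the network weights rather than `v0`, but the idea is the same: the loss is defined on the simulated outcome and the gradient flows back through the physics steps.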
But at the end of the day, if a model is to have any hope of always producing realistic physics, it HAS to learn the laws of nature in some form or other. It looks to me like the next big leap could come from combining the last two approaches.
P.S.: I like discussing such topics. If anyone knows a forum or discord with like-minded people, please let me know :)
I’ve often thought it would be very handy to have a proper simulator for being able to simulate and identify inefficiencies in one’s technique, but no idea whether it would be feasible to do.
Proper simulators for that exist; you essentially need an engine with a compliant contact model. MuJoCo is the go-to here, see:
https://mujoco.readthedocs.io/en/stable/modeling.html#muscle... https://mujoco.readthedocs.io/en/stable/computation/fluid.ht...
These explicitly model biological muscles. IIRC it was originally created to model human hands (I could be misremembering though).
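For anyone wondering what "compliant contact model" means in practice, here is a minimal penalty-based sketch: the ground is treated as a spring-damper that pushes back in proportion to penetration. To be clear, this is not how MuJoCo does it (its soft-contact solver is considerably more sophisticated), and the constants and function names are invented for the example.

```python
# 1D ball dropped onto a compliant (spring-damper) ground.
# Illustrative constants only; not taken from any real engine.

MASS = 1.0       # kg
G = -9.81        # gravity, m/s^2
STIFFNESS = 1e4  # ground spring constant, N/m
DAMPING = 50.0   # ground damping, N*s/m
DT = 1e-4        # step size, s


def contact_force(y, v):
    """Penalty force, nonzero only while the ball penetrates the ground (y < 0)."""
    if y >= 0.0:
        return 0.0
    force = -STIFFNESS * y - DAMPING * v
    return max(force, 0.0)  # the ground can push, never pull


def drop(height, duration):
    """Semi-implicit Euler integration; returns the list of heights over time."""
    y, v = height, 0.0
    trajectory = []
    for _ in range(int(duration / DT)):
        a = G + contact_force(y, v) / MASS
        v += a * DT
        y += v * DT
        trajectory.append(y)
    return trajectory


if __name__ == "__main__":
    traj = drop(height=0.5, duration=3.0)
    print("deepest penetration:", min(traj))
    print("resting height:", traj[-1])
```

The ball sinks slightly into the ground and comes to rest where the spring force balances gravity, at roughly `MASS * G / STIFFNESS`. That small, tunable penetration is the "compliance"; a hard-contact engine would instead resolve the collision with impulses or constraints.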
Really depends on the fidelity you want.
Edit: I also work in rigid body simulation for robotics.
Robotics folks probably want speed and accuracy. I'm from the video game industry so I generally look for speed and stability.
Note: This is a loose analogy, and recent techniques are already blurring the lines between these axes.
We were sharing game clips with each other and after a while realised our old clips were just gone, being deleted after 30 or 90 days or something.