So when I say that the grok paper and the pong paper fundamentally agree, I have some idea of what I'm talking about.
It's just that the rules we feed into the model are extremely poorly defined, and we end up with a soup of disjoint rules smeared all across the weights.
This isn't a feature of the models. It's a feature of the training set.
Being shocked that you can store rules in floating point numbers is the same as being shocked that you can store rules in integers. It's been nearly a century since Gödel numbering was invented; we should be used to it by now.
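For anyone who hasn't seen it, here is a minimal sketch of the classic prime-power flavour of Gödel numbering, storing a sequence of symbols in one integer. The function names and the choice of positive integers as symbol codes are my own assumptions for illustration, not anything from either paper.

```python
def primes():
    """Yield 2, 3, 5, ... by trial division (fine for short sequences)."""
    found = []
    n = 2
    while True:
        if all(n % p for p in found):
            found.append(n)
            yield n
        n += 1


def godel_encode(symbols):
    """Encode positive integers s_1..s_k as 2**s_1 * 3**s_2 * 5**s_3 * ..."""
    if any(s < 1 for s in symbols):
        raise ValueError("symbol codes must be positive integers")
    number = 1
    for p, s in zip(primes(), symbols):
        number *= p ** s
    return number


def godel_decode(number):
    """Recover the symbol codes by reading off the prime exponents in order."""
    symbols = []
    for p in primes():
        if number == 1:
            break
        exponent = 0
        while number % p == 0:
            number //= p
            exponent += 1
        if exponent == 0:
            raise ValueError("not a valid encoding")
        symbols.append(exponent)
    return symbols


encoded = godel_encode([3, 1, 2])   # 2**3 * 3**1 * 5**2
decoded = godel_decode(encoded)
```

The whole "rule" lives in a single integer and comes back out losslessly, which is all the point needs.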
That statement caught my eye. It's either trivially true or quite clearly wrong, depending on how you mean it.
In the literal meaning it's true. Given any finite set of real numbers, I can easily produce a different set (for example, the original set plus one number that wasn't in it, such as one plus the largest element) from which you can trivially compute the original set.
But if you mean that you give me both sets, then that can't be true. For example, if you give me a single real number as set A and the empty set as set B, then I can't write a program which generates set A from set B. Your real number in set A could encode anything.
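A rough sketch of the "literal meaning" construction, in case it helps. I'm using `Fraction` as a stand-in for real numbers so the arithmetic is exact, and `extend` / `recover` are names I made up for this example.

```python
from fractions import Fraction


def extend(a):
    """Build a different set B from A by adding one plus the largest element.

    For the empty set there is no largest element, so 0 is used instead
    (an arbitrary choice for this sketch).
    """
    sentinel = max(a) + 1 if a else Fraction(0)
    return frozenset(a) | {sentinel}


def recover(b):
    """Compute A back from B by dropping B's largest element."""
    return frozenset(b) - {max(b)}


a = frozenset({Fraction(1, 2), Fraction(3), Fraction(-7, 4)})
b = extend(a)
round_trip = recover(b)
```

This only works because I get to pick B. If B is handed to me as the empty set, `recover` has nothing to read the number from, which is the second case.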
And that's why, in computation theory, the set of symbols is the union of the input and output. Since set B is a subset of set A, the set that governs any program from B to A has set A as its domain.
It's a learned mapping from one representation to another, not some semantic lookup against an exogenous source.