At best, it's a wordlist. It gives the LLM some idea of what humans consider to be common words, but it doesn't tell the LLM anything at all about those words. And it's not even comprehensive: many words map to multiple tokens. Nor is it exclusively words: some of those tokens are punctuation, or modifiers, or control tokens. On multimodal LLMs, some of the tokens actually represent image and audio data.
The LLM doesn't get informed about any of this up front; it has to learn what every single token means from context.
You are technically right that it's something in an LLM that's not weights, but it's not that structured. And really it's only there so the LLM can interact with the outside world.
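To make the "wordlist" point concrete, here is a minimal sketch of a tokenizer as a plain string-to-id table. Everything in it is made up for illustration: `TOY_VOCAB` is a hypothetical nine-entry vocabulary, and the greedy longest-match `encode` is a stand-in, not how any real tokenizer (BPE or otherwise) is implemented.

```python
# Hypothetical toy vocabulary: nothing but strings mapped to integer ids.
# Note that it carries no definitions, no grammar, no meaning.
TOY_VOCAB = {
    "<bos>": 0,   # a control token, not a word
    " the": 1,
    " token": 2,
    "izer": 3,    # a word fragment
    " is": 4,
    " a": 5,
    " word": 6,
    "list": 7,
    ".": 8,       # punctuation
}


def encode(text, vocab=TOY_VOCAB):
    """Greedy longest-match encoding (illustrative only, not real BPE)."""
    ids = [vocab["<bos>"]]
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink it.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                ids.append(vocab[piece])
                i = end
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


# "tokenizer" and "wordlist" are not in the table, so each becomes two ids.
print(encode(" the tokenizer is a wordlist."))
```

The model only ever sees the integers that come out; whatever those ids "mean" has to be learned in the weights.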
> There are grammar rules
There is no dedicated "grammar rule" structure in the LLM or the tokeniser. It has to learn them all from context, and they get encoded as part of the 80 layers of weights.
I think the short story captures this well. Weights (connections) are the essential and philosophically important part. They do the thinking, memory, singing etc.
As you said, it's not in any way intrinsic to the LLM, though it may be a very necessary optimization on today's hardware.
IMO, we are probably talking about a 6x slowdown (for typical English). You would need to be absolutely stupid not to implement some kind of optimisation along these lines.
Slower and maybe a little dumber, but it would work.
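For anyone who wants to check a figure like that on their own text, this is roughly how the ratio would be measured: count the UTF-8 bytes a byte-level model would have to step through and divide by the number of tokens. The split in `pieces` is invented by hand for this sketch; a real tokenizer would split differently, so the number it prints says nothing about the 6x estimate above.

```python
def bytes_per_token(pieces):
    """Average number of UTF-8 bytes covered by each token in `pieces`."""
    text = "".join(pieces)
    n_bytes = len(text.encode("utf-8"))
    return n_bytes / len(pieces)


# Hypothetical hand-made split of one English sentence into tokens.
pieces = [" The", " model", " has", " to", " learn",
          " every", " token", " from", " context", "."]

# A byte-level model needs this many times more sequence positions
# than a model using the split above.
print(bytes_per_token(pieces))
```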
That is your takeaway from the 1991 story?
That paper did not train the models on 'a language with strong consistent grammars'. Mathematical operation tables are not a language. Grammar itself is a post-hoc rationalization, and there's no evidence LLMs follow 'grammar rules' any more than the brain follows grammar rules. Of course, that's not to say transformers can't learn simple rules if the dataset calls for it.
Not a natural language, but they are certainly a language as in a symbolic representation of information.
A sentence is a finite sequence of symbols drawn from an alphabet.
In this sense, mathematical operation tables are absolutely a language. As are natural languages.
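As a sketch of that formal-language reading: take an alphabet of digits plus `+` and `=`, and let the language be exactly the set of strings that appear as rows of an operation table. Addition modulo 5 is an assumed example here, chosen only because it is small; the names `ALPHABET`, `operation_table`, and `is_sentence` are made up for illustration.

```python
P = 5  # modulus, an arbitrary choice for this sketch
ALPHABET = set("0123456789+=")


def operation_table(p=P):
    """All sentences 'a+b=c' of the addition-mod-p table, as strings."""
    return {f"{a}+{b}={(a + b) % p}" for a in range(p) for b in range(p)}


def is_sentence(s, p=P):
    """True if s is a finite string over ALPHABET that belongs to the table."""
    return set(s) <= ALPHABET and s in operation_table(p)


print(is_sentence("2+4=1"))  # well-formed and in the table
print(is_sentence("2+4=6"))  # well-formed symbols, but not in the language
```

Under this definition the table is a (finite) language; whether that is the sense of "language" that matters for the argument is a separate question.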
A language is a structured system of communication used to express arbitrary ideas between multiple parties. Math operation tables do not, and cannot, do that on their own.
That distinction matters here because we are talking about what properties the model is expected to learn. English and operation tables are fundamentally different objects, so it is not surprising that a model learns different kinds of structure from them.
Or, to echo the article, the dictionary is made out of weights.
Fractally or factually? You mean wrong on so many levels you need a fractal to capture them? If so, what if you could use a neural network instead?
The tokenizer is, at best, a sensory mechanism as evidenced by 1) the random generation of the tokenization scheme, and 2) vastly different tokenization schemes produce virtually identical behavior. It'd be like if Noah Webster threw a bunch of movable type into a bucket (breaking some words in half) and then drew randomly to make the first English dictionary.
EDIT: I was too cavalier with the comparison of tokenizer to sensory modality. My ultimate point is that direct byte-to-token transformers can achieve similar overall performance, which to me makes a weights-to-meat comparison pretty straightforward. But the particular tokenizer in use certainly has a large impact on both efficiency and accuracy on specific problems (e.g. digit representation).
So when I say that the grok paper and the pong paper fundamentally agree, I have some idea of what I'm talking about.
It's just that the rules we feed into the model are extremely poorly defined, and we end up with a soup of disjoint rules smeared all across the weights.
This isn't a feature of the models. It's a feature of the training set.
Being shocked that you can store rules in floating point numbers is the same as being shocked you can store rules in integers. It's been almost a century since Gödel numbering was invented; we should be used to it by now.
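For readers who haven't seen it, here is a minimal sketch of the prime-power form of Gödel numbering: a sequence of positive symbol codes c1, c2, c3, ... is stored as the single integer 2^c1 * 3^c2 * 5^c3 * ..., and recovered by factoring. The function names are my own, and how symbols get their codes is left as an assumption (any fixed table of positive integers works).

```python
def primes():
    """Yield 2, 3, 5, 7, ... using simple trial division."""
    found = []
    n = 2
    while True:
        if all(n % p for p in found):
            found.append(n)
            yield n
        n += 1


def godel_encode(codes):
    """Pack positive integer codes into one integer via prime powers."""
    n = 1
    for p, c in zip(primes(), codes):
        if c < 1:
            raise ValueError("codes must be positive integers")
        n *= p ** c
    return n


def godel_decode(n):
    """Recover the list of codes from an integer made by godel_encode."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    codes = []
    for p in primes():
        if n == 1:
            break
        c = 0
        while n % p == 0:
            n //= p
            c += 1
        if c == 0:
            raise ValueError("not a valid encoding: a prime was skipped")
        codes.append(c)
    return codes


# Assumed symbol table for illustration: "0" -> 1, "=" -> 2.
# The formula "0=0" becomes the codes [1, 2, 1], i.e. 2**1 * 3**2 * 5**1.
print(godel_encode([1, 2, 1]))
print(godel_decode(600))
```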
That statement caught my eye. It's either trivially true or quite clearly wrong, depending on how you mean it.
In the literal meaning it's true. Given any finite set of real numbers, I can easily produce a different set (for example, take the original set and add a number that wasn't in it, such as one plus the largest element) from which you can trivially produce the original set computationally.
But if you mean you give me both sets, then that can't be true. For example, if you give me a single real number as set A and the empty set as set B, then I can't create a program which generates set A from set B. Your real number in set A could encode anything.
And that’s why in computation theory, the set of symbols is the union of the input and output. Since set B is a subset of set A, the set that governs any program from B to A has set A as its domain.
It's a learned mapping from one representation to another, not some semantic lookup against an exogenous source.
And they're made out of weights.
The 'magic' in weights is that the rules are spread through the whole model and you can't point to one place which encodes them.
The grokking paper shows that this stops being the case with enough training data and enough compute.
> The 'magic' in weights is that the rules are spread through the whole model ... The grokking paper shows that this stops being the case with enough training data and enough compute.
I don't understand what you mean to say. That weights are not magic? That weights are not weights? NNs are made up of weights, which are learned and not coded. The fact that they do learn world models (grammar rules in your example), and that these models' weights tend to roughly concentrate by function and level of representation, is perfectly logical but even more amazing. (Notice that much of the dismissive attitude towards LLMs depicts them as pure syntactic manipulators without the ability to develop world models, the exact opposite of what you point out.)
I can write, and have written, programs using an evolutionary algorithm that then run on bare metal. None of the things you list are true for those programs, yet other than being computationally more expensive to train, they work just as well as neural networks.
>I don't understand what you mean to say
The diffuseness of weights across the whole model isn't an innate feature of deep learning models. It is a feature of sparse training data and little compute.
"The weights make the words. Are you understanding me? We opened it up. There's no dictionary in there, no grammar rules, no little man. Just weights. Eighty layers of numbers getting multiplied together."
In this context "there's no grammar rules" means "no separately hand-coded grammar rules". Everything is made up of weights, and the fact that the weights that end up encoding grammar rules tend to concentrate in particular locations (without being self-contained; there is no hard boundary) rather than being uniformly diffused through the model is irrelevant to the matter. It seems you're arguing against a diffuseness requirement that is not in the text.
You can't move your mind to any other brain, but weights can run on any GPU.
Weights.