> i have a pretty good understanding of how transformers work but this did not make sense to me. also i dont understand why this strategy is applicable only to "code tokens"

Yes, there is a monstrous lack of detail here, and you should be skeptical of most of the article's claims. The language is also, IMO, non-standard (serious people don't talk about self-attention as lookup tables anymore; that was never a good analogy in the first place), and no good work would express this in prose alone: there would be a simple equation showing the typical scaled dot-product attention formula, and then e.g. some dimension notation/details indicating which matrix (or inserted projection matrix) picked up a dimension of two somewhere. Without that, the claims are inscrutable (EDIT: see edit below).
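For context, the standard scaled dot-product attention formula the comment is asking for looks something like this; the shapes below are illustrative placeholders, not taken from the article:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, the standard formulation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                            # (n_q, d_v)

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))   # 5 queries, d_k = 8 (placeholder dims)
K = rng.standard_normal((7, 8))   # 7 keys
V = rng.standard_normal((7, 4))   # 7 values, d_v = 4
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 4)
```

Any dimension-two claim would then have to show up explicitly in one of these shapes, which is exactly the detail the article omits.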

There are also no training details or loss function details, both of which would be necessary (and almost certainly highly novel) to make this kind of thing end-to-end trainable, which is another red flag.

EDIT: The key line seems to be around:

    gate, val = ff_in(x).chunk(2, dim=-1)
and related code, plus the line "Notice: d_model = 36 with n_heads = 18 gives exactly 2D per head" — but, again, this is very unclear and non-standard.
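A minimal sketch of what that `chunk` call and the dimension note amount to, using numpy in place of PyTorch; the projection weights here are random placeholders, not anything from the article:

```python
import numpy as np

d_model, n_heads = 36, 18
head_dim = d_model // n_heads   # 36 / 18 = exactly 2 dims per head
assert head_dim == 2

rng = np.random.default_rng(0)
seq_len = 5
x = rng.standard_normal((seq_len, d_model))

# ff_in maps d_model -> 2*d_model; .chunk(2, dim=-1) then splits the
# result into two equal halves along the last axis.
W = rng.standard_normal((d_model, 2 * d_model))  # placeholder weights
out = x @ W
gate, val = np.split(out, 2, axis=-1)  # numpy analogue of .chunk(2, dim=-1)
print(gate.shape, val.shape)  # (5, 36) (5, 36)
```

This shows the mechanics, but not *why* a 2D-per-head layout would matter, which is the part the article leaves unexplained.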
reply
Treating attention as a lookup operation is popular among computational complexity theorists (e.g. https://arxiv.org/abs/2310.03817 ) because it's easier to work with when you're explicitly constructing a transformer to perform a particular computation, just to demonstrate that transformers can, in theory, perform it. That's also why there are no training details: the weights are computed directly and not trained.
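To make that concrete, here is a toy version of the kind of hand-constructed (not trained) lookup those papers build: one-hot keys plus a large scale make softmax attention saturate into a near-exact table lookup. All weights below are set by hand purely for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# A "table" of 4 entries: keys are one-hot slots, values are the payloads.
K = np.eye(4)                        # hand-constructed, not trained
V = np.array([[10.], [20.], [30.], [40.]])

# A query that wants slot 2, scaled up so softmax saturates to ~one-hot.
beta = 50.0                          # large inverse temperature
q = beta * np.eye(4)[2]

attn = softmax(q @ K.T)              # ~[0, 0, 1, 0]
out = attn @ V                       # ~30.0: an (approximate) table lookup
print(out)
```

The softmax only *approximates* a hard lookup (it's exact in the infinite-scale limit), which is part of why the analogy is handy for constructive proofs but says little about what trained models actually learn.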
reply
This is a good link and important (albeit niche) qualification.

It is hard to square with the article's claims about differentiability, and with the general lack of clarity / obscurantism about what they are really doing here: they really are just compiling / encoding a simple computer / VM into a slightly-modified transformer, which, while cool, is not at all what they make it sound like.

reply
> lookup tables anymore, that was never a good analogy in the first place

Good analogy or not, weren't hash tables the motivation for the key/value tables?

reply
Well, one can never be sure what the real motivation for a lot of DL advances was, as most papers are post-hoc obscurantism / hand-waving or even just outright nonsense (see: the internal covariate shift explanation for batch norm, which arguably couldn't be more wrong: https://arxiv.org/pdf/1805.11604).

When you really get into this stuff, you tend to see the real motivations as either e.g. kernel smoothing (see comments / discussion at https://news.ycombinator.com/item?id=46357675#46359160) or as encoding correlations / feature similarities / multiplicative interactions (see e.g. the broad discussion at https://news.ycombinator.com/item?id=46523887). IMO most insights in LLM architectures and layers tend to come from intuitions about projections, manifolds, dimensionality, smoothing/regularization, overparameterization, matrix conditioning, manifold curvature, etc.
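The kernel-smoothing view mentioned above can be made concrete: single-query softmax attention is a Nadaraya-Watson estimator, i.e. a kernel-weighted average of the values with an exponential kernel on dot-product similarity. This is a standard observation, not something from this thread; the data below is random placeholder:

```python
import numpy as np

def nadaraya_watson(q, keys, vals, bandwidth=1.0):
    """Kernel-weighted average of vals; exponential kernel on dot products."""
    w = np.exp(keys @ q / bandwidth)  # unnormalized kernel weights
    w = w / w.sum()                   # normalize: a convex combination
    return w @ vals

rng = np.random.default_rng(1)
keys = rng.standard_normal((6, 4))
vals = rng.standard_normal((6, 3))
q = rng.standard_normal(4)

# With bandwidth = sqrt(d_k), this is exactly single-query
# scaled dot-product attention: softmax(q K^T / sqrt(d_k)) V.
out = nadaraya_watson(q, keys, vals, bandwidth=np.sqrt(4.0))
print(out.shape)  # (3,)
```

Under this reading, attention is a smoother over the value vectors rather than a lookup, which arguably yields more usable intuitions (about bandwidth, saturation, and averaging) than the hash-table framing does.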

There are almost zero useful understandings or insights to be gained from the lookup-table analogy, and most statistical explanations in papers are likewise post-hoc, requiring assumptions (convergence rates, infinite layers, etc.) that are never shown to hold for the actual models people use. Obviously these AI models work very well for a lot of tasks, but our understanding of why is, for the most part, incredibly poor and simplistic.

Of course, this is just IMO, and you can see some people in the linked threads do seem to find the lookup table analogies useful. I doubt such people have spent much time building novel architectures, experimenting with different layers, or training such models.

reply