It is hard to square this with the article's claims about differentiability, and with the general lack of clarity / obscurantism about what they are really doing here: they are just compiling / encoding a simple computer / VM into a slightly modified transformer, which, while cool, is not at all what they make it sound like.
Good analogy otherwise; weren't hash tables the motivation for the key/value tables?
When you really get into this stuff, you tend to see the real motivations either as kernel smoothing (see comments / discussion at https://news.ycombinator.com/item?id=46357675#46359160) or as encoding correlations / feature similarities / multiplicative interactions (see e.g. the broad discussion at https://news.ycombinator.com/item?id=46523887). IMO most insights into LLM architectures and layers tend to come from intuitions about projections, manifolds, dimensionality, smoothing/regularization, overparameterization, matrix conditioning, manifold curvature, etc.
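For concreteness, here is a minimal sketch (my own, not from the article or the linked threads) of what the kernel-smoothing view means: single-query softmax attention is exactly a Nadaraya-Watson estimator, i.e. a weighted average of the values, with weights given by an exponential kernel on query-key similarity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def nadaraya_watson(q, keys, values, bandwidth=1.0):
    # Nadaraya-Watson: estimate f(q) as a similarity-weighted average
    # of the observed values, using an exponential (softmax) kernel
    # over dot-product similarity between q and each key.
    weights = softmax(keys @ q / bandwidth)
    return weights @ values

rng = np.random.default_rng(0)
d = 4
keys = rng.normal(size=(8, d))
values = rng.normal(size=(8, d))
q = rng.normal(size=d)

# Standard scaled dot-product attention for a single query:
#   attention(q, K, V) = softmax(q K^T / sqrt(d)) V
attn = softmax(keys @ q / np.sqrt(d)) @ values

# ...which coincides with the kernel smoother at bandwidth sqrt(d).
nw = nadaraya_watson(q, keys, values, bandwidth=np.sqrt(d))
assert np.allclose(attn, nw)
```

Under this reading, the sqrt(d) temperature is just the kernel bandwidth, and "attending" is interpolating over a learned similarity structure rather than looking anything up.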
There are almost zero useful insights to be gained from the lookup-table analogy, and most statistical explanations in papers are post-hoc, resting on assumptions (convergence rates, infinite layers, etc.) that are never shown to actually hold for the models people use. Obviously these models work very well for a lot of tasks, but our understanding of why is, for the most part, incredibly poor and simplistic.
Of course, this is just IMO, and some people in the linked threads do seem to find the lookup-table analogy useful. I doubt such people have spent much time building novel architectures, experimenting with different layers, or training these models.