It is hard to square this with the article's claims about differentiability, and with the general lack of clarity / obscurantism about what they are really doing here: they are just compiling / encoding a simple computer / VM into a slightly modified transformer, which, while cool, is not at all what they make it sound like.
Good analogy otherwise; weren't hash tables the motivation for the key/value tables?
When you really get into this stuff, you tend to see the real motivations either as kernel smoothing (see comments / discussion at https://news.ycombinator.com/item?id=46357675#46359160) or as encoding correlations / feature similarities / multiplicative interactions (see e.g. the broad discussion at https://news.ycombinator.com/item?id=46523887). IMO most insights into LLM architectures and layers tend to come from intuitions about projections, manifolds, dimensionality, smoothing/regularization, overparameterization, matrix conditioning, manifold curvature, etc.
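For concreteness, here is a minimal sketch (my own, not from the article or the linked threads) of what the kernel-smoothing view means: single-query softmax attention is exactly a Nadaraya-Watson estimator, i.e. a weighted average of the values, with weights given by an exponential kernel on query-key similarity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def nadaraya_watson(q, keys, values, bandwidth=1.0):
    # Nadaraya-Watson: estimate f(q) as a similarity-weighted average
    # of the observed values, using an exponential (softmax) kernel
    # over dot-product similarity between q and each key.
    weights = softmax(keys @ q / bandwidth)
    return weights @ values

rng = np.random.default_rng(0)
d = 4
keys = rng.normal(size=(8, d))
values = rng.normal(size=(8, d))
q = rng.normal(size=d)

# Standard scaled dot-product attention for a single query:
#   attention(q, K, V) = softmax(q K^T / sqrt(d)) V
attn = softmax(keys @ q / np.sqrt(d)) @ values

# ...which coincides with the kernel smoother at bandwidth sqrt(d).
nw = nadaraya_watson(q, keys, values, bandwidth=np.sqrt(d))
assert np.allclose(attn, nw)
```

Under this reading, the sqrt(d) temperature is just the kernel bandwidth, and "attending" is interpolating over a learned similarity structure rather than looking anything up.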
There are almost zero useful insights to be gained from the lookup-table analogy, and most statistical explanations in papers are post-hoc, resting on assumptions (convergence rates, infinite layers, etc.) that are never shown to actually hold for the models people use. Obviously these models work very well for a lot of tasks, but our understanding of why is, for the most part, incredibly poor and simplistic.
Of course, this is just IMO, and some people in the linked threads do seem to find the lookup-table analogy useful. I doubt such people have spent much time building novel architectures, experimenting with different layers, or training these models.