I didn't look at all the details, but I wanted to see how you did the initial embedding, and I see you do have a 14x5 matrix there. I guess when you are setting things by hand (rather than learning them), the definition of counting "parameters" is a bit unclear. One could say all of those are parameters, even if they are set in a straightforward way.
reply
Yeah, basically it is an implementation detail: most of the entries are zero, and there is an equivalent 14-parameter sparse matrix for that.
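To make the counting concrete, here is a rough sketch of how a mostly-zero 14x5 matrix can be stored as 14 values. The layout is an assumption for illustration only (one nonzero per row, with made-up columns and values); the actual matrix in the post may place its nonzeros differently.

```python
import numpy as np

# Hypothetical layout: exactly one nonzero per row of a 14x5 embedding.
# The column choice and the values are invented for illustration.
rows = np.arange(14)
cols = rows % 5                       # assumed column of each row's nonzero
vals = np.linspace(1.0, 14.0, 14)     # assumed nonzero values (14 numbers)

# Dense form: 14 * 5 = 70 stored entries, most of them zero.
dense = np.zeros((14, 5))
dense[rows, cols] = vals

def embed_dense(token: int) -> np.ndarray:
    """Look up a token's embedding in the dense matrix."""
    return dense[token]

def embed_sparse(token: int) -> np.ndarray:
    """Rebuild the same embedding from the 14 stored values only."""
    out = np.zeros(5)
    out[cols[token]] = vals[token]
    return out

dense_count = dense.size                        # entries in the dense matrix
nonzero_count = int(np.count_nonzero(dense))    # entries that carry information
```

Both lookups return the same vectors, so whether you call it 70 parameters or 14 depends on whether the structural zeros count.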
reply
I ask this question as someone who can't do much more than confirm that your blog post is written in English by someone who knows math.

Does this result suggest that if we had N clever humans manually building an LLM, they might come up with something as smart as a frontier model, but potentially 45 times smaller? (1644 / 36 ~= 45, N = very large, time not specified)

reply
I imagine getting things to be polysemantic in a way that does not interfere would lead to sublinear scaling. Also, some smaller ones were trained, so the ratio would still be more like 311/36 ~= 8.6.
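For reference, the two ratios mentioned in this thread work out as follows. This assumes the three figures are parameter counts, which is how they are used above; the variable names are just labels for this sketch.

```python
# Figures quoted in the thread; treating them as parameter counts is an assumption.
hand_built = 36       # the hand-constructed model
larger_trained = 1644
smaller_trained = 311

ratio_large = larger_trained / hand_built    # about 45.7
ratio_small = smaller_trained / hand_built   # about 8.6

print(f"{ratio_large:.1f}x vs {ratio_small:.1f}x")
```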
reply
>I imagine getting things to be polysemantic in a way that does not interfere would lead to sublinear scaling.

True, but with even smarter humans, you could exploit the interactions for additional calculations.

While it sounds a bit silly, it is one of the hypotheses behind a fast takeoff. An AI that is sufficiently smart could design a network better than a trained one and could make something much smarter than itself on the same hardware. The question then becomes whether that new, smarter one can do an even better job. I suspect diminishing returns, but then again I am insufficiently smart.

reply
Yeah that is plausible enough.
reply
Thanks!

(I see the Trained Weights results now, thanks.)

reply