MoE notwithstanding, a model trained on the whole Internet and a few hundred thousand stolen books carries way more knowledge than is actually needed for any given workflow. It would be great if we could ship slimmed-down models into which we'd plug the knowledge banks useful for today's work, and only those.
It would also mean that you could keep a model's knowledge fresh without retraining the whole of it.
*plugs in knowledge bank*
LLM: ... I know kung fu.
Have you tried a simple inline loop over the duplicated layers? Would be interesting to see the performance. Also, it would be interesting to compare with an MoE model, to see if these layers are acting like different agreeing "experts" or if there is reasoning happening in the latent space.
I think this hasn't been tried before because it's totally unintuitive that feeding the output from later layers into previous ones would actually do anything. And in fact, it usually is detrimental. I guess it takes really bored hobbyists with too much compute to check this stuff.
I have done some interesting work on applying multiple layer duplications in different regions of the model too, going so far as to train a meta-model (actually just XGBoost) to predict the merges. Seems to work, but that's a whole other blog post.
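To make the idea concrete, here is a minimal sketch of what such a meta-model could look like; the feature set, the scoring stub, and all names below are my assumptions for illustration, not the actual setup from that work.

    # Hypothetical sketch: rank candidate layer duplications with an
    # XGBoost meta-model instead of benchmarking every one of them.
    import numpy as np
    import xgboost as xgb

    def featurize(start, span, repeats, n_layers=32):
        # Describe a candidate merge: where the duplicated span starts,
        # how wide it is, how often it repeats, and its relative depth.
        return [start, span, repeats, start / n_layers, span * repeats]

    def evaluate_merge(start, span, repeats):
        # Stand-in for the expensive step: build the merged model and
        # run a benchmark. A random score keeps this sketch runnable.
        return float(np.random.rand())

    # Fit on a set of measured (candidate, score) pairs...
    candidates = [(s, w, r) for s in range(2, 28) for w in (2, 4) for r in (2, 3)]
    X = np.array([featurize(*c) for c in candidates])
    y = np.array([evaluate_merge(*c) for c in candidates])
    meta = xgb.XGBRegressor(n_estimators=200, max_depth=4)
    meta.fit(X, y)

    # ...then score unseen candidates cheaply and benchmark only the best.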
This works with MoE, and yes, I would be interested in looking into this in more detail. But my wife might disagree with this time sink...
Normal:

  L1 -> L2 -> L3 -> L4 -> out

Unrolled (current framing):

  L1 -> [L2->L3] -> [L2->L3] -> L4 -> out

Looped (proposed):

          --<--loop---
         |            |
  L1 -> [L2->L3] x N --> L4 -> out
        "reasoning loop"

(Note: rendering ASCII diagrams on HN is not trivial.)
See the left-hand side of the diagram here, which is your exact proposal:
Great read, makes you wonder what else is encoded in these models that might be useful!
It's less a 'tool' than an assorted set of scripts tailored to my unusual hardware setup. But it should be easy to extend. I would have released this earlier, but I had the (stupid) idea to 'write a paper' on this; aiming for that delayed it a year. Blogs are the way to go (for me).
Extra thanks for writing it in a readable and approachable way! I don't have much of a background in this topic, but still managed to understand about 70-80% of it :) You're a good writer.
Pretty cool though. LLM brain surgery.
I really think, from the experiments, that 'organs' (not sure what to term this) develop during massive pretraining. This also means maybe looping the entire model is actually not efficient. Maybe a better way is [linear input section -> loop 1 -> linear section -> loop 2 -> linear section -> ... -> loop n -> linear output]?
This would give 'organs' space to develop.
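A quick sketch of that layout, reusing the looped-span idea from above; the segment boundaries and names here are made up for illustration:

    # Hypothetical [linear -> loop -> linear -> ... -> linear] layout:
    # alternate plain segments ("organ" space) with looped spans.
    import torch.nn as nn

    class SegmentedLoopStack(nn.Module):
        def __init__(self, blocks: nn.ModuleList, segments):
            # `segments` is a list of (start, end, n_loops) triples;
            # n_loops == 1 marks a plain linear section.
            super().__init__()
            self.blocks = blocks
            self.segments = segments

        def forward(self, x):
            for start, end, n_loops in self.segments:
                for _ in range(n_loops):
                    for blk in self.blocks[start:end]:
                        x = blk(x)
            return x

    # e.g. for a 12-block stack: linear in, two loops, linear out
    # segments = [(0, 3, 1), (3, 5, 4), (5, 7, 1), (7, 9, 4), (9, 12, 1)]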
Finding them, on the other hand, is not easy! As you've shown, I guess brute force is one way... it would be nice to find a shortcut, but unfortunately, as your diagrams show, the landscape isn't exactly smooth.
I would also hypothesize that different circuits likely exist for different "problems", and that these are messy and overlapping, so the repeated layers that improve math, for example, may not line up with the repeated layers that improve poetry or whatever; basic layer repetition may be too "simple" to be very general. That said, you've obviously shown that there is some amount of generalization at work, which is definitely interesting.