Ever since I read about this, I have been thinking about the next logical step: training a NN to route the internal loops dynamically after each layer. Instead of fixing a given set of layers to repeat, let a small classifier decide whether to loop, where to loop, how many times, whether to loop over a large block, or whether to jump straight to the final layers. Each token could then loop more or less depending on its relevance.

It has some similarities to an MoE architecture, but instead of choosing experts, it chooses layer routes. Training this router together with the LLM could, if it works, drastically condense the number of layers required for a given level of intelligence. If anyone wants to work on this, feel free to send me a message.
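A minimal sketch of the per-token routing idea, using toy numpy "layers": after each layer, a small router picks one of three actions for the token, repeat this layer, continue to the next, or jump straight to the final layer. All names here (`route_token`, `MAX_STEPS`, the router shape) are illustrative assumptions, not from any existing implementation; a real version would use learned routers and an RL or straight-through training signal.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_LAYERS, MAX_STEPS = 8, 4, 12

# Toy stand-ins for transformer layers: weight matrices with a tanh nonlinearity.
layers = [rng.normal(scale=0.3, size=(D, D)) for _ in range(N_LAYERS)]
# One tiny router per layer: produces scores for [repeat, continue, exit].
routers = [rng.normal(scale=0.3, size=(D, 3)) for _ in range(N_LAYERS)]

def route_token(x):
    """Run one token through the stack with dynamic per-token routing."""
    i, steps = 0, 0
    while i < N_LAYERS - 1 and steps < MAX_STEPS:  # step cap guarantees halting
        x = np.tanh(layers[i] @ x)
        choice = np.argmax(routers[i].T @ x)       # hard routing for clarity
        if choice == 0:
            pass                                   # repeat: loop this layer again
        elif choice == 1:
            i += 1                                 # continue to the next layer
        else:
            i = N_LAYERS - 1                       # jump straight to the final layer
        steps += 1
    return np.tanh(layers[-1] @ x), steps

x = rng.normal(size=D)
out, steps_used = route_token(x)
print(out.shape, steps_used)
```

Because `steps_used` varies per token, "relevant" tokens can spend more compute than filler tokens, which is the whole point of the scheme.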

Thanks!

I have pushed basic code to GitHub (https://github.com/dnhkng/RYS)

Some interesting areas to explore might be combining layer deletion with layer duplication: reduce VRAM by dropping some layers (this works and is well documented), and recover performance by duplicating others (duplicated layers share weights, so they cost no extra VRAM). I am not pursuing this myself, but it seems interesting!
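The prune-then-duplicate idea above can be sketched as a layer schedule over a shared weight list: dropped layers free parameter memory, and a repeated layer is just the same tensor referenced twice, so it adds compute but no VRAM. The 6-layer model and the particular schedule below are made-up examples, not taken from the RYS repo.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
weights = [rng.normal(scale=0.3, size=(D, D)) for _ in range(6)]  # toy 6-layer model

# Drop layers 3 and 4 entirely; run layer 2 twice to compensate.
schedule = [0, 1, 2, 2, 5]
model = [weights[i] for i in schedule]

x = rng.normal(size=D)
for w in model:
    x = np.tanh(w @ x)  # forward pass through the rescheduled stack

# The two copies of layer 2 are literally the same array in memory:
print(model[2] is model[3])            # True: duplication costs no parameter memory
print(len({id(w) for w in model}))     # 4 distinct weight tensors, not 5
```

In a real model the schedule itself is the thing to search over, e.g. by measuring perplexity after each drop/duplicate edit.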

Thanks -- interesting. I like the idea of ablating layers. I imagine you could build a differentiable stack with layer-skip and layer-copy/loop operations plus a total-memory-use loss term; that would let someone ship either a big model (usually ablate) or a little one (usually copy). The expert routing for longer sequences interests me a lot, because the bottleneck in edge inference is always memory bandwidth.
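One way the "differentiable stack with a memory-use loss" could look: give each layer a scalar gate in (0,1), blend the layer output with a residual skip, and charge each layer its parameter count weighted by its gate. Gates driven toward 0 mark layers to ablate; gates near 1 mark layers to keep (or copy). This is a sketch under assumed names (`gate_logits`, `mem_weight`); numpy is used only to show the objective, and in practice you would make the gates learnable under autograd (PyTorch/JAX).

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_LAYERS = 8, 4
layers = [rng.normal(scale=0.3, size=(D, D)) for _ in range(N_LAYERS)]
gate_logits = np.zeros(N_LAYERS)  # learnable parameters in a real framework

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, logits):
    # Soft skip: gate near 1 applies the layer, gate near 0 passes x through.
    for w, g in zip(layers, sigmoid(logits)):
        x = g * np.tanh(w @ x) + (1.0 - g) * x
    return x

def total_loss(x, target, logits, mem_weight=1e-3):
    task = np.sum((forward(x, logits) - target) ** 2)
    # Expected parameter memory: each kept layer costs its D*D weights.
    mem = np.sum(sigmoid(logits)) * D * D
    return task + mem_weight * mem

x, target = rng.normal(size=D), rng.normal(size=D)
print(total_loss(x, target, gate_logits))
```

Tuning `mem_weight` trades accuracy against footprint, which is exactly the big-vs-little-model knob described above.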