
Have a look at the boundaries in the heatmaps.

They are of course open to interpretation, but they suggest to me that the models develop 'organs' for processing different types of data, and that unless you duplicate the 'whole organ' you don't get the benefits.

This is quite different from the picture you usually get from layer ablation experiments. Thoughts?
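To make the contrast concrete, here is a toy sketch of the two interventions on a layer stack. It assumes the stack can be treated as a plain ordered list; `duplicate_block` and `ablate_layer` are hypothetical helper names for illustration, not the code behind the heatmaps.

```python
from typing import List, Sequence


def duplicate_block(layers: Sequence[str], start: int, end: int) -> List[str]:
    """Repeat the contiguous block layers[start:end] right after itself.

    This is the 'whole organ' intervention: the block runs twice in a row.
    """
    if not 0 <= start < end <= len(layers):
        raise ValueError("need 0 <= start < end <= len(layers)")
    return list(layers[:end]) + list(layers[start:end]) + list(layers[end:])


def ablate_layer(layers: Sequence[str], index: int) -> List[str]:
    """Drop a single layer, as in a typical layer ablation experiment."""
    if not 0 <= index < len(layers):
        raise ValueError("index out of range")
    return list(layers[:index]) + list(layers[index + 1:])


# Layer names stand in for real transformer blocks (an assumption for the sketch).
stack = ["L0", "L1", "L2", "L3", "L4", "L5"]

print(duplicate_block(stack, 2, 4))  # the block L2..L3 appears twice
print(ablate_layer(stack, 3))        # L3 is removed
```

Sweeping `start` and `end` over all valid pairs and scoring each modified stack would give a 2D grid of results, whereas ablation gives one score per removed layer.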

by doctorpangloss 4 hours ago

Maybe you are observing artifacts of Qwen's training procedure. Perhaps they initialized later layers with the weights of earlier ones as part of the training curriculum. But it's fun to imagine something more exotic.

by dnhkng 1 hour ago

There are similar patterns in the models from all the big labs. I think the transformer layer stack starts out 'undifferentiated', analogous to stem cells. Pre-training pushes the model to develop structure, and this technique helps uncover that hidden structure.