Maybe you are observing artifacts of Qwen's training procedure. Perhaps, as part of the training curriculum, they initialized later layers with the weights of earlier ones. But it's fun to imagine something more exotic.
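One way to probe that hypothesis is to compare the layers' weights directly, e.g. pairwise cosine similarity of the flattened parameter tensors. A minimal sketch with toy matrices standing in for a real layer stack (nothing here is Qwen-specific):

```python
import numpy as np

def layer_cosine_similarity(layers):
    """Pairwise cosine similarity between flattened per-layer weight tensors."""
    flat = [w.ravel() / np.linalg.norm(w.ravel()) for w in layers]
    n = len(flat)
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = float(flat[i] @ flat[j])
    return sim

# Toy stand-in for a layer stack: layer 3 copies layer 1's weights,
# mimicking the hypothesized initialize-from-an-earlier-layer curriculum.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(8, 8)) for _ in range(4)]
layers[3] = layers[1].copy()

sim = layer_cosine_similarity(layers)
# A duplicated initialization shows up as an off-diagonal entry near 1,
# while independently initialized pairs stay near 0.
```

On a real checkpoint you would run the same comparison over each layer's attention and MLP weight matrices; residual similarity between distant layers after training would be the artifact in question.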
There are similar patterns in the models from all the big labs. I think the transformer layer stack starts out 'undifferentiated', analogous to stem cells. Pre-training pushes the model to develop structure, and this kind of analysis helps uncover that hidden structure.