I think this hasn't been tried before because it's totally unintuitive that feeding the output from later layers into previous ones would actually do anything. And in fact, it usually is detrimental. I guess it takes really bored hobbyists with too much compute to check this stuff.
I have done some interesting work on applying multiple layer duplications in different regions of the model too, going so far as to train a meta-model (actually just XGBoost) to predict the merges. Seems to work, but that's a whole other blog post.
This works with MoE, and yes, I would be interested in looking into this in more detail. But my wife might disagree with this time sink...
Normal:
L1 -> L2 -> L3 -> L4 -> out
Unrolled (current framing):
L1 -> [L2->L3] -> [L2->L3] -> L4 -> out
Looped (proposed):
      +--<--loop--+
      |           |
L1 -> [L2->L3] x N --> L4 -> out
      "reasoning loop"
Note: ASCII rendering on HN is not trivial.
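To make the unrolled vs. looped distinction concrete, here is a toy Python sketch. It is not the code from the post: the "layers" are hypothetical stand-ins that only log their name and add 1 to a number, and `make_layer`, `run_unrolled`, and `run_looped` are names made up for illustration. The one thing it assumes is that the duplicated block reuses the same weights, in which case both framings run the layers in the same order and the loop count N is just a runtime parameter.

```python
from typing import Callable, List, Sequence

Layer = Callable[[float], float]


def make_layer(name: str, trace: List[str]) -> Layer:
    """Toy stand-in for a transformer block: logs its name, nudges the state."""
    def layer(x: float) -> float:
        trace.append(name)
        return x + 1.0
    return layer


def run_unrolled(layers: Sequence[Layer], start: int, end: int,
                 copies: int, x: float) -> float:
    """Build a flat layer list with layers[start:end] repeated `copies` times."""
    order = list(layers[:start]) + list(layers[start:end]) * copies + list(layers[end:])
    for layer in order:
        x = layer(x)
    return x


def run_looped(layers: Sequence[Layer], start: int, end: int,
               n_loops: int, x: float) -> float:
    """Same stack, but the middle block is an explicit loop run n_loops times."""
    for layer in layers[:start]:
        x = layer(x)
    for _ in range(n_loops):
        for layer in layers[start:end]:
            x = layer(x)
    for layer in layers[end:]:
        x = layer(x)
    return x


trace: List[str] = []
layers = [make_layer(f"L{i}", trace) for i in range(1, 5)]  # L1..L4

unrolled_out = run_unrolled(layers, 1, 3, 2, 0.0)
unrolled_trace = list(trace)
trace.clear()

looped_out = run_looped(layers, 1, 3, 2, 0.0)
looped_trace = list(trace)

print(unrolled_trace)
print(looped_trace)
```

With two copies and N = 2, both traces come out as L1, L2, L3, L2, L3, L4. The looped version differs only in that N can be changed per call without rebuilding the layer list.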
See the left-hand side of the diagram here, which is your exact proposal: