Residual connections resemble the forward Euler method (this observation led to Neural ODEs, IIRC), which isn't known for being a particularly accurate integrator. If the model was trained with a fixed number of layers, adding extra layers at inference time will also introduce a lot of noise.
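To make the analogy concrete, here's a minimal sketch (with a toy stand-in `f` for a residual block) showing that a residual update is exactly a forward Euler step with step size 1 for dx/dt = f(x):

```python
import numpy as np

def f(x):
    # Toy stand-in for a residual block's transformation.
    return np.tanh(x)

def residual_step(x):
    # Residual connection: x_{l+1} = x_l + f(x_l)
    return x + f(x)

def euler_step(x, h):
    # Forward Euler for dx/dt = f(x): x_{t+h} = x_t + h * f(x_t)
    return x + h * f(x)

x = np.array([0.5, -1.0])
# With h = 1 the Euler step is exactly the residual update.
assert np.allclose(residual_step(x), euler_step(x, 1.0))
```

Forward Euler's local error grows with the step size, which is one way to see why naively repeating or inserting layers is not "clean": the network learned its dynamics for one specific step count.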
Ultimately, the LLM will need to be fine-tuned with the loops, or a looped architecture trained from scratch, such as <https://ouro-llm.github.io>. Unfortunately, they made the mistake of looping the entire LLM rather than just the center portion.
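For clarity, looping "just the center portion" means keeping a single-pass prelude and coda around a recurrent middle stack. A toy numpy sketch (hypothetical layer names, not Ouro's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden dimension

def make_block():
    # A tiny residual block: x -> x + tanh(x @ W)
    W = rng.standard_normal((d, d)) * 0.1
    return lambda x: x + np.tanh(x @ W)

prelude = [make_block() for _ in range(2)]  # run once (e.g. early layers)
core    = [make_block() for _ in range(2)]  # the looped middle stack
coda    = [make_block() for _ in range(2)]  # run once (e.g. late layers)

def forward(x, loops):
    for blk in prelude:
        x = blk(x)
    for _ in range(loops):      # recurrence applied only to the middle
        for blk in core:
            x = blk(x)
    for blk in coda:
        x = blk(x)
    return x

x = rng.standard_normal(d)
out = forward(x, loops=4)
```

Looping the full model instead would repeat the prelude and coda too, which (on this view) wastes compute on layers whose job is just mapping in and out of the "reasoning" representation.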