I find the RYS result far more surprising than the ERD result. Encode-Reasoning-Decode is, after all, a very popular way to design neural networks (even an autoencoder is just that without the reasoning step), so the same structure emerging from optimization isn't that surprising.

But the methodology to measure it and put numbers on which layers are most involved in encoding/decoding and where the reasoning takes place is very valuable.

The finding that the phases are more cleanly separated in large-ish models is interesting. I wonder what this could mean for embedding models? Usually we take small LLMs and chop off the last couple of layers to get an embedding model. But I wonder if you could get better embedding models using something like the first five layers of Qwen3.5-27B, or the first X layers of Kimi K2.5? The methodology in the article seems to give a straightforward way to find the optimal cutting point.
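To make the "cut at layer k, then pool" idea concrete, here's a toy numpy sketch (not from the article; the random linear maps stand in for real transformer blocks, and the cut point k is arbitrary). The point is just that an embedding model is the first k layers plus a pooling step:

```python
import numpy as np

# Toy stand-in for a decoder stack: each "layer" is a random linear map
# plus a nonlinearity, standing in for a real transformer block.
# Depth, width, and the cut point k below are illustrative.
rng = np.random.default_rng(0)
n_layers, d = 8, 16
layers = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]

def embed(tokens, k):
    """Run only the first k layers, then mean-pool over tokens."""
    h = tokens  # (seq_len, d) token embeddings
    for W in layers[:k]:
        h = np.tanh(h @ W)  # stand-in for one transformer block
    return h.mean(axis=0)  # mean pooling -> one vector per sequence

tokens = rng.standard_normal((5, d))
vec = embed(tokens, k=3)  # cut after layer 3 instead of the full stack
print(vec.shape)  # (16,)
```

With a real model you'd get the same effect by requesting all hidden states and pooling `hidden_states[k]`; the article's per-layer measurements would then tell you which k to pick.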

reply
Perhaps not widely known, but certainly known in LLM research. There were a bunch of these experiments done about two years ago, and what's interesting is that it still seems to work on the latest models.

Though beware that the increased score on math and EQ could lead to other areas scoring less well; I'd love to see how these models score on all open benchmarks.

reply
In his first post, the author claimed that the models he modified with this layer-repetition method topped the Hugging Face Open LLM Leaderboard: https://dnhkng.github.io/posts/rys/
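For anyone unfamiliar with the layer-repetition trick: the core of it is just building a deeper model by stacking a copy of a span of layers back into the stack (a "self-merge" / passthrough merge). A minimal sketch, assuming a 32-layer base model and a made-up repeated span; real tools do this on actual weight tensors, but the index bookkeeping is the whole idea:

```python
import numpy as np

# Sketch of layer repetition (self-merge): concatenate spans of an
# existing model's layers, with one span repeated, and renumber the keys.
# The base depth (32) and the chosen spans are illustrative.
base = {f"layers.{i}.weight": np.full((2, 2), float(i)) for i in range(32)}

def self_merge(state, spans):
    """Build a deeper state dict from (lo, hi) layer spans, repeats allowed."""
    out, j = {}, 0
    for lo, hi in spans:
        for i in range(lo, hi):
            out[f"layers.{j}.weight"] = state[f"layers.{i}.weight"]
            j += 1
    return out

# Keep layers 0-23, repeat 8-23, then finish with 24-31 -> 48 layers total.
merged = self_merge(base, [(0, 24), (8, 24), (24, 32)])
print(len(merged))  # 48
```

No retraining is involved at this step, which is why it's surprising the merged models score well at all.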

Do you remember the names of the previous experiments done on this? Would love to take a look.

reply
Just learned about it the other day from this thread from Feb 2024: https://old.reddit.com/r/LocalLLaMA/comments/1aqrd7t/i_made_...

It has some interesting GitHub links.

reply