From your second paper:

  > In particular, we can generate fixed random rotation matrices at initialization, and multiply them into the activations any time we read from or write to the residual stream. 
I guess I was mistaken in assuming this was one of the TurboQuant-specific innovations. Still an interesting concept, though.
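(To make the quoted idea concrete, here's a toy NumPy sketch of it as I understand it: a fixed random rotation sampled at init, applied whenever we write to the residual stream and undone when we read from it. All names here are made up, not from the paper.)

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # toy residual-stream width

# QR decomposition of a Gaussian matrix gives a random orthogonal matrix,
# i.e. a fixed random rotation generated once at initialization.
Q, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))

def write_to_stream(x):
    # rotate activations as they are written into the residual stream
    return x @ Q

def read_from_stream(h):
    # undo the rotation on read (Q is orthogonal, so its inverse is Q.T)
    return h @ Q.T

x = rng.standard_normal((3, d_model))
roundtrip = read_from_stream(write_to_stream(x))
print(np.allclose(roundtrip, x))  # the rotation is exactly invertible
```

Since the rotation is orthogonal it preserves norms and is exactly undone on read, so the network computes the same function while the stored activations look "rotated".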
reply
Do you know if this also applies to the Muon optimizer? It seems to be replacing AdamW.
reply
My guess is probably not for Muon. What I said about Adam was partly based on a blog post I read some time ago; I should have cited it as well [0].

The thing about Muon is that it doesn't have the specific feature of Adam that causes it to "move along the diagonal". Basically, imagine flattening the weights into one huge vector of a few billion elements. SGD moves along the gradient, which isn't biased toward any particular direction. Adam normalizes everything elementwise, so it effectively moves along a vector whose entries are close to ±1.
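(A quick toy demo of that elementwise-normalization point, not a claim about any particular training run: if Adam sees a steady gradient, its update m / (sqrt(v) + eps) converges to roughly sign(g), i.e. a ±1 vector, no matter how different the gradient scales are per coordinate.)

```python
import numpy as np

rng = np.random.default_rng(0)
# a gradient whose coordinates span wildly different scales
g = rng.standard_normal(10) * np.logspace(-3, 0, 10)

beta1, beta2, eps = 0.9, 0.999, 1e-8
m = np.zeros_like(g)
v = np.zeros_like(g)
for t in range(1, 1001):  # feed the same gradient repeatedly
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
m_hat = m / (1 - beta1**t)   # bias-corrected first moment -> g
v_hat = v / (1 - beta2**t)   # bias-corrected second moment -> g**2
adam_dir = m_hat / (np.sqrt(v_hat) + eps)

print(np.round(adam_dir, 3))             # entries are all close to +-1
print(np.round(g / np.abs(g).max(), 3))  # SGD direction keeps the raw scales
```

So Adam's update direction forgets the per-coordinate magnitudes almost entirely, which is the "moves along a ±1 vector" behavior.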

This isn't a proof or anything, but you can imagine that moving along ±1 directions somehow leads you to spiky solutions; I'm not sure how to prove that. Muon doesn't really do this, but it has its own sort of funky reshaping of the update (it moves along low-rank directions).
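(For reference, a rough sketch of Muon's reshaping: it treats each weight's gradient as a matrix and orthogonalizes it. The real optimizer approximates this with a Newton-Schulz iteration; here I just use an SVD for clarity. The effect is an update U @ V^T whose singular values are all 1, a matrix-level normalization rather than Adam's elementwise one.)

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))  # toy gradient for a 6x4 weight matrix

# "matrix sign" of G: keep the singular vectors, set singular values to 1
U, s, Vt = np.linalg.svd(G, full_matrices=False)
update = U @ Vt

print(np.round(np.linalg.svd(update, compute_uv=False), 6))  # all ones
```

So instead of normalizing each scalar entry to ±1, Muon normalizes whole directions in matrix space, which is why its bias looks quite different from Adam's.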

[0] https://www.lesswrong.com/posts/yrhu6MeFddnGRSLtQ/adam-optim...

reply