The thing about Muon is that it doesn't have this specific feature of ADAM that causes it to "move along the diagonal". Basically if you flatten weights as a huge vector of a few billion elements. SGD moves along the gradient, which isn't biased. ADAM normalizes everything elementwise, so it sort of moves along a vector of +-1.
This isn't a proof or anything, but what you can imagine might be happening is that if you move along +-1, then you find spikey solutions somehow. Not sure how to prove that. Muon doesn't really do this, but it has its own sort of funky reshaping of the update (it moves along low rank directions).
[0] https://www.lesswrong.com/posts/yrhu6MeFddnGRSLtQ/adam-optim...