In this context, the rotation is for spreading energy and ensuring predictable coordinate distributions rather than diagonalization; it makes coordinate-wise quantization much more computationally efficient, though it throws away learnable structure.
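To make the "spreading energy" point concrete, here is a minimal sketch, not taken from the paper under discussion. It assumes a randomized Hadamard transform as the rotation (random sign flips followed by an orthonormal Walsh-Hadamard transform) and a fixed uniform quantizer with a clipping range; the helper names `fwht`, `rotate`, `unrotate`, `quantize`, and `l2_error`, and the step and clip values, are all illustrative choices.

```python
import math
import random

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = list(v)
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in v]

def rotate(v, signs):
    # Random sign flips, then Hadamard: an orthogonal map that spreads energy.
    return fwht([s * x for s, x in zip(signs, v)])

def unrotate(v, signs):
    # The orthonormal Hadamard matrix is its own inverse; then undo the signs.
    return [s * x for s, x in zip(signs, fwht(v))]

def quantize(v, step=0.5, clip=3.0):
    # Fixed coordinate-wise grid: round to a multiple of `step`, clip to [-clip, clip].
    return [max(-clip, min(clip, round(x / step) * step)) for x in v]

def l2_error(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

random.seed(0)
d = 64
# Assumed toy input: almost all energy sits in a single coordinate.
x = [8.0] + [0.1 * random.gauss(0.0, 1.0) for _ in range(d - 1)]
signs = [random.choice((-1.0, 1.0)) for _ in range(d)]

direct = quantize(x)                                        # the spike gets clipped
rotated_back = unrotate(quantize(rotate(x, signs)), signs)  # quantize in rotated basis

err_direct = l2_error(x, direct)
err_rotated = l2_error(x, rotated_back)
print("direct error :", err_direct)
print("rotated error:", err_rotated)
```

After the rotation every coordinate has magnitude close to 1, so the same fixed grid covers all of them, whereas the unrotated spike falls outside the clipping range. The only extra state to carry is `signs`, since the Hadamard part is fixed.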
Ah OK, so intuitively it's like minimizing the error you incur when replacing the values with ones that follow a well-known distribution. So all you need to carry along is the rotation, plus the assumption that some amount of loss is acceptable.
There are papers that try to quantize angles associated with weights because angles have a more uniform distribution. I haven't read this specific paper, but at a glance it looks like it uses a similar trick.
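A quick way to see why angles can be convenient to quantize, as a rough sketch under a strong assumption: if weights are grouped into pairs of independent zero-mean Gaussian values with equal variance (an assumption made here for illustration, not something taken from any specific paper), the polar angle of each pair is uniform on the circle. The pairing scheme, sample size, and bin count below are all hypothetical.

```python
import math
import random

random.seed(0)
n_pairs = 8000
n_bins = 8
counts = [0] * n_bins

for _ in range(n_pairs):
    # Assumed model: each weight pair is two independent standard normals.
    w1, w2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    theta = math.atan2(w2, w1)  # angle in [-pi, pi]
    # Map the angle to one of n_bins equal-width bins; clamp the theta == pi edge.
    b = min(n_bins - 1, int((theta + math.pi) / (2 * math.pi) * n_bins))
    counts[b] += 1

print(counts)  # counts should be roughly equal across bins
```

With a roughly flat histogram, a uniform grid over the angle wastes few levels, which is the appeal of quantizing angles instead of raw values.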