upvote
They're asking what precision the model parameters were quantized/rounded to, look up "LLM quantization" to learn more.

The old headline was something like "We served GLM5.2 on AMD MI355X at 2626 tok/s/node". They think it's bad to advertise performance numbers like that without specifying how much you quantized the model (Since you can 'cheat' more performance by quantizing more, and not mentioning the reduced output quality)

reply