- quality for your specific application
- time to first token
- inter-token latency
- memory usage (varies even for a given bits per weight)
- generation of hardware required to run
Of those the hardest to measure is consistently "quality for your specific application".
It's so hard to measure robustly that many will take significantly worse performance on the other fronts just to not have to try to measure it... which is how you end up with full precision deployments of a 405b parameter model: https://openrouter.ai/meta-llama/llama-3.1-405b-instruct/pro...
When people are paying multiples more for compute to side-step a problem, language and technology that allows you to erase it from the equation is valid.
Some have the capability to figure it and can do it for both full precision and quantized. Most don't and cannot.