You are saying that people are using quantized models haphazardly and talking about them haphazardly. I'll grant that's not the exact same thing as making them haphazardly, but I think you take the point.
The terms shouldn't be used here; they aren't helpful. You are either getting good results or you are not. It shouldn't be treated any differently from further training on dataset d: the weights changed, so how much better or worse at task Y did the model just get? What actually matters is:
- quality for your specific application
- time to first token
- inter-token latency (both latency numbers are easy to script; see the sketch after this list)
- memory usage (varies even for a given bits per weight)
- generation of hardware required to run it
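The two latency numbers, at least, are cheap to measure yourself. A minimal sketch, assuming an OpenAI-compatible streaming endpoint (vLLM, llama.cpp's server, and most hosted gateways speak it); the URL and model name are placeholders:

```python
import time
from openai import OpenAI

# Placeholder endpoint -- point at whatever actually serves your model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def measure_latency(model: str, prompt: str):
    """Return (time to first token, mean inter-token latency) in seconds."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=256,
    )
    ttft = None
    arrivals = []
    for chunk in stream:
        now = time.perf_counter()
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = now - start  # first content chunk
            arrivals.append(now)
    # Mean gap between consecutive token arrivals.
    itl = (arrivals[-1] - arrivals[0]) / (len(arrivals) - 1) if len(arrivals) > 1 else 0.0
    return ttft, itl
```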
Of those, the one that's consistently hardest to measure is "quality for your specific application".
It's so hard to measure robustly that many will accept significantly worse performance on every other front just to avoid having to measure it... which is how you end up with full-precision deployments of a 405b-parameter model: https://openrouter.ai/meta-llama/llama-3.1-405b-instruct/pro...
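For what it's worth, even a crude before/after harness beats guessing. A minimal sketch, assuming the same OpenAI-compatible setup as above; the model IDs are hypothetical and the toy checks stand in for whatever pass/fail criteria your application actually has:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Replace with real prompts and checks from your application.
# The checks are the whole point: they encode what "quality" means for you.
CASES = [
    ("What is 17 * 23? Answer with just the number.", lambda out: "391" in out),
    ("Translate 'good morning' to French.", lambda out: "bonjour" in out.lower()),
]

def pass_rate(model: str) -> float:
    passed = 0
    for prompt, check in CASES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        if check(resp.choices[0].message.content or ""):
            passed += 1
    return passed / len(CASES)

# Hypothetical model IDs -- substitute whatever your server exposes.
print("fp16:", pass_rate("llama-3.1-405b-instruct"))
print("int4:", pass_rate("llama-3.1-405b-instruct-q4"))
```

Run the same cases against both deployments and the "how much worse at task Y did it just get?" question becomes a number instead of a vibe.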
When people are paying multiples more for compute just to side-step a problem, language and technology that let you erase that problem from the equation have real value.
Some have the capability to figure it out, and can do it for both full precision and quantized. Most don't and can't.