It solve the "I'm coding on the plane and need to look up this thing I've forgotten" problem, for me at least
So there is no value in testing quality of answers, but there is value in testing token speed.
You just have to have correct expectations.
For me local models is all about quality, and how to achieve that - e.g. by providing guardrails that test the job done.