The best tests are your own standardized tests, built around the tasks you personally care about and hard enough that the best models can't saturate them (so aim for a pass rate below 70% even in the best case).
All this is to say that most people are not doing the latter, and their vibes are so heavily confounded that they are mostly meaningless.
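For what it's worth, a personal test suite like that doesn't need to be much. Here is a rough sketch, assuming each model can be wrapped as a plain function from prompt to answer. The names `Task`, `pass_rate`, `evaluate`, `model_a` and `model_b` are all made up for illustration, and the toy "models" are lookup tables standing in for real API calls. The only number taken from the comment above is the 70% ceiling.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Task(NamedTuple):
    prompt: str
    check: Callable[[str], bool]  # True if the model's answer passes


Model = Callable[[str], str]  # hypothetical wrapper: prompt in, answer out


def pass_rate(model: Model, tasks: List[Task]) -> float:
    """Fraction of tasks whose checker accepts the model's answer."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for t in tasks if t.check(model(t.prompt)))
    return passed / len(tasks)


def evaluate(
    models: Dict[str, Model], tasks: List[Task], ceiling: float = 0.70
) -> Tuple[Dict[str, float], bool]:
    """Run every model on the same tasks and flag a saturated suite.

    The suite counts as saturated when the best model reaches the ceiling,
    which is the signal to add harder tasks.
    """
    rates = {name: pass_rate(m, tasks) for name, m in models.items()}
    saturated = max(rates.values()) >= ceiling
    return rates, saturated


# Toy tasks; in practice these would come from your own day-to-day work.
TASKS = [
    Task("2+2", lambda a: a.strip() == "4"),
    Task("capital of France", lambda a: "paris" in a.lower()),
    Task("reverse 'abc'", lambda a: a.strip() == "cba"),
    Task("10 % 3", lambda a: a.strip() == "1"),
]


# Stand-in models (assumption): canned answers instead of real API calls.
def model_a(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "")


def model_b(prompt: str) -> str:
    return {"2+2": "4"}.get(prompt, "")


rates, saturated = evaluate({"model_a": model_a, "model_b": model_b}, TASKS)
print(rates, "saturated:", saturated)
```

The point of fixing the tasks and checkers up front is that every model sees exactly the same thing, which is what makes the comparison standardized rather than vibes.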
You are right that this is not exactly subjectivity, but I think for most people it feels like it. We don't have good benchmarks (imo), we read a lot about other people's experiences, and we have our own. I think certain models are going to be objectively better at certain tasks; it's just that our ability to tell which ones is currently impaired.