Which LLM should we even use to judge taste? If we use Model X as the judge, does that give Model X an unfair advantage? Maybe we should use several models as judges, but then whichever model is best at recognising and praising its own code gets the advantage. The whole thing is just an unsolvable problem when an LLM is the judge.
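To make the worry concrete, here is a rough sketch of how one might check for that kind of advantage in a multi-judge setup. Everything in it is hypothetical: the `scores` table, the model names, and the numbers are made up for illustration, and the "gap" is just one simple way to look at it, not an established metric.

```python
from statistics import mean

def self_preference_gap(scores):
    """scores[judge][author] is the average score `judge` gave to `author`'s outputs.

    For each judge, return (score it gave its own outputs) minus
    (mean score the OTHER judges gave those same outputs).
    """
    gaps = {}
    for judge, row in scores.items():
        if judge not in row:
            continue  # this judge never scored its own outputs
        # What every other judge thought of this model's outputs.
        others = [scores[j][judge] for j in scores if j != judge and judge in scores[j]]
        if not others:
            continue  # nothing to compare against
        gaps[judge] = row[judge] - mean(others)
    return gaps

# Made-up numbers, purely illustrative.
scores = {
    "model_a": {"model_a": 8.0, "model_b": 6.0, "model_c": 6.5},
    "model_b": {"model_a": 7.0, "model_b": 7.5, "model_c": 6.0},
    "model_c": {"model_a": 7.0, "model_b": 6.5, "model_c": 7.0},
}
gaps = self_preference_gap(scores)
for judge, gap in gaps.items():
    print(f"{judge}: {gap:+.2f}")
```

A positive gap means a judge scored its own outputs higher than the other judges did. That alone doesn't settle anything, since the other judges may be biased against it, which is sort of the point: there is no neutral judge in the table to calibrate against.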
There have been studies showing that models tend to rate responses from their own model family higher than equivalent responses from other families, e.g. GPT-4 would prefer a response from GPT-3