> Is it giving an unfair advantage to Model X if we use Model X as the judge?

Studies have shown that models tend to rate responses from their own model family more favorably than equivalent responses from other families, e.g. gpt-4 would prefer a response from gpt-3