I found the opposite. GPT-5 is better at judging along a true gradient of scores, while Gemini loves to pick 100%, 20%, 10%, 5%, or 0%. Like you never get a 87% score.
I am running similar experiment but so far, changing the seed of openai seems to give similar results. Which if that confirms, is concerning to me on how sensitive it could be