What is the metric on which these models are being judged?
It's hard to define a discrete rubric for grading something that is inherently qualitative. To keep things simple, this test is purely PASS/FAIL - unsuccessful means that the model NEVER managed to generate an image adhering to the prompt. For example, Midjourney 7 did not manage to generate the correct vertical stack of translucent cubes ordered by color in 64 generation attempts. We often attempt a generous interpretation of the prompt - if it gets close enough, we might consider it a pass.
Put another way: if I were to show the final image to a random stranger on the street, would they be able to guess what the original prompt was? (aka the Pictionary test).
To paraphrase former Supreme Court Justice Potter Stewart, "I may not be able to define a passing image, but I know it when I see it."
To answer your question, the pass/fail is manually determined according to a set of well-defined criteria, which is usually specified alongside the image.

Now, are LLM judges flawed? Obviously. But they are more shelf-stable than human graders, so it's easier to compare results across runs. And as long as you use an LLM judge as a performance thermometer and not a direct optimization target, you won't run into too many issues.
If you do use an LLM judge as a direct optimization target, though? You'll see some funny things happen. Like GPT-5 prose. Which isn't even the weirdest it gets.
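For the curious, here is a minimal sketch of what "LLM judge as thermometer" looks like in practice: the judge sees the image plus the written criteria and returns a single PASS/FAIL verdict that is only recorded, never fed back into generation. This assumes the OpenAI Python SDK and a vision-capable model ("gpt-4o" here); the file path and criteria string are hypothetical placeholders, not the actual harness used for these tests.

```python
# Sketch: binary PASS/FAIL judging of a generated image against written criteria.
# Assumes the OpenAI Python SDK; model name, path, and criteria are illustrative.
import base64
from openai import OpenAI

client = OpenAI()

def judge_image(image_path: str, criteria: str) -> bool:
    """Ask the judge whether the image satisfies ALL stated criteria.

    Returns True for PASS, False for FAIL. The verdict is used only for
    reporting/comparison - it is never used as a reward signal.
    """
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable judge model
        temperature=0,   # deterministic-ish verdicts for comparability
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Does this image satisfy ALL of the following "
                            f"criteria?\n{criteria}\n"
                            "Answer with exactly one word: PASS or FAIL."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")

# Hypothetical usage, mirroring the cube-stack example above:
# passed = judge_image(
#     "attempt_17.png",
#     "- cubes stacked vertically\n- cubes are translucent\n- cubes ordered by color",
# )
```

The key design point is that the function's output goes into a results table for humans to read, not into a loop that re-prompts or fine-tunes the image model - that is the line between a thermometer and an optimization target.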