It's generally one-shot only: whatever comes out the first time is what I go with.

I've been contemplating a fairer version where each model gets 3-5 attempts and then selects which rendered image is "best".
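The best-of-N idea above could be sketched roughly like this (a minimal sketch; `generate` and `score` are hypothetical stand-ins for one rendering attempt and the model's own judgment of its outputs):

```python
import random

def best_of_n(generate, score, n=5, seed=0):
    """Produce n candidates and return the one the scorer ranks highest."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: a real harness would call the model for both steps,
# once per attempt and once to pick the favorite rendering.
best = best_of_n(
    generate=lambda rng: rng.uniform(0, 1),  # stand-in for one attempt
    score=lambda x: x,                       # stand-in for the model's ranking
    n=5,
)
```

The main design choice is whether the same model scores its own attempts or a separate judge does; the helper above is agnostic to that.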

reply
Try llm-consortium with --judging-method rank
reply
I think it would make the results much better and more representative of each model's abilities.
reply
It would... but the test is inherently silly, so I'm still not sure if it's worth me investing that extra effort in it.
reply