Try llm-consortium with --judging-method rank
I think it will make the results way better and more representative of model abilities.
It would... but the test is inherently silly, so I'm still not sure it's worth investing that extra effort in it.