by simonw 19 hours ago

by irthomasthomas 19 hours ago

Try llm-consortium with --judging-method rank

by andriy_koval 19 hours ago

I think it will make the results much better and more representative of model abilities.

by simonw 19 hours ago

It would... but the test is inherently silly, so I'm still not sure it's worth me investing that extra effort in it.
