by refulgentis 5 hours ago

by 4 hours ago|

deleted

by skysniper 5 hours ago

Sorry, didn't know that. Here is my handwritten TL;DR:

Gemini is very unreliable at using skills; it often just reads the skills and decides to do nothing.

StepFun leads the cost-effectiveness leaderboard.

Rankings really depend on the task, so it's better to try your own task.

by refulgentis 5 hours ago

It’s too late once it’s happened. I was curious, but when I saw that the site looked vibecoded and that you’re commenting with AI, I decided to stop trying to reason through the discrepancies between what was claimed and what’s on the site (e.g., 300 battles vs. only a handful in the site data).

by rat9988 4 hours ago

Too late for what? For you? Maybe. There are many others that are okay with it, and it doesn't diminish the quality of the work. Props to the author.

by refulgentis 4 hours ago

> Too late for what? For you? Maybe.

Maybe? :)

> There are many others that are okay with it

Correct.

> and it doesn't diminish the quality of the work.

It does affect newcomers hearing about the work.

I applaud your instinct to defend someone who put in effort. It's one of the most important things we can do.

Another important thing we can do for them is be honest about our own reactions. It's not sunshine and rainbows on its face, but it is generous, mostly because A) it takes time and B) other people might see red and harangue you for it.

by skysniper 4 hours ago

All 300+ battles are available at https://app.uniclaw.ai/arena/battles; every single battle is shown with its raw conversation history, produced files, the judge's verdict, and final scores.

by refulgentis 4 hours ago

Thanks! Is the judge an LLM? There are a lot of references to "just like LMArena", but LMArena is human-evaluated?

by skysniper 4 hours ago

> Is the judge an LLM?

Yes, the judge is one of opus 4.6, gpt 5.4, or gemini 3.1 pro (the submitter can choose). Self-judging (the judge model is also one of the participants) is excluded when computing the ranking.
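A minimal sketch of how the self-judge exclusion could work, not the site's actual code. It assumes a self-judged battle is dropped entirely (rather than only the judge's own score), and it uses a plain mean score per model as a stand-in for whatever ranking formula the arena really uses. The `Battle` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Battle:
    """Hypothetical record of one arena battle."""
    judge: str                # model that judged this battle
    scores: dict[str, float]  # participant model -> score given by the judge


def exclude_self_judged(battles: list[Battle]) -> list[Battle]:
    """Keep only battles whose judge is not also a participant (assumed rule)."""
    return [b for b in battles if b.judge not in b.scores]


def mean_scores(battles: list[Battle]) -> dict[str, float]:
    """Average score per model over non-self-judged battles.

    The mean is an illustrative stand-in; the real ranking may differ.
    """
    totals: dict[str, float] = {}
    counts: dict[str, int] = {}
    for battle in exclude_self_judged(battles):
        for model, score in battle.scores.items():
            totals[model] = totals.get(model, 0.0) + score
            counts[model] = counts.get(model, 0) + 1
    return {model: totals[model] / counts[model] for model in totals}
```

Under these assumptions, a battle judged by opus 4.6 in which opus 4.6 also competes contributes nothing to any model's average, while the same participants judged by gemini 3.1 pro would count.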

> There are a lot of references to "just like LMArena", but LMArena is human-evaluated?

Yeah, LMArena is human-evaluated, but here I found it impractical to gather enough human evaluation data because the effort it takes to compare the results is much higher:

- for code, the judge needs to read through it to check code quality, and actually run it to see the output

- when producing a webpage or a document, the judge needs to check the content and layout visually

- when anything goes wrong, the judge needs to read the execution log to see whether partial credit should be granted

If you look at the cost details of each battle (available at the bottom of the battle detail page), the judge typically costs more than any participant model.

If we evaluated with humans, I would say each evaluation could easily take ~5-10 min.
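As a rough back-of-the-envelope check on that estimate, here is a small sketch. It assumes one human evaluation per battle and takes 300 as the battle count (the thread says "300+"); the function name is made up for illustration.

```python
def human_eval_hours(num_battles: int,
                     minutes_low: float = 5.0,
                     minutes_high: float = 10.0) -> tuple[float, float]:
    """Return the (low, high) total human judging time in hours.

    Assumes exactly one evaluation per battle, which is a simplification.
    """
    return (num_battles * minutes_low / 60.0,
            num_battles * minutes_high / 60.0)


low, high = human_eval_hours(300)
print(f"{low:.0f}-{high:.0f} hours")  # prints "25-50 hours"
```

So even at the low end, the current set of battles would need more than a full day of continuous human judging, before accounting for multiple raters per battle.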

by refulgentis 3 hours ago

Fair enough, yeah, agent evals are hard, especially across N models :/

Thanks for replying btw, didn't mean any disrespect, good on you for not getting aggro about feedback

by skysniper 3 hours ago

I appreciate honest feedback, best way to learn :)