Yes, Flash doesn't seem to have the same rate limits as Pro.

I expect that once the API issues are fixed, v4-pro will be around the same level as GLM-5.

reply
Why would your test include scores from failed responses/runs? That seems confusing.

(I am confused by the results your website is presenting)

reply
Because the idea of those benchmarks is to see how well a model performs in real-world scenarios, as most models are served via APIs, not self-hosted.

So, for example, if GPT-5.5 were hypothetically super intelligent but its API failed 50% of the time, then using it in real-life scenarios would make your workflows fail far more often than using a "dumber" but more stable model.
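
To make that concrete, here's a minimal sketch of how a failure-inclusive score could be computed. The `call_model` and `grade` callables and the 0-to-1 scale are assumptions for illustration, not the benchmark's actual code:

```python
# Hypothetical sketch: a failed API call scores 0, so reliability
# is baked directly into the final benchmark number.
def run_test(call_model, prompt, grade) -> float:
    try:
        response = call_model(prompt)  # may raise on API errors/timeouts
    except Exception:
        return 0.0                     # a failed run counts as zero
    return grade(response)             # quality score in [0, 1]

def benchmark_score(call_model, tests) -> float:
    # tests: iterable of (prompt, grade) pairs. Failures drag the
    # mean down exactly as they would drag down a real workflow.
    scores = [run_test(call_model, prompt, grade) for prompt, grade in tests]
    return sum(scores) / len(scores)
```

Under that scheme, a model that answers perfectly but fails half its API calls caps out around 0.5, which is the GPT-5.5 hypothetical above.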

My plan is also to re-test models over time, which should account for infrastructure improvements and also catch model "nerfing".

reply
@danyw, we've reached the max comment thread depth

Yes, I would. Currently I don't have that many tests (~20), and by default a test "run" executes each test 3 times, so "bad luck" is already partly mitigated within each run.
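
A rough sketch of that repetition logic, assuming scores are averaged per test and then across tests (the names here are illustrative, not the harness's real API):

```python
import statistics

EXECUTIONS_PER_TEST = 3  # each "run" executes every test 3 times

def score_once(call_model, prompt, grade) -> float:
    # Same failure-as-zero rule as in the earlier sketch.
    try:
        return grade(call_model(prompt))
    except Exception:
        return 0.0

def run_suite(call_model, tests) -> float:
    # tests: iterable of (prompt, grade) pairs. Averaging 3
    # executions per test smooths out single-sample "bad luck",
    # e.g. one transient API failure out of three attempts.
    per_test_means = [
        statistics.mean(score_once(call_model, prompt, grade)
                        for _ in range(EXECUTIONS_PER_TEST))
        for prompt, grade in tests
    ]
    return statistics.mean(per_test_means)
```

With ~20 tests that works out to roughly 60 samples per run, which bounds (but doesn't eliminate) variance from unlucky individual calls.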

reply
Wouldn't you need to re-run across lots of samples (even for a single eval/bench) to avoid outsized impacts from just bad luck?
reply