I expect that once the API issues are fixed, v4-pro will be around the same level as GLM-5.
(I am confused by the results your website is presenting)
So, for example, hypothetically, if GPT-5.5 were super intelligent but API calls to it failed 50% of the time, then using it in real-life scenarios would make your workflows fail far more often than using a "dumber" but more stable model.
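To make that concrete, here's a quick sketch of how per-call failures compound across a multi-step workflow (the failure rates and call counts are purely illustrative):

```python
# Hypothetical: probability that a workflow of n sequential API calls
# all succeed, given a fixed per-call failure rate.
# Numbers below are illustrative, not measured.
def workflow_success_rate(per_call_failure: float, n_calls: int) -> float:
    return (1 - per_call_failure) ** n_calls

# A "smart" model behind a flaky API vs. a "dumber" but stable one,
# over a 5-call workflow:
flaky = workflow_success_rate(0.50, 5)   # 0.5^5 ~ 3% of workflows finish
stable = workflow_success_rate(0.02, 5)  # 0.98^5 ~ 90% finish
print(f"flaky: {flaky:.2%}, stable: {stable:.2%}")
```

Even a modest per-call failure rate gets ugly fast once calls are chained, which is why stability can matter more than raw intelligence in practice.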
My plan is also to re-test models over time, which should account for infrastructure improvements and also catch model "nerfing".
Yes, I would. Currently I don't have that many tests (~20), and by default a test "run" includes 3 executions of each test. So "bad luck" is already largely mitigated within each run, since every test is executed 3 times.