So, for example: if a hypothetical GPT-5.5 were super intelligent but its API failed 50% of the time, then using it in real-life scenarios would make your workflows fail far more often than using a "dumber" but more stable model.
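To make the compounding concrete, here is a minimal sketch assuming each API call in a sequential workflow fails independently with the same probability (the 50% and 99% figures are illustrative, not measured):

```python
# Sketch: how per-call reliability compounds across a multi-step workflow.
# Assumes independent, identically distributed failures per call.

def workflow_success_rate(per_call_reliability: float, num_calls: int) -> float:
    """Probability that every call in a sequential workflow succeeds."""
    return per_call_reliability ** num_calls

# A "smart" model whose API succeeds only 50% of the time:
flaky = workflow_success_rate(0.50, 3)   # 0.5^3 = 0.125 -> 12.5% of workflows finish
# A "dumber" but stable model with 99% per-call reliability:
stable = workflow_success_rate(0.99, 3)  # 0.99^3 ~= 0.970 -> ~97% finish
```

Even a modest per-call failure rate erodes quickly: a three-step workflow on the flaky model completes barely one time in eight.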
My plan is to also re-test models over time, which should account for infrastructure improvements and also catch model "nerfing".
Many models, especially open-weight ones, are served by a variety of providers over their lifetime. Each provider has its own reliability statistics, which can vary throughout a model's lifetime, as well as day to day and hour to hour.
Not to mention that there are plenty of gateways that track provider uptime and can intelligently route to the one most likely to complete your request.
All models are tested through OpenRouter. The providers on OpenRouter vary drastically in quality, to the point where some simply serve broken models.
That being said, I usually test models a few hours after release, at which point the only provider is typically the "official" one (e.g. DeepSeek for their models, Alibaba for theirs, etc.).
I don't really have a good solution for testing the reliability of closed-source models, BUT the conclusion still holds: a model/provider that is more reliable is statistically more likely to also give better results at any given time.
A solution would be to regularly test models (e.g. every week), but I don't have the budget for that, as this is a hobby project for now.
Yes, I would. Currently I don't have that many tests (~20), and by default a test "run" includes 3 executions of each test, so "bad luck" is already partly mitigated within each run.
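As a rough illustration of why 3 executions smooth out bad luck, here is a minimal sketch (the function and test names are hypothetical, not the project's actual code), assuming each execution yields pass/fail and a run's score is the average pass rate across tests:

```python
# Hypothetical aggregation: each test runs 3 times; the run score is the
# mean of per-test pass rates, so one flaky execution only dents the score
# instead of zeroing out the whole test.
from statistics import mean

def run_score(results: dict) -> float:
    """Average per-test pass rate over a run; values are lists of bools."""
    per_test = [sum(execs) / len(execs) for execs in results.values()]
    return mean(per_test)

results = {
    "test_a": [True, True, False],   # one unlucky failure
    "test_b": [True, True, True],
    "test_c": [False, True, True],
}
print(round(run_score(results), 3))  # -> 0.778
```

With a single execution per test, either unlucky failure above would have cost a full test's worth of score; averaging over 3 executions keeps the penalty proportional.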