I expect that once the API issues are fixed, v4-pro will be around the same level as GLM-5.
(I am confused by the results your website is presenting)
So, for example, hypothetically, if GPT-5.5 were super intelligent but API calls to it failed 50% of the time, then using it in real-life scenarios would make your workflows fail far more often than using a "dumber" but more stable model.
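To make that concrete, here's a quick sketch of how per-call failures compound across a multi-step workflow (the failure rates and call counts are purely illustrative):

```python
# Hypothetical: probability that a workflow of n sequential API calls
# all succeed, given a fixed per-call failure rate.
# Numbers below are illustrative, not measured.
def workflow_success_rate(per_call_failure: float, n_calls: int) -> float:
    return (1 - per_call_failure) ** n_calls

# A "smart" model behind a flaky API vs. a "dumber" but stable one,
# over a 5-call workflow:
flaky = workflow_success_rate(0.50, 5)   # 0.5^5 ~ 3% of workflows finish
stable = workflow_success_rate(0.02, 5)  # 0.98^5 ~ 90% finish
print(f"flaky: {flaky:.2%}, stable: {stable:.2%}")
```

Even a modest per-call failure rate gets ugly fast once calls are chained, which is why stability can matter more than raw intelligence in practice.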
My plan is also to re-test models over time, which should account for infrastructure improvements and also catch model "nerfing".
Yes, I would. Currently I don't have that many tests (~20), and by default a test "run" includes 3 executions of each test. So "bad luck" is already largely mitigated within each run, since every test is executed 3 times.