upvote
There's no good scientific way to test a closed-source model with both nondeterministic and subjective output.

This example image was generated using the API on high, not the low reasoning version. (it is slow and takes 2 minutes lol)

reply
If the results are quantifiable/objective and repeatable it's scientific, how is it not scientific?

The reasoning amount is part of the evaluation isn't it?

reply
This is the best kind of science there is: direct, empirical test.
reply