I can't reproduce this. Both high and low effort got it right.
It seems like they’ve been optimising their models for coding. That’s what the benchmarks used in the article suggest, at least.