GPT-5.4 and Opus 4.7, specifically, agree with each other on 65% of the claims (95% CI 62–68%). I.e., on the 35% of claims where they disagree, at least one of the two models must be wrong under this 4-bucket rubric.
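For anyone wondering how an interval like that comes about, here is a minimal sketch using the normal-approximation (Wald) interval for a proportion. The sample size is not stated in this thread, so `n = 1000` is purely an assumption picked because it gives a similar width. The method actually used for the reported CI is also unknown, and `wald_ci` is just an illustrative helper name.

```python
import math

def wald_ci(p_hat, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# ASSUMPTION: n is hypothetical; the thread does not give the number of claims.
n = 1000
agree_lo, agree_hi = wald_ci(0.65, n)

# Disagreement is the complement, so its interval is the agreement
# interval flipped around 1.
disagree_lo, disagree_hi = 1 - agree_hi, 1 - agree_lo

print(f"agreement:    {agree_lo:.1%} to {agree_hi:.1%}")
print(f"disagreement: {disagree_lo:.1%} to {disagree_hi:.1%}")
```

With that assumed `n`, the agreement interval comes out at roughly 62% to 68%, so the disagreement rate carries the same uncertainty of about three points either way.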
But that's without internet search. Everyone I know uses models that search when they need to, and I'm sure GPT and Opus would agree on almost everything if 1) they searched when necessary, and 2) they were allowed to add context to their answers instead of being hamstrung to produce specious "research" results.
Looks like they land at the average number of 67% disagreement.
I agree, but the market is pricing way beyond that.