
GPT-5.4 and Opus 4.7, specifically, agree with each other on 65% of the claims (95% CI 62–68%). That is, on at least 35% of the claims, at least one of the two models is wrong under this 4-bucket rubric.
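
For anyone who wants to sanity-check that interval, here is a minimal sketch using the normal-approximation (Wald) interval for a proportion. The sample size `n = 1000` is purely an assumption for illustration, since the number of claims isn't stated here, and the function name `wald_interval` is made up for this example.

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.

    p_hat: observed proportion (here, the agreement rate)
    n:     number of claims compared (ASSUMED below, not from the source)
    z:     critical value; 1.96 corresponds to roughly 95% coverage
    """
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

agreement = 0.65      # agreement rate quoted in the comment
n_assumed = 1000      # hypothetical sample size, chosen only for illustration

low, high = wald_interval(agreement, n_assumed)

# Whenever the two models disagree they cannot both be right, so the
# disagreement rate is a lower bound on "at least one model is wrong".
min_one_wrong = 1.0 - agreement

print(f"95% CI for agreement: {low:.2f} to {high:.2f}")
print(f"lower bound on 'at least one wrong': {min_one_wrong:.2f}")
```

With a sample of about that size, the interval comes out close to the quoted 62–68%, so the reported numbers are at least consistent with roughly a thousand claims. That is an inference from the interval width, not something the source states.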

by TaupeRanger 3 hours ago

But that's without internet search. Everyone I know uses models that search when they need to, and I'm sure GPT and Opus would agree on almost everything if 1) they searched when necessary, and 2) they were allowed to give context to their answers instead of being hamstrung in order to produce specious "research" results.

by spprashant 4 hours ago

Looks like they land at the average number of 67% disagreement.

by airstrike 4 hours ago

I agree, but the market is pricing way beyond that.