I've seen enough of this study to be confident in warning people not to take it at face value.

The headline result definitely does not hold: the task includes many questions that cannot be answered, yet offers no "cannot be answered" option, so on those questions models are forced to reply effectively at random.

I don't think this study is good enough that I should amplify it on my own blog, or bad enough that I should criticize it in a venue any more prominent than some Hacker News comments.