The headline result definitely does not hold: the task includes many questions that cannot be answered, but offers no "cannot be answered" option, so on those questions models are forced to reply effectively at random.
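To make the dilution concrete, here is a rough sketch of the arithmetic. Everything in it is an assumption for illustration, not a number from the study: the function name `observed_accuracy`, the 50% unanswerable share, the four answer choices, and the two hypothetical skill levels. It also assumes that on an unanswerable question a model matches the answer key only by chance.

```python
def observed_accuracy(true_skill: float, unanswerable_fraction: float, num_choices: int) -> float:
    """Expected score when unanswerable questions are graded with no abstain option.

    Assumption: on answerable questions the model is right with probability
    `true_skill`; on unanswerable ones it can only match the key by chance,
    i.e. with probability 1 / num_choices.
    """
    answerable = (1.0 - unanswerable_fraction) * true_skill
    forced_guess = unanswerable_fraction * (1.0 / num_choices)
    return answerable + forced_guess


# Hypothetical numbers, chosen only to show the effect.
strong = observed_accuracy(true_skill=0.9, unanswerable_fraction=0.5, num_choices=4)
weak = observed_accuracy(true_skill=0.6, unanswerable_fraction=0.5, num_choices=4)

true_gap = 0.9 - 0.6          # gap on the answerable questions alone
observed_gap = strong - weak  # gap after the forced guesses are mixed in

print(f"strong={strong:.3f} weak={weak:.3f} "
      f"true_gap={true_gap:.2f} observed_gap={observed_gap:.2f}")
```

Under these made-up inputs, the gap between the two models is cut in half and both scores are dragged toward chance, which is why a headline comparison built on such a score is hard to trust.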
I don't think this study is good enough that I should amplify it on my own blog, or bad enough that I should criticize it in a venue any more prominent than some Hacker News comments.