This is what is special about them:
> a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now;
I.e. these are problems of some practical interest, not just performative/competitive maths.
And this is what is known about the solutions:
> the answers are known to the authors of the questions but will remain encrypted for a short time.
I.e. a solution is known, but is guaranteed to not be in the training set for any AI.
FrontierMath did this a year ago. Where is the novelty here?
> a solution is known, but is guaranteed to not be in the training set for any AI.
Wrong, as the questions were posed to commercial AI models, and those models can solve them.
This paper violates basic benchmarking principles.
Not a mathematician, and obviously you guys understand this better than I do, but one thing I can't understand is how they're going to judge whether a solution was AI-written or human-written. A human could also potentially solve the problem and pass it off as AI work. You might ask why a human would want to do that; normal mathematicians probably wouldn't, but mathematicians hired by Anthropic or OpenAI might, to pass it off as an AI achievement.
Of course a math expert could solve the problems themselves and lie by saying that an AI model did it. In the same way, somebody with enough money could secretly film a movie and then claim that it was made by AI. That's outside the scope of what this paper is trying to address.
The point is not to score models by how many of the problems they solve. The point is to look at the models' responses and see how good they are at tackling the problems. That's why the authors say that, ideally, people solving these problems with AI would post complete chat transcripts (or the equivalent) so readers can assess how much of the intellectual contribution actually came from the AI.