... that are therefore liable to be in the training data?
This is incorrect. What is correct is the following: once one understands the existing literature on a question in the dataset, one can derive the answer without doing new mathematical research.
So the difference between "searching the literature" and "understanding the literature" is what made me believe it. But if you didn't, that's even better!
The goal was not to define unsolved problems.
But as such, the problems are also not previously published problems.
This seems quite reasonable IMHO.
> Stage W2 The five project-active models, see Table 2, attempted the question. Their answers were compared to the original answer by an LLM judge. If at most three models answered correctly, the contributor could proceed.
So "trivially contained in the training data" is excluded, as then all models could/should easily come up with the solution.
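To make the W2 gate concrete, here is a rough sketch of the logic as I read the quoted stage. The function names and the exact-match judge are my own stand-ins (the paper uses an LLM judge, which I am not reproducing here); only the "at most three correct" threshold comes from the quote.

```python
from typing import Callable, Sequence

# Threshold taken from the quoted Stage W2 text:
# "If at most three models answered correctly, the contributor could proceed."
MAX_CORRECT = 3


def passes_stage_w2(
    model_answers: Sequence[str],
    reference_answer: str,
    judge: Callable[[str, str], bool],
    max_correct: int = MAX_CORRECT,
) -> bool:
    """Return True if few enough models got the question right.

    `judge(answer, reference)` decides whether one model answer matches
    the contributor's original answer.
    """
    num_correct = sum(1 for answer in model_answers if judge(answer, reference_answer))
    return num_correct <= max_correct


def exact_match_judge(answer: str, reference: str) -> bool:
    # Hypothetical placeholder for the LLM judge: a plain
    # case-insensitive string comparison, for illustration only.
    return answer.strip().lower() == reference.strip().lower()


if __name__ == "__main__":
    # Three of five models are right, so the question passes the gate.
    print(passes_stage_w2(["42", "41", "42", "7", "42"], "42", exact_match_judge))
    # All five are right (e.g. the answer is effectively memorized), so it is rejected.
    print(passes_stage_w2(["42"] * 5, "42", exact_match_judge))
```

Under this reading, a question whose answer every model can simply recall fails the gate, which is the point being made above.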
A simple example, as a non-mathematician: I’d expect a well-trained LLM to be able to solve any integral that can be solved with integration by parts. I would be much more interested to see it solve one with no known solution using some novel technique.
Obviously this doesn’t really lend itself to making a benchmark, but if something is solvable by a known technique, and the LLM has had some kind of RL training in using that technique, seeing a solution isn’t too surprising.
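For reference, this is the kind of "known technique" case meant here. The specific integral is just an illustration I picked, not one from the dataset. Taking $u = x$ and $dv = e^x\,dx$:

```latex
\int u \, dv = uv - \int v \, du
\quad\Longrightarrow\quad
\int x e^{x} \, dx = x e^{x} - \int e^{x} \, dx = x e^{x} - e^{x} + C
```

Differentiating the result gives $e^x + x e^x - e^x = x e^x$, so it checks out.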