I had the same thought, because even if the exact solution doesn't appear, there's a notable difference between performing a literature search versus solving something de novo. But I think perhaps this benchmark wasn't meant to exclude the former, and the point may have been to test whether the model can accurately interpret and synthesize relevant output for research-level mathematical problems at all.
reply
I think you are underestimating the complexity of such problems. Someone with a PhD in the exact field of research would need days to weeks to understand what the problem means and how to solve it. This is far beyond "throwing standard techniques" at a problem. (But, as I keep emphasizing, it is also still far from solving research mathematics.)
reply
What did I say that led you to believe I was underestimating the complexity? I don't believe I commented on it at all.
reply
When you write "there's a notable difference between performing a literature search versus solving something de novo", you suggest that the questions we provided can be solved by doing a literature search.

This is incorrect. What is correct is the following: by understanding the existing literature on a question in the dataset, one can derive the answer without creating new mathematical research.

So the difference between "searching the literature" and "understanding the literature" is what made me believe it. But if you didn't mean it that way, that's even better!

reply
I can recommend reading section 2 of the paper.

The goal was not to define unsolved problems.

That said, the problems are also not previously published ones.

This seems quite reasonable IMHO.

reply
Partially. Section 2.2 (Submission workflow), stage W2, deals with this:

> Stage W2 The five project-active models, see Table 2, attempted the question. Their answers were compared to the original answer by an LLM judge. If at most three models answered correctly, the contributor could proceed.

So "trivially contained in the training data" is excluded, as then all models could/should easily come up with the solution.
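
To make the quoted rule concrete, here is a minimal sketch of the W2 gate in Python. It only encodes the threshold from the quote ("at most three" of five models correct). The function name `may_proceed`, the model labels, and the dict-of-verdicts shape are my own assumptions, not anything taken from the paper.

```python
# Hypothetical sketch of the W2 gate described in the quote above.
# Only the threshold ("at most three models answered correctly") comes
# from the quoted text; names and data shapes are illustrative.

MAX_CORRECT = 3  # contributor may proceed if at most this many models are correct


def may_proceed(judged_correct: dict[str, bool], max_correct: int = MAX_CORRECT) -> bool:
    """Return True if the question passes the W2 gate.

    judged_correct maps a model label to the LLM judge's verdict
    (True means the model's answer matched the original answer).
    """
    return sum(judged_correct.values()) <= max_correct


# Example: four of five placeholder models answer correctly, so the
# question is considered too easy and the contributor may not proceed.
verdicts = {
    "model_a": True,
    "model_b": True,
    "model_c": True,
    "model_d": True,
    "model_e": False,
}
print(may_proceed(verdicts))
```

Under this reading, a question whose answer sits more or less verbatim in the training data would likely be answered by four or five models and get filtered out at this stage.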

reply
“In the training data” isn’t really relevant for a modern LLM. The better question would be whether they are solvable using known techniques that have been fine-tuned in.

A simple example, as a non-mathematician: I’d expect a well-trained LLM to be able to solve any integral that can be solved with integration by parts. I would be much more interested to see it solve one with no known solution using some novel technique.
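
For reference, the kind of "known technique" integral meant here looks like the following. This is a standard textbook case chosen only for illustration, not a problem from the benchmark. It takes $u = x$ and $dv = e^x\,dx$:

```latex
\int x e^{x}\,dx
  = x e^{x} - \int e^{x}\,dx
  = x e^{x} - e^{x} + C
```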

Obviously this doesn’t really lend itself to making a benchmark, but if something is solvable by a known technique, and the LLM has had some kind of RL training on using that technique, seeing a solution isn’t too surprising.

reply