On problems this close to active research, seeing the model’s internal reasoning at the points of highest effort is more valuable than pass/fail outcomes alone, which is what SRT-Introspect makes possible on frozen models.

https://github.com/space-bacon/SRT

reply
But it still remains far away from mathematics research. Solving any of the problems would not result in a new research paper.
reply
Was this event sponsored by Surge AI? Why didn't you run the prompts yourself?
reply
No, they only provided large-scale model runs for us (this is explained in the acknowledgements). These runs would have been too expensive to perform myself, so I am happy they offered to provide them.
reply
Thanks for answering this random internet guy's question. It's a bit sad that a German math prof doesn't have sufficient funds to run a few prompts. I would have paid for them for this amount of advertising. I don't like that you gave them to a Silicon Valley company.

On that note, the tests are very US-centric. Only one Chinese model, and you unfairly nerfed it by limiting its context window, when the compressed context is deepseek v4's main innovation and even with full context it is much cheaper to run than all the others.

reply
Please indicate which other models you would like to see included. (And I agree that the context window limitations were not reasonable to have.) Finally: running these few prompts would have cost $10-20k if I had run them myself via the API. (And the company didn't ask to contribute; I asked whether they would be willing to do so, just saying.)
reply
Kimi K2.6 and mimo 2.5 pro are ahead of deepseek v4 in other benchmarks. Anyhow, great work, the benchmark seems to show great separation, so it should be very useful for improving the math capabilities of the next generation of AI. I'm more interested in the prompt engineering/orchestration and technical details (what I can do without millions), but I get that you are mathematicians, so your focus is obviously on the math. Sorry for the nagging.
reply
I don't like that you've called these problems "research-level", or your description that they are something you might give to a second-year PhD student. Some examples:

- Question 093 is a word problem of the kind that I would imagine is commonly given to high school students. Maybe it is slightly more difficult, but it doesn't appear to have any mathematical relevance and nobody would ever give it to a second-year PhD student.

- Question 096 is something I would expect a computer to do easily by brute force, and has essentially no mathematical content other than doing a calculation. (Under what circumstance does one care about taking base 10 digits and interpreting them in base 11?). Again, nobody would ever assign this to a math PhD student, and I expect that any undergrad who knows how to code can give you this answer.

- Question 016 is the kind of combinatorial problem that one could expect to brute force with a computer (and some decently-written code) even before AI. Again nobody would give it to a 2nd year PhD student because it is too random and of no academic interest.

- There are questions like 026 and 014, about computing Hilbert series. Computing Hilbert series is a standard computer algebra task that nobody would want to do by hand before generative AI, and certainly not now.

Similar comments apply to many others. There are plenty of random-looking computational questions of exactly the type that one expects not only that computers can solve, but that computers should be used to solve, because nobody would ever do them by hand. None of them are research-level --- certainly not anything that would be considered publishable (before generative AI or after) --- despite the subtitle of the paper saying "research-level". And if you give them to a 2nd year PhD student I would imagine you would just be wasting their time.

I also don't like your phrasing "much harder than any exam question in any exam". If I ask you to multiply two 1000 digit numbers, the question is "much harder" than any question that will ever appear on any exam. Everyone understands the computer will do it instantly, and it doesn't demonstrate anything relevant. There is a clear regime in which one expects AI-type methods to perform better (combinatorial, calculation-based questions which can be answered using standard methods), and other regimes where one expects worse performance (e.g., proofs of statements that use abstract concepts). Why is there nothing here of the second type?

reply
I cannot keep answering everyone's comments of the type "Why did you consider / not consider?" or "Here are much better ideas". I promise you that we have thought quite a bit about the setup and have discussed it with many math researchers.

1. Why do you compare it to multiplying two 1000 digit numbers and not to factoring a 4096-bit number into its 2 prime factors, when not knowing any details?

2. The questions are of a theoretical nature, even if a little calculation is involved. This does not mean that the problems are not solvable using a computer program, but it means that they are not solvable with reasonable effort with a computer program.

3. And we do not ask for proofs because other projects already do that (IMProofBench, please have a look) and because we cannot grade LLM answers that way: a human would need to understand each provided proof -- and this is not what I or we or actually most researchers are interested in doing.

reply
> 1. Why do you compare it to multiplying two 1000 digit numbers and not to factorizing a 4096-bit numbers into its 2 prime factors, when not knowing any details?

The objection is to the phrasing "much harder". One should distinguish between something that is difficult for reasons stemming from a lack of computational power and something that is difficult for reasons stemming from a lack of relevant abstractions or the ability to grapple with them. If the reason that a particular problem is "hard" for a PhD student is that they have to do a long calculation, and not a lack of conceptual understanding, then it doesn't say much about the capabilities of generative AI if the computer solves it.

Hence the example: multiplying two large numbers is hard for the former reason, not the latter. Your example of factoring a 4096-bit semiprime is hard for both reasons (because the brute force method is too slow).
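
A rough Python sketch of that distinction, under stated assumptions: the helper names `random_n_digit` and `trial_division` are made up for illustration, and the toy semiprime is tiny on purpose. Multiplying two 1000-digit integers is a single built-in operation, while naive trial division takes on the order of sqrt(n) steps, which for a balanced 4096-bit semiprime would be around 2^2048.

```python
import random
import time


def random_n_digit(n: int, rng: random.Random) -> int:
    """Return a random integer with exactly n decimal digits (illustrative helper)."""
    return rng.randrange(10 ** (n - 1), 10 ** n)


def trial_division(n: int) -> tuple[int, int]:
    """Split n into (smallest prime factor, cofactor) by brute force.

    The loop runs up to sqrt(n) times, so it is only usable for tiny inputs.
    """
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return 1, n  # no divisor found: n is prime (or n < 4)


rng = random.Random(0)
a = random_n_digit(1000, rng)
b = random_n_digit(1000, rng)

# "Hard" only in the sense of being long: one arbitrary-precision multiplication.
start = time.perf_counter()
product = a * b
elapsed = time.perf_counter() - start
print(f"1000-digit multiplication took {elapsed:.6f} s")

# Toy semiprime: brute force succeeds only because both primes are about 10^4.
p, q = trial_division(10007 * 10009)
print(f"toy semiprime factors: {p} * {q}")
```

The same loop applied to a 4096-bit semiprime would never finish, which is the sense in which that example is hard for more than one reason.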

reply
Well, you are correct that one should distinguish the two. But we give no indication that the questions are hard because of computational tasks and we give many indications that the problems are of a theoretical nature and hard for theoretical reasons. There is not a single question where a PhD student would need to do a long calculation.

I trust the judgement of the respected researchers who submitted the questions. I know them personally, they publish research under their full names, and their names are fully disclosed in the paper. You should trust them, too.

Please consider disclosing your name and your field of expertise, pick a question in your own research area and explain to me why this question is not research-level. And, best of all, solve it yourself to clarify why it was too easy.

reply
I solve 034.

By [1, Theorem 4.1], the Neron-Severi rank of the perfectoid cover is the same as the Neron-Severi rank of the reduction. For a product E x E' of elliptic curves, it is well known that NS(E x E') = NS(E) + NS(E') + Hom(E,E'); see [2, Prop. 2.3]. Since E = E' here and E is supersingular, this number is 1 + 1 + 4 = 6.
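
Spelled out as a rank count, this uses only the decomposition quoted above with Hom(E,E) = End(E), plus the standard fact that the endomorphism ring of a supersingular elliptic curve over an algebraically closed field has rank 4 over the integers (it is an order in a quaternion algebra):

```latex
\[
\operatorname{rk} \mathrm{NS}(E \times E)
  = \operatorname{rk} \mathrm{NS}(E) + \operatorname{rk} \mathrm{NS}(E)
    + \operatorname{rk}_{\mathbb{Z}} \operatorname{End}(E)
  = 1 + 1 + 4 = 6 .
\]
```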

Is it research level? It of course takes a graduate student a long time to understand, say, what a perfectoid space is. But the statement follows immediately from quoting the literature, as long as one knows what to quote.

1. https://arxiv.org/pdf/2105.05230 2. https://arxiv.org/pdf/1402.2233

reply
You see yourself that your own solution is purely of theoretical nature and not at all what you wrote before, right? (And no, I am not commenting on your answer.)
reply
Haha, the classic “Why didn’t you do X?” comments always appear. I think a lot of people underestimate how deeply quality researchers think about such setups. My genuine standard reply to those folks is: do the research with your setup and publish it.
reply
What would have been more interesting is if LLMs were tested with questions where the direct solutions are not publicly available (so not in training data). In that case I wonder how much hallucination would occur, or whether the model would try to connect the dots from what is publicly available and come up with a direct solution.
reply
I don't understand why you expect that an answer known to the researcher but which has never been published should be in the training data. You possibly misunderstand what these problems look like -- we made them all publicly available on the website, so please have a look: https://math.sciencebench.ai/benchmarks/benchmarks-in-leipzi...
reply