On that note, the tests are very US-centric. There is only one Chinese model, and you unfairly nerfed it by limiting its context window, even though compressed context is DeepSeek V4's main innovation and even with full context it is much cheaper to run than all the others.
- Question 093 is a word problem of the kind that I would imagine is commonly given to high school students. Maybe it is slightly more difficult, but it doesn't appear to have any mathematical relevance and nobody would ever give it to a second-year PhD student.
- Question 096 is something I would expect a computer to do easily by brute force, and has essentially no mathematical content other than doing a calculation. (Under what circumstances does one care about taking base-10 digits and interpreting them in base 11?) Again, nobody would ever assign this to a math PhD student, and I expect that any undergrad who knows how to code can give you this answer.
- Question 016 is the kind of combinatorial problem that one could expect to brute force with a computer (and some decently written code) even before AI. Again, nobody would give it to a second-year PhD student because it is too random and of no academic interest.
- There are questions like 026 and 014, about computing Hilbert series. Computing Hilbert series is a standard computer algebra task that nobody would want to do by hand before generative AI, and certainly not now.
Similar comments apply to many others. There are plenty of random-looking computational questions of exactly the type that one expects computers not only can solve, but should be used to solve, because nobody would ever do them by hand. None of them are research-level, and certainly none would be considered publishable (before generative AI or after), despite the subtitle of the paper saying "research-level". And if you give them to a second-year PhD student, I would imagine you would just be wasting their time.
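To make the brute-force point about Question 096 concrete, here is a minimal Python sketch. The actual question is not reproduced in this thread, so the condition `is_hit` below is a hypothetical stand-in, not the real problem; the sketch only illustrates that reinterpreting base-10 digits in base 11 and scanning a range takes a few lines.

```python
def reinterpret_base11(n: int) -> int:
    """Read the base-10 digits of n as if they were base-11 digits."""
    return int(str(n), 11)


def is_hit(n: int) -> bool:
    # Hypothetical stand-in condition, NOT the one from Question 096:
    # the base-11 reading of n is divisible by n itself.
    return reinterpret_base11(n) % n == 0


def brute_force(limit: int) -> list[int]:
    """Return every n in 1..limit that satisfies the stand-in condition."""
    return [n for n in range(1, limit + 1) if is_hit(n)]
```

For example, `reinterpret_base11(123)` reads the digits 1, 2, 3 in base 11, i.e. 1*121 + 2*11 + 3. Swapping in a different condition only changes `is_hit`; the scan itself stays the same.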
I also don't like your phrasing "much harder than any exam question in any exam". If I ask you to multiply two 1000-digit numbers, the question is "much harder" than any question that will ever appear on any exam. Everyone understands that a computer will do it instantly, and that it doesn't demonstrate anything relevant. There is a clear regime in which one expects AI-type methods to perform better (combinatorial, calculation-based questions which can be answered using standard methods), and other regimes where one expects worse performance (e.g., proofs of statements that use abstract concepts). Why is there nothing here of the second type?
1. Why do you compare it to multiplying two 1000-digit numbers and not to factoring a 4096-bit number into its two prime factors, without knowing any details?
2. The questions are of a theoretical nature, even if a little calculation is involved. This does not mean that the problems are not solvable using a computer program, but it means that they are not solvable with reasonable effort using a computer program.
3. And we do not ask for proofs because other projects already do that (IMProofBench, please have a look), and we cannot grade LLM answers, because a human would need to understand the provided proof, which is not what I, we, or actually most researchers are interested in doing.
The objection is to the phrasing "much harder". One should distinguish between something that is difficult for reasons stemming from a lack of computational power and something that is difficult for reasons stemming from a lack of relevant abstractions or the ability to grapple with them. If the reason that a particular problem is "hard" for a PhD student is that they have to do a long calculation, but not because of a lack of conceptual understanding, then it doesn't say much about the capabilities of generative AI if the computer solves it.
Hence the example: multiplying two large numbers is hard for the former reason, not the latter. Your example of factoring a 4096-bit semiprime is hard for both reasons (because the brute force method is too slow).
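The contrast can be sketched in Python. This is only an illustration under stated assumptions: the two 1000-digit numbers are random, the semiprime is a deliberately tiny toy (10007 * 10009), and `trial_division` and `random_n_digit` are names made up here. Naive trial division needs on the order of sqrt(n) steps, which is hopeless for a 4096-bit modulus.

```python
import random
import time


def trial_division(n: int) -> tuple[int, int]:
    """Naive factoring: return (p, n // p) where p is the smallest factor > 1."""
    if n % 2 == 0:
        return 2, n // 2
    d = 3
    while d * d <= n:  # about sqrt(n) / 2 iterations in the worst case
        if n % d == 0:
            return d, n // d
        d += 2
    return n, 1  # no proper divisor found: n is prime (or 1)


def random_n_digit(digits: int, rng: random.Random) -> int:
    """Random integer with exactly `digits` decimal digits."""
    return rng.randrange(10 ** (digits - 1), 10 ** digits)


rng = random.Random(0)
a = random_n_digit(1000, rng)
b = random_n_digit(1000, rng)

start = time.perf_counter()
product = a * b  # built-in big-integer multiplication
mult_seconds = time.perf_counter() - start

semiprime = 10007 * 10009  # toy example; real RSA moduli are vastly larger
factors = trial_division(semiprime)

print(f"1000-digit multiplication took {mult_seconds:.6f} s")
print(f"{semiprime} = {factors[0]} * {factors[1]}")
```

The multiplication needs no insight and finishes in a negligible amount of time, while the same brute-force loop applied to a 4096-bit semiprime would not finish at all, so any practical attack has to rely on better ideas rather than more arithmetic.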
I trust the judgement of the respected researchers who submitted the questions. I know them personally, they publish research under their full names, and their names are fully disclosed in the paper. You should trust them too.
Please consider disclosing your name and your field of expertise, then pick a question in your own research area and explain to me why it is not research-level. And, best of all, solve it yourself to clarify why it was too easy.
By [1, Theorem 4.1], the Néron-Severi rank of the perfectoid cover is the same as the Néron-Severi rank of the reduction. For a product E x E' of elliptic curves, it is well known that NS(E x E') = NS(E) + NS(E') + Hom(E,E'); see [2, Prop. 2.3]. Since E = E' here and E is supersingular, Hom(E,E') = End(E) has rank 4, so the rank is 1 + 1 + 4 = 6.
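Spelled out, the rank count reads as follows. This is only a restatement of the facts quoted above, assuming we work over an algebraically closed field and using the standard result that the endomorphism ring of a supersingular elliptic curve has rank 4 as a Z-module.

```latex
\operatorname{rk} \mathrm{NS}(E \times E)
  = \operatorname{rk} \mathrm{NS}(E)
  + \operatorname{rk} \mathrm{NS}(E)
  + \operatorname{rk}_{\mathbb{Z}} \operatorname{End}(E)
  = 1 + 1 + 4
  = 6.
```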
Is it research level? It of course takes a graduate student a long time to understand, say, what a perfectoid space is. But the statement follows immediately from quoting the literature, as long as one knows what to quote.
1. https://arxiv.org/pdf/2105.05230
2. https://arxiv.org/pdf/1402.2233