At work I had to switch to using GPT 5.4 Mini and Qwen 3.6 27B.
The results were near useless.
The error rate is through the roof: they are constantly incorrect in their conclusions, even when investigating very simple issues.
Further, the models are too unreliable to even move 20-line snippets around without inadvertently modifying them. Ask them to correct it and they still get it wrong.
Maybe the larger Chinese models are better, but the Mini stuff is next to useless to me.
I am just testing it on stuff I know intimately myself. I would probably not understand a proof of Collatz if it was dancing in front of me!
Sorry to belabor this, but it's basically pointless to say you have nuts it can't crack without showing us the nuts.
I gave a high-level description of the problems in a sibling thread. They are the kind of small problems which I suppose every researcher has lying around, waiting for them to think about some day. But not the big problems everyone is waiting to see solved.
My comment was not meant to be a tease – sorry! I assumed there would be other people in a similar situation, who might relate.
The curse of the 'use case' comes in here too. When people think that everything should have a use case, that's a lot of training data suggesting to a model that things should only be used for what someone has already thought of.
A couple of times I have had to manually code proof of concept pieces so that the model breaks out of that "unpossible" mode and actually helps me.
I can't remember if it was ChatGPT or Claude, but when I showed it how to get a MessagePort in its JavaScript executor through to the artifact/canvas, it quickly went from "That can't be done" to positively enthusiastic about the possibilities. I suspect those shenanigans will be well off the table for Fable though.
(Joking aside, see sibling threads.)
Did you add "make no mistake" to your prompt?
Recently (last couple of months?) these models have become useful tools for mathematicians, because they can solve easier problems more quickly, meaning that one can tackle bigger challenges (but maybe not RH et al) piece by piece.
But there are still definite limits: problems that one could expect an expert human to solve, given time, but that the models do not. Thus, more intelligence would be nice!
I am pretty sure this time I am catching the sarcasm here. Kudos, you had me in the first half.
These are not Fields medal type problems, nor known difficult/open conjectures. Just small stuff I have collected in my todo list over the years.
A year ago my judgement was that I had wasted my time trying to work with the models, and that doing things myself would have been more productive, as I would have gained intuition from the failures. Now it definitely seems to have figured out stuff that would have taken me more time than I have to spare on this problem...
Being a theory builder more than a problem solver, I am excited for the future.
Also excited for fully formalised mathematics to hit the mainstream!