You talk as if problem solving were a supervised (imitation) learning problem. It isn't; it's a reinforcement learning problem: models learn by solving problems and getting rated, so they generate their own training data. Optimal budget allocation is roughly 1/3 of the cost on pre-training, 1/3 on RL, and 1/3 on inference.
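The "generate their own training data" loop can be sketched in miniature. This is a hypothetical toy, not any lab's actual setup: the "model" is just a probability distribution over candidate answers, the `verifier` is a stand-in for an automated grader, and the update is a crude REINFORCE-style reinforcement of rewarded behavior.

```python
import random

random.seed(0)

# Toy "model": a distribution over candidate answers to "what is 2 + 2?"
candidates = ["2", "3", "4", "5"]
weights = [1.0, 1.0, 1.0, 1.0]  # uniform prior

def verifier(answer):
    """Rates a solution attempt: 1.0 if correct, 0.0 otherwise."""
    return 1.0 if answer == "4" else 0.0

LEARNING_RATE = 0.5
for step in range(200):
    # The model generates an attempt (its own training data)...
    i = random.choices(range(len(candidates)), weights=weights)[0]
    # ...and gets it rated; no human-written solutions are involved.
    reward = verifier(candidates[i])
    # Reinforce rewarded behavior (crude REINFORCE-style update).
    weights[i] += LEARNING_RATE * reward

probs = [w / sum(weights) for w in weights]
print(f"P(correct answer) after training: {probs[candidates.index('4')]:.2f}")
```

After a few hundred iterations, probability mass concentrates on the answer the verifier rewards — the point being that no supervised examples of correct solutions are ever provided, only ratings.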
> The difference is that when a human reasoner goes to solve a problem, they'll think "this kind of proof usually goes this way" - following an explicit rule enforcement.
How is this different from "probabilistic pattern selection"?