Sorry for gushing, but I'm amazed that the AI got this far from "book learning" alone, without ever stepping into a hospital or even watching an episode of a medical drama, let alone feeling what an actual arm is like.
If we have actually reached the limit of book learning (which is not clear to me), I suppose the next phase would be to have AIs practice against a medical simulator, where the models could see the actual (simulated) result of their intervention rather than a "correct"/"incorrect" response. Do we actually have a simulator good enough to cover everything in such questions?
As for your suggestion on learning from simulations: it does sound interesting for expanding both pre- and post-training, but it still wouldn't address this problem; it would only hide the shortcomings better.
Can you say more about why you believe this? To me, these seem to be exactly the same sort of questions as on HLE [0], and we've seen massive and consistent improvement there: o1 (which was evaluated on this question) scored 7.96, whereas the score is now up to 37.52 (gemini-3-pro-preview). It's far from a perfect benchmark, but we're seeing similar growth across all benchmarks, and I've personally seen significantly improved capabilities for my use cases over the last couple of years, so I'm really unclear about any fundamental limits here. Obviously we still need to solve continuous learning and embodiment, but neither seems like a limit here if we can use a proper RL-based training approach with a sufficiently good medical simulator.
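To make that concrete, here is a minimal sketch of what RL against a medical simulator could look like. Everything in it is hypothetical: the MedicalSimulator class, its reset/step interface, the interventions, and the crude tabular update are all made up for illustration (a real setup would use a proper RL algorithm such as PPO against a far richer simulator).

    import random

    class MedicalSimulator:
        """Toy stand-in for a clinical simulator: the model is rewarded
        for the simulated outcome of its intervention, not for matching
        a "correct" label."""

        def reset(self):
            self.hr = random.gauss(95, 10)  # hypothetical patient heart rate
            return {"hr": self.hr}

        def step(self, intervention):
            # Invented physiological responses, for illustration only.
            effect = {"beta_blocker": -10, "fluids": +5, "observe": 0}[intervention]
            self.hr += effect + random.gauss(0, 2)
            reward = -abs(self.hr - 75)   # closer to a normal rate is better
            done = abs(self.hr - 75) < 5  # patient stabilized
            return {"hr": self.hr}, reward, done

    ACTIONS = ["beta_blocker", "fluids", "observe"]

    def policy(state, weights):
        # Placeholder policy: score each action by a learned weight,
        # keyed on a crude feature (is the heart rate elevated?).
        return max(ACTIONS, key=lambda a: weights.get((a, state["hr"] > 90), 0.0))

    sim, weights = MedicalSimulator(), {}
    for episode in range(200):
        state = sim.reset()
        for t in range(20):  # cap episode length
            feature = state["hr"] > 90
            action = policy(state, weights)
            state, reward, done = sim.step(action)
            # Crude tabular update standing in for a real RL algorithm.
            weights[(action, feature)] = weights.get((action, feature), 0.0) + 0.01 * reward
            if done:
                break

The point of the design is the reward signal: it comes from the simulated outcome, so the model gets graded on what its intervention actually does rather than on whether it matched an answer key.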
The simulator or world-model approach is being investigated. To your point, textual questions alone do not provide adequate coverage to assess real-world reasoning.
The real solution is to have 4 AIs answer and let the human decide. If all 4 say the same thing, it's easy. If there is disagreement, further analysis is needed.
Are two heads better than one? The post explains why an even number of decision-makers doesn't improve decision-making (an even panel can split down the middle, so the tie has to be broken some other way).
Would that still be relevant here?
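For what it's worth, the even-panel point is easy to check numerically: with 4 independent voters, a 2-2 split has to be broken by a coin flip, and the overall accuracy works out identical to a 3-voter panel. A quick sketch (the 0.8 per-voter accuracy is just an illustrative number):

    from math import comb

    def majority_accuracy(n, p):
        """P(majority is correct) for n independent voters on a binary
        question, each correct with probability p; ties split 50/50."""
        acc = 0.0
        for k in range(n + 1):
            pk = comb(n, k) * p**k * (1 - p)**(n - k)
            if 2 * k > n:
                acc += pk        # clear majority is correct
            elif 2 * k == n:
                acc += 0.5 * pk  # tie, broken by coin flip
        return acc

    print(majority_accuracy(3, 0.8))  # 0.896
    print(majority_accuracy(4, 0.8))  # 0.896 -- the 4th voter adds nothing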
You could tighten the standard: if any of the 4 disagrees, reject the answer outright.
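Here is a sketch of that stricter standard: accept an answer only when all 4 models are unanimous, and escalate everything else to a human. The ask_model stub and the model names are hypothetical placeholders, not a real API.

    def ask_model(model_name: str, question: str) -> str:
        # Placeholder: in practice this would call the model's API and
        # normalize the response so answers are comparable.
        return "yes"

    def unanimous_answer(question, models=("model_a", "model_b", "model_c", "model_d")):
        answers = [ask_model(m, question) for m in models]
        if len(set(answers)) == 1:
            return answers[0]  # all 4 agree: accept automatically
        return None            # any disagreement: reject and escalate

    result = unanimous_answer("Is intervention X indicated here?")
    if result is None:
        print("Disagreement -- route to a human reviewer.")

The trade-off is throughput: unanimity catches more disagreements but also rejects more answers, so the human review queue grows.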