It was terrible. You could upload 30 pages of financial documents and it would decide "yeah, this doesn't require reasoning." They improved it a lot, but it still makes mistakes constantly.
I assume something similar is happening in this case.
With a small, bounded compute budget, your router/thinking switch is going to make mistakes sometimes. Same with speculative decoding, branch predictors, etc.
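
To make that concrete, here's a toy sketch of the failure mode. Everything in it is made up for illustration (the `needs_reasoning` function, the keyword list, the `budget` parameter); I have no idea how any real router works. The assumption is just that the switch can only afford to look at a small slice of the input before deciding.

```python
# Toy illustration only: a hypothetical "thinking switch" that can afford to
# inspect just the first `budget` words of a prompt. Not any real product's router.
REASONING_HINTS = {"calculate", "prove", "compare", "reconcile", "why"}


def needs_reasoning(prompt: str, budget: int = 8) -> bool:
    """Cheap router: scan only the first `budget` words for hint keywords."""
    words = prompt.lower().split()[:budget]
    return any(w.strip(".,?!:;") in REASONING_HINTS for w in words)


short_hard = "Prove that the sum of two even numbers is even."
# Long document first, hard question at the very end.
long_hard = ("Attached are the Q3 statements. " * 50
             + "Reconcile the cash flow against the balance sheet.")
easy = "What's the capital of France?"

print(needs_reasoning(short_hard))                # True
print(needs_reasoning(long_hard))                 # False: misroute, the hard part is past the budget
print(needs_reasoning(easy))                      # False
print(needs_reasoning(long_hard, budget=10_000))  # True once the budget covers the whole prompt
```

The cheap version gets the short prompts right and whiffs on the long one, which is roughly the "30 pages of financial documents" case. Raising the budget fixes it, but then the router isn't cheap anymore, which is the whole tradeoff.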