My hunch is that Opus scale models probably have shortcuts encoded into the model that handle these ambiguities cases, wheres this model has learned a program to reason through the edge case (crystalized vs fluid intelligence). Remembering that probablity (frontier) vs calculating it on the fly (vibethink)
> [...]
> LLM-based Query Quality Filtering. We utilize capable LLMs to assess query quality, filtering out samples with incomplete descriptions, unreasonable conditions, invalid logic, or an inability to effectively assess target knowledge points.