These types of experiments prove to me that there is no real "reasoning" happening and "reasoning/thinking" tokens as a concept are mostly there to convince people to use models that consume more tokens and produce more revenue. The output from reasoning models might be more accurate, but that's just a consequence of a longer inference runtime. There is no "reasoning" happening; reasoning is just sales/UX bullsh*t.
The prompt allowed for exactly four valid outputs and explicitly disallowed explanations and qualifiers.
> Output exactly one label: True, Mostly True, Misleading, or False. No explanations, no qualifiers.
How is that a nuanced response?
> These types of experiments prove to me that there is no real "reasoning" happening and "reasoning/thinking"
My suggestion is that five presumably reasoning and thinking humans would also have variation in their responses to the exact same prompt.
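A rough sketch of how one might quantify that variation, for models or for humans: check each response against the four allowed labels, then measure how often the responses agree. The function name `summarize` and the sample responses are made up for illustration; they are not data from the experiment being discussed.

```python
from collections import Counter

# The four labels the quoted prompt allows.
VALID_LABELS = {"True", "Mostly True", "Misleading", "False"}


def summarize(responses):
    """Tally valid labels, collect invalid outputs, and compute agreement.

    Agreement is the share of all responses that match the most common
    valid label (1.0 means everyone gave the same valid answer).
    """
    cleaned = [r.strip() for r in responses]
    invalid = [r for r in cleaned if r not in VALID_LABELS]
    counts = Counter(r for r in cleaned if r in VALID_LABELS)
    top = counts.most_common(1)
    agreement = top[0][1] / len(cleaned) if top else 0.0
    return {"counts": dict(counts), "invalid": invalid, "agreement": agreement}


# Hypothetical responses from five runs (or five people); not real results.
sample = ["True", "Mostly True", "True", "Misleading", "True"]
result = summarize(sample)
print(result["counts"])
print(result["agreement"])
```

Running the same tally over five human answers and five model answers to the same prompt would give a like-for-like comparison, instead of treating any variation at all as evidence either way.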