The prompt allowed for exactly four valid outputs and explicitly disallowed explanations and qualifiers.
> Output exactly one label: True, Mostly True, Misleading, or False.
> No explanations, no qualifiers.
How is that a nuanced response?
> These types of experiments prove to me that there is no real "reasoning" happening and "reasoning/thinking"
I'd suggest that five presumably reasoning, thinking humans would also vary in their responses to the exact same prompt.
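To make that comparison concrete, here is a rough sketch of how one might score agreement across a handful of responses to the same four-label prompt, whether they come from model runs or from people. The function name `modal_agreement` and the sample responses are made up purely for illustration; they are not data from the experiment being discussed.

```python
from collections import Counter

# The four labels the quoted prompt allows.
VALID_LABELS = {"True", "Mostly True", "Misleading", "False"}

def modal_agreement(labels):
    """Return (most common label, fraction of responses that chose it)."""
    if not labels:
        raise ValueError("need at least one response")
    invalid = [label for label in labels if label not in VALID_LABELS]
    if invalid:
        raise ValueError(f"labels outside the allowed set: {invalid}")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Hypothetical responses, invented for this example (not real results).
responses = ["Misleading", "Mostly True", "Misleading", "False", "Misleading"]
label, share = modal_agreement(responses)
print(label, share)
```

Running the same scoring on five model outputs and on five human answers would at least put both kinds of variation on the same scale before drawing conclusions about "reasoning".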