Thank you. This is an excellent argument against using models with hidden COT tokens (Claude, Gemini, GPT-5). You could end up paying for a huge number of hidden reasoning tokens that aren't useful, and the issue is masked by the hidden COT summaries.
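
A rough sketch of how you might at least notice this from the billing side, assuming your provider reports a reasoning-token count alongside the total output tokens. The dict keys `reasoning_tokens` and `output_tokens`, the function names, and both thresholds are made up for illustration; real APIs name and nest these fields differently.

```python
def reasoning_share(reasoning_tokens: int, output_tokens: int) -> float:
    """Fraction of billed output tokens that went to hidden reasoning."""
    if output_tokens <= 0:
        return 0.0
    return reasoning_tokens / output_tokens


def flag_runaway(usage: dict, max_share: float = 0.9, min_tokens: int = 2000) -> bool:
    """Flag a response whose hidden reasoning is both large and dominant.

    `usage` is a plain dict with hypothetical keys; map your provider's
    usage object onto it before calling this.
    """
    reasoning = usage.get("reasoning_tokens", 0)
    output = usage.get("output_tokens", 0)
    return reasoning >= min_tokens and reasoning_share(reasoning, output) >= max_share
```

For example, `flag_runaway({"reasoning_tokens": 4800, "output_tokens": 5000})` would flag the call, while a short answer with a few hundred reasoning tokens would not.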

by cout 149 days ago

Can you elaborate on what it means for a model to "lose its mind"? I tried what you suggested and the response seemed reasonable-ish, for an unreasonable question.

by withinboredom 149 days ago

The COT looks something like: “user has provided a breakdown with each category having ten items, but then says the breakdown contains 5 items each. I see some have 5 and some have 10.” It then keeps trying to work out which one is right, whether it is a mistake, how it should handle it, etc. It can literally spend thousands of tokens on this.

by Ghoelian 149 days ago

Unfortunately Claude Code seems a little too "smart" for that one. Its response started with "I notice you listed 10 frameworks, not 9."

by withinboredom 149 days ago

You usually hit the pathological case when you have your own system prompt (i.e. over an API) forcing the model to one-shot an action. The people who write the system prompts you use in chat have things in place to detect "testing responses" like this one and deal with them quickly.

by commakozzi 148 days ago

I've been following the progress of LLMs since the first public release of GPT-3.5, and every single time someone posts one of these tests I check the AIs I'm using to see if it's repeatable. It NEVER is. Granted, I'm not using the API; I'm using the chat interface, which potentially has different system prompting.

Here's GPT-5's response:

me: which is the best of the following 9 items: 1. bear, 2. car. 3. plane, 4. house, 5. high-rise, 6. church, 7. boat, 8. tree, 9. truck, 10. duck.

GPT-5: Thought for 11s. House. It provides essential, lasting shelter and broad utility. Note: you listed 10 items, not 9.

Edit: I saw someone mention that the chat interface doesn't reproduce the results you get via the API.

by withinboredom 148 days ago

I've only seen this happen on API calls where:

1) you need to one-shot the result and chatting isn't an option, so the model is trying to figure out what to do to accomplish its goal;

2) the input contains subtle inconsistencies.

My example was mostly an illustration; I don't remember the exact details. Unfortunately, it has been too long and my logs are gone, so I can't give real examples.
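
If you control the prompt on the API side, one cheap mitigation is to catch this kind of inconsistency before the model ever sees it. Below is a minimal sketch that only handles the "following N items" plus numbered-list shape used in the example in this thread; the function name and the regexes are illustrative assumptions, not a general solution.

```python
import re


def count_mismatch(prompt: str):
    """Return (declared, actual) when the prompt announces N items but
    numbers a different amount; return None when they agree or when no
    "following N items" phrase is found."""
    declared = re.search(r"following\s+(\d+)\s+items", prompt)
    if declared is None:
        return None
    n_declared = int(declared.group(1))
    # Count list markers such as "1. " or "10. " (digits, a dot, whitespace).
    n_actual = len(re.findall(r"(?<!\d)\d+\.\s", prompt))
    return None if n_actual == n_declared else (n_declared, n_actual)
```

Run on the prompt quoted earlier in the thread ("the following 9 items" followed by entries numbered 1 through 10), this reports a mismatch of 9 declared versus 10 listed, so you could fix the input or reject the request rather than let the model burn reasoning tokens on it.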