I've seen "thinking models" go off the rails trying to work out what to do when they're given ten items but asked for the best of nine.
[1]: The reality is that subtle internal inconsistencies in the prompt can really confuse the model. It's an entertaining bug in an AI pipeline, but it can end up costing you a ton of money.
Here's GPT-5's response:
me: which is the best of the following 9 items: 1. bear, 2. car. 3. plane, 4. house, 5. high-rise, 6. church, 7. boat, 8. tree, 9. truck, 10. duck.
GPT-5: Thought for 11s. House. It provides essential, lasting shelter and broad utility. Note: you listed 10 items, not 9.
Edited: I saw someone mention that the chat interface doesn't reproduce the results you get via the API.
1) In a pipeline you one-shot the result; chatting back and forth isn't an option, so the model is left on its own to figure out what to do to accomplish its goal (see the sketch after point 2).
2) The prompts had subtle inconsistencies. My example above is mostly an illustration; I don't remember the exact details. Unfortunately, it has been too long and my logs are gone, so I can't give real examples.
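
To make the one-shot point concrete, here's a minimal sketch of what such a pipeline call might look like, using the OpenAI Python SDK. The model name, prompt wording, and the trailing instruction are my assumptions for illustration, not what I actually ran:

    # Minimal sketch (not my original pipeline): a single one-shot call
    # with a deliberately inconsistent prompt (10 items, "best of 9").
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    prompt = (
        "Which is the best of the following 9 items: "
        "1. bear, 2. car, 3. plane, 4. house, 5. high-rise, "
        "6. church, 7. boat, 8. tree, 9. truck, 10. duck. "
        "Answer with a single item."  # assumed extra instruction
    )

    # One request, one answer: there is no follow-up turn where the model
    # could ask whether "9" or the list of 10 is authoritative.
    response = client.chat.completions.create(
        model="gpt-5",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
    )

    print(response.choices[0].message.content)

In the chat UI you'd just reply and clarify the mismatch; in a pipeline like this, the ambiguity either has to be caught programmatically or you pay for the model's attempt to reason its way around it.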