This isn't a good test for any model, since LLMs can't do math (even though frontier models can sometimes simulate doing math correctly), which is why you'd always use a tool for this.
I was just about to try it until I saw this. If it's Siri the Silly, it won't even make up for the opportunity cost.
Real experience I've had:

"Text Carol bring me a glass of water please"

"I'm sorry, I don't see a 'Carol Bring' in your contacts"
