You can really see this in the recent video generation models that try to incorporate text-to-speech into the video. All the tokens flying around, all the video data, all the context of all human knowledge ever put into bytes ingested into them, and the systems still routinely fail (from what I can tell) to put the speech in the right mouth, even with explicit instructions and all the "common sense" that should make it obvious who is saying what.

There was some chatter yesterday on HN about the very strange capability frontier these models have, and this is one of the biggest examples I can think of: a model that, de novo, generates megabyte upon megabyte of really quite good video, yet is often unclear on the idea that a knock-knock joke does not start with the exact same person saying "Knock knock? Who's there?" in one utterance.

Author here. Yeah, after reading all the comments here I've changed my mind: this is related to the harness. The interesting interaction with the harness is that Claude effectively authorizes tool use in a non-intuitive way.

So "please deploy" or "tear it down" makes it overconfident about using destructive tools, as if the user had very explicitly authorized something. That makes this a worse bug in Claude Code than in a chat interface without tool calling, where it's usually just amusing to watch.
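Roughly what I mean, as a hypothetical sketch (the tool names and the `dispatch` helper are made up for illustration, not Claude Code's actual internals): a harness can gate destructive tools behind explicit out-of-band confirmation instead of letting the model infer authorization from conversational phrasing.

    # Hypothetical harness sketch: the model proposes tool calls, but the
    # harness, not the model's reading of the prompt, decides whether a
    # destructive action is authorized.

    DESTRUCTIVE_TOOLS = {"deploy", "delete_stack", "drop_database"}

    def dispatch(tool_name: str, args: dict, user_confirmed: bool) -> str:
        # `user_confirmed` must come from an explicit confirmation step
        # (e.g. a y/n prompt shown to the user), never from the model's
        # interpretation of phrases like "please deploy" or "tear it down".
        if tool_name in DESTRUCTIVE_TOOLS and not user_confirmed:
            return f"Refused: {tool_name} is destructive and was not explicitly confirmed."
        return run_tool(tool_name, args)

    def run_tool(tool_name: str, args: dict) -> str:
        # Stub for illustration; a real harness would execute the tool here.
        return f"Ran {tool_name} with {args}"

The point of the sketch is just that the authorization decision lives in the harness, so a polite imperative in the chat can't silently escalate into a destructive action.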

I honestly think LLMs perform worse and worse as context grows. Whenever possible, I disable it, because things just get wacky when the model has more than a little bit of context available.

So easy it should disqualify you if you fail this: knowing your own name.