In this paper they nerf an LLMs ability to emit waffling thinking tokens like "wait", "but", "alternatively", and the models (they're old, small models in the paper) terminate reasoning faster and perform better. I bet Anthropic is tuning this on their backend.
Another thing I tell Claude to do is to not guess, but look at documentation, it messes up a lot less, might use some tokens reading docs, but at least it has a higher success rate code wise.
https://www.reddit.com/r/ClaudeAI/comments/1psxuv7/anthropic...
Also just think about it, why would a model trained on the world’s corpus of text (that isnt formatted in xml) perform better with XML? It would be a better study if that post tested markdown, org, xml, json, etc. 10 times to see if their is a difference
Just output the code and we’ll work through it!
I feel similarly about having codex review claude’s plans. I don’t think I’ve ever seen it catch a major issue. It just points out things that would have inevitably been addressed during implementation anyway.
It's clear it was the vibe coding model, as like no other model before, fully turned you into his assistant instead of the other way around.