by vorticalbox 8 hours ago

by robertkarl 5 hours ago

https://arxiv.org/abs/2606.00206

In this paper they nerf an LLM's ability to emit waffling thinking tokens like "wait", "but", and "alternatively", and the models (old, small ones in the paper) terminate reasoning faster and perform better. I bet Anthropic is tuning this on their backend.
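
One way to picture the suppression idea described above is as a penalty applied to the logits of the hesitation tokens before the next token is chosen. The following is a minimal sketch under that assumption, not the paper's implementation: the toy vocabulary, the logit values, and the helper names `suppress_tokens` and `greedy_pick` are all hypothetical.

```python
import math

# Words taken from the comment above; a real list would be model-specific.
SUPPRESSED_WORDS = {"wait", "but", "alternatively"}


def suppress_tokens(logits, vocab, suppressed=SUPPRESSED_WORDS, penalty=math.inf):
    """Return a copy of `logits` with `penalty` subtracted from suppressed tokens.

    With the default infinite penalty the tokens can never be picked;
    a finite penalty only makes them less likely.
    """
    adjusted = list(logits)
    for token_id, token in enumerate(vocab):
        if token.strip().lower() in suppressed:
            adjusted[token_id] -= penalty
    return adjusted


def greedy_pick(logits, vocab):
    """Pick the highest-scoring token (greedy decoding)."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return vocab[best]


# Toy vocabulary and scores, invented for illustration.
vocab = ["so", " Wait", "but", "therefore"]
logits = [1.0, 3.5, 2.0, 1.5]

print(repr(greedy_pick(logits, vocab)))                          # ' Wait'
print(repr(greedy_pick(suppress_tokens(logits, vocab), vocab)))  # 'therefore'
```

A finite `penalty` would discourage the tokens rather than forbid them, which may be closer to "tuning" than to a hard ban.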

by meatmanek 2 hours ago

This is super cool. Do you know if any of the inference backends (llama.cpp, vllm, etc) support this technique?

by giancarlostoro 6 hours ago

I usually have Claude build a plan first, then I put it into an XML file that it updates with phases. Usually we talk through some of those tasks, and once it's good and I like it, I have Claude implement the plan.

Another thing I tell Claude to do is to not guess but look at the documentation. It messes up a lot less. It might use some tokens reading docs, but at least it has a higher success rate code-wise.
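
For anyone wondering what a phased XML plan file like the one described above might look like, here is a small sketch that builds and updates one with Python's standard library. The element and attribute names (`plan`, `phase`, `task`, `status`) and the example tasks are assumptions made for illustration, not the commenter's actual schema.

```python
import xml.etree.ElementTree as ET


def build_plan(title, phases):
    """Build a <plan> element from (phase_name, [task, ...]) pairs."""
    plan = ET.Element("plan", {"title": title})
    for name, tasks in phases:
        phase = ET.SubElement(plan, "phase", {"name": name, "status": "pending"})
        for task in tasks:
            ET.SubElement(phase, "task").text = task
    return plan


def mark_phase_done(plan, name):
    """Flip the status of the named phase to 'done'."""
    for phase in plan.findall("phase"):
        if phase.get("name") == name:
            phase.set("status", "done")


# Hypothetical plan contents.
plan = build_plan(
    "Add CSV export",
    [
        ("design", ["Agree on columns"]),
        ("implement", ["Write exporter", "Add tests"]),
    ],
)
mark_phase_done(plan, "design")

ET.indent(plan)  # pretty-printing helper, available in Python 3.9+
print(ET.tostring(plan, encoding="unicode"))
```

The same structure could just as easily be written as markdown sections with checkboxes; the point is only that each phase carries a status that can be updated as work proceeds.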

by xstas1 6 hours ago

XML??

by giancarlostoro 6 hours ago

Apparently, because of how Claude is trained (even the system-level prompts go through as XML), it works better with XML "prompting", so I figured I could have it write plans in XML. Maybe I need to update my ticketing tool to output XML by default.

https://www.reddit.com/r/ClaudeAI/comments/1psxuv7/anthropic...

by saltsucker 5 hours ago

Comments later in that thread say markdown works just as well and that it’s more important to organize your plan into sections.

Also, just think about it: why would a model trained on the world’s corpus of text (which isn't formatted in XML) perform better with XML? It would be a better study if that post tested markdown, org, XML, JSON, etc. 10 times each to see if there is a difference.
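
The comparison proposed above could be set up roughly like this: render the same plan in each format, run every rendering the same number of times, and compare pass rates. This is only a sketch. It covers three of the formats mentioned (org is left out), and `run_trial` is a hypothetical caller-supplied function that would send the prompt to a model and return True if the result passes whatever check you care about.

```python
import json
from xml.sax.saxutils import escape, quoteattr


def render_markdown(sections):
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)


def render_xml(sections):
    parts = [
        f"<section title={quoteattr(title)}>{escape(body)}</section>"
        for title, body in sections
    ]
    return "<plan>\n" + "\n".join(parts) + "\n</plan>"


def render_json(sections):
    return json.dumps([{"title": t, "body": b} for t, b in sections], indent=2)


RENDERERS = {"markdown": render_markdown, "xml": render_xml, "json": render_json}


def compare_formats(sections, run_trial, trials=10):
    """Return the pass rate per format over `trials` runs of `run_trial`."""
    rates = {}
    for name, render in RENDERERS.items():
        prompt = render(sections)
        passed = sum(1 for _ in range(trials) if run_trial(prompt))
        rates[name] = passed / trials
    return rates


# Stub trial that always passes, standing in for a real model call.
sections = [("Goal", "Add CSV export"), ("Steps", "Write exporter, add tests")]
print(compare_formats(sections, run_trial=lambda prompt: True))
```

Ten trials per format is a small sample, so differences of one or two passes would not say much either way.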

by swingboy 3 hours ago

Anthropic’s best practices still include the use of XML: https://platform.claude.com/docs/en/build-with-claude/prompt...

by adastra22 4 hours ago

A year or so ago XML worked more reliably for long-lived prompt instructions. Now it is cargo culting.

by root-parent 5 hours ago

XML stands for Xtra ML....

by noworriesnate 4 hours ago

I'd like to switch to a sales career--can you give me any pointers?

by mikeocool 7 hours ago

Seriously. Whenever I read the thinking output I get mad and turn down effort to medium or low.

Just output the code and we’ll work through it!

I feel similarly about having codex review claude’s plans. I don’t think I’ve ever seen it catch a major issue. It just points out things that would have inevitably been addressed during implementation anyway.

by SubiculumCode 3 hours ago

A lot of times this is how humans work: just start 'putting words on paper', 'think by doing', etc. Sometimes it's more efficient to see why something won't work after writing a bit of it, and sometimes you get lucky and it works right off the bat.

by drob518 3 hours ago

Qwen is notorious for this, too. It’ll sometimes spin in a long loop of “But wait…” paragraphs.

by thinkingtoilet 7 hours ago

I've been having success with Opus but you REALLY have to tame it. Long prompts that list what files to look at, relationships between entities, etc... I went from regularly hitting my daily limit to almost never hitting it. Oh, and also I was being lazy with small changes and stopping that helped a lot too. As you said, it gets in these loops where it's just churning and if you don't stop it it can go on for way too long.

by epolanski 8 hours ago

Fable was 20 times worse on that.

It's clear it was the vibe coding model: like no other model before, it fully turned you into its assistant instead of the other way around.

by RyanHamilton 7 hours ago

Could it be that these firms are optimizing for two things: a) better performance, and b) gathering data from you to further improve performance later? I've also found the huge amount of planning rather than iteration frustrating. I've felt like I'm teaching a junior!

by epolanski 7 hours ago

I think they simply optimize around E2E benchmarks. None of those benchmarks is designed around multi-turn assistance to the user; they go from a prompt straight to the final solution.

by celrod 1 hour ago

Exactly. How can "we" develop and encourage benchmarks for multi-turn user assistance? That is what I want. I feel like the models and harnesses push much too hard against this workflow -- that they push you towards letting go and vibe coding, with only your discipline (and desire for a quality and maintainable product) holding it back.

by happyPersonR 5 hours ago

more thinking == more tokens === more money LOLL

by overfeed 3 hours ago

Is there a cost benchmark out there? I wonder how frontier models are doing over time on cost per problem solved.
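
As a rough illustration of the metric being asked about, cost per problem solved is just total token spend divided by the number of solved problems. The function name, the per-million-token pricing parameters, and every number below are made up for the example and do not reflect any real model's prices or results.

```python
def cost_per_solved(input_tokens, output_tokens, price_in_per_mtok,
                    price_out_per_mtok, solved):
    """Total token cost divided by problems solved (infinite if none solved)."""
    if solved == 0:
        return float("inf")
    total_cost = (
        input_tokens / 1_000_000 * price_in_per_mtok
        + output_tokens / 1_000_000 * price_out_per_mtok
    )
    return total_cost / solved


# Invented benchmark run: 2M input tokens at $3/Mtok, 500k output tokens
# at $15/Mtok, 27 problems solved. Total cost is 6.00 + 7.50 = 13.50.
print(cost_per_solved(2_000_000, 500_000, 3.0, 15.0, solved=27))  # 0.5
```

Tracking this number across model releases would show whether extra thinking tokens are paying for themselves in solved problems.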

by drob518 3 hours ago

I think they are optimizing for one-shot performance because that will drive usage. They can’t afford to look bad in the benchmarks. And if that means consuming an order of magnitude more tokens, well, that’s good for business, too.
