> tool call pruning breaks cache and people will tell you this is horrible and expensive
> except i looked at some anthropic data and real user behavior ends up with better cache hits and 30% less spend
> even this is needs to be analyzed further, it's just not simple
> for openai data it's inverted! cache hit ratio is actually better [sic: I think he meant worse based on the screenshot] with tool call pruning turned on
> but the net $ saved is only 5%
> kimi is a funny one - it has better cache hits with pruning on...but is also more expensive!
There was also another thread recently where he discussed that pruning improves user experience (models are smarter with less context) but I can't find it.
This can also be disabled in the config: https://opencode.ai/docs/config/#compaction
Ah, reminds me of good old "There are only 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors."
You quip, but LLM KV caching (from the harness side) is quite easy: You get a cache hit on stable prompt prefixes, period. That means you want to keep the prefix stable, and only append at the end of the conversation. Made up example: Don't put the git branch name into the system prompt part (that comes first), as whenever the branch name changes, that'd trigger a cache invalidation of the entire prompt.
Getting this right requires some care to not by accident modify the prefix, basically, and some design on communicating the things that can change (user configuration, working dir, git information, ...).
On the sheer performance it’s comparable to Opus ?
They explain some of the the reasons why they have a better solution and why they are very opinionated
>Automatic prefix caching activates only when the exact byte prefix of the previous request matches. Most agent loops reorder, rewrite, or inject fresh timestamps each turn — cache hit rate in practice: <20%.
So they optimize on this plus other techniques to improve cache hits, making it cheaper.
Can you share the bridge. DeepSeek v4 is awesome paired with claude-code or opencode. I found that claude code costs me less than opencode and I am presuming this is due to a better engineered harness.
I only used it for a few hours to play around with stuff before the quota issue was fixed and I could resume using GPT models, and the bridge was coded by DeepSeek-V4-Flash-IQ2XXS + DwarfStar4 locally, I take no responsibility for what might happen with your computer or you, during usage or just reading the code.
Edit: heh, like don't look at line 117 for example where seemingly it likes to handle misspellings in the .env file which totally wasn't my fault for typo'ing the API key in that file... I'm sure there are tons of sharp edges and dumb stuff in there.
Obviously, if you do deal with any sort of secrets, then using local LLMs over OpenAI, Anthropic, DeepSeek or whoever is obviously preferred, and in the case of personal data of users, probably a requirement.
Getting the source code of facebook or instagram doesn't mean you could compete with them.
I work for a company that has built relationship with event organizers over the past 10 years. The code I maintain could be written from scratch in maybe 2-3 months even though it was built over the past 10 years but besides that you have frontend / DB / hardware / logistics etc
Still, “Getting the source code of facebook or instagram doesn't mean you could compete with them.” I think to giants like that, having access to their source code could open up some very interesting loop holes for manipulating the ranking algorithms, or even security vulnerabilities.
Honestly I'd love to love the US again, but basically after Obama things have just gone down and down and no soul will trust the US again in the next generation or two.
Same with codex? codex-rs at least, is a TUI as well, it does run a "app-server" in the background, that the TUI actually interacts with, but that's just an implementation detail. Also makes it easy to hook in your own programs to fire of codex "headless" sessions even without the TUI.