> Agentic search avoids those failure modes. There's no embedding pipeline or centralized index to maintain as thousands of engineers commit new code. Each developer's instance works from the live codebase.
The frame of "the way a software engineer would" and the conclusion seem at odds. I'd love to be schooled otherwise?
I use autocomplete/LSPs all the time and they're useful. That's an index? Why wouldn't Claude be able to use one? Also a "software engineer" remembers the codebase - that's definitely a RAG. I have a lot of muscle memory to find the file I need through an auto-completed CMD+P.
It doesn't need to particularly be real-time across thousands of engineers -- just the branch I'm on.
It's rare that I'd navigate a codebase by first-principles traversal. That usually only happens in a new codebase, and in those cases it's definitely not what I'd call an optimal experience.
> Claude Code is running in production across multi-million-line monorepos, decades-old legacy systems, distributed architectures spanning dozens of repositories (…)
So it is optimized for the general case, using robust tooling that works everywhere, especially when large & messy.
That being said, your remark is right, and for well-organised smaller repos there's better tooling it can and should use. But I think it does, at least Codex does in my case, so I guess Claude does too. For example, Codex uses `go doc` first before doing greps.
Anthropic's target should be a codebase designed for agentic comprehension from the first commit. Here the codebase adapts to the agent. You can enforce conventions, structured metadata, semantic indexing, explicit dependency graphs. Whatever makes the agent's job trivial rather than heroic.
Where "robust tooling" is "grep with various regexes while completely missing the big picture even in small codebases"
Simple: it eats up to 35% of the five-hour usage limit on the first prompt, even on small projects, and then there's a 5-minute timeout for you to respond quickly or the caches go bust and you'll pay another 12% to 15% on the next prompt.
Are LLMs already that reliable in their output, even with all the guardrails around them?
I don't think so. Hence it's snake oil, just like dozens of harnesses.
It might behave differently than specified, and a human is required to validate every output carefully, or else.
Well, what is your definition of "super reliable in the output", and is it a quantifiable/measurable target or just a feeling?
Is it "more than humans", "more than senior developers", "almost perfect", "perfect"?
> It might behave differently than specified, and a human is required to validate every output carefully, or else.
Sure, just like meatbag developers. All the security flaws AI finds today were introduced years or decades ago by humans and hadn't been found by humans (that we know of) in ages.
Between ten thousand runs of:
```
#include <stdio.h>
int main(void) {
    const int MAX_COUNT = 10000;
    printf("I'll count up to %d\n", MAX_COUNT);
    for (int i = 1; i <= MAX_COUNT; i++) printf("I'm now counting %d\n", i);
}
```
And ten thousand runs of the following prompt:
```
You'll count to 10,000. At the start say "I'll count up to 10,000" and then for each number say "I'm now counting <number>" and do not say anything else. Do not miss numbers in between.
```
Which one is going to produce 100% correct results across all 10,000 runs?
Now don't give me "these are different tools". We all know. I'm talking about reliability and predictability.
Claude's initial approach was really poor. One has to wonder how many times Claude Code has to be modified/reviewed for improvement, or whether it's possible at all to make good code with it.
Edited: Generalization: Claude can fix a localized, identifiable poor decision (e.g., "only reading first 40 lines") because the fault is discrete and traceable to one piece of code.
But real software quality problems often arise from many small, individually reasonable decisions that collectively produce bad outcomes. No single one is obviously "the fault." In that scenario, a tool that generates low-quality building blocks piecemeal may never converge on good code, because each piece seems fine in isolation.
Yes, to a (real) fault. In fewer than one in fifty of the cases where it ignores an instruction or piece of data in a file had it actually seen that instruction or data before ignoring it. The other times, it's done this sampling nonsense.
Results are night and day using the 1M token models and reading the full files.
I tried defining CLAUDE.md (or AGENTS.md), skills, plugins, but I'm not getting the effectiveness others claim. The LSP plugin, for example: CC doesn't use the LSP's symbol renaming and edits files one by one, slowly, or it doesn't invoke the skill even when I explicitly ask it to remember to invoke it whenever the prompt contains a specific cue.
Am I using it wrong? Is there a robust example harness I can copy?
"If A, do X. Do B,C,D. Do A" - and it just never uses X because "it forgot".
You just can't trust that the time you spend building rules will actually pay off; in fact, you can trust that it will fail you sooner or later.
RAG, harnesses, skills... all of it was supposed to fix this, but in reality it never has.
I stopped using `/init` and having CLAUDE|AGENTS.md files that explained the codebase. The only thing I kept was how it should explore the codebase and use `git log` when researching, which is probably redundant too. I can't figure it out either.
The codebase I work on is roughly 100k LOC so idk if it is considered large. Personally it's the largest repo I have worked on.
- runs the failing test | grep "x|failing" | tail 10
- runs the test again to get the why-it's-failing message | tail 10
- runs the test again because tail 10 cut off the message
every fucking time.
I have a skill telling it not to do that: save the output of whatever test you run to a file, then read from the file using whatever commands you want. It ignores the skill. It's maddening. It's as if, puts on tinfoil hat, it's designed to waste your tokens while eventually accomplishing its task.
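The behavior the skill asks for is genuinely trivial. A minimal sketch of the idea in Node (the test command and file paths are hypothetical):

```
import { execSync } from "node:child_process";
import { readFileSync, writeFileSync } from "node:fs";

// Run the suite exactly once, capturing stdout and stderr together.
let output: string;
try {
  output = execSync("npm test 2>&1", { encoding: "utf8" });
} catch (err: any) {
  // A failing suite exits non-zero; the captured text is still on stdout.
  output = err.stdout ?? "";
}
writeFileSync("/tmp/test-output.log", output);

// Search the saved log as often as needed, without re-running anything.
const failures = readFileSync("/tmp/test-output.log", "utf8")
  .split("\n")
  .filter((line) => /fail/i.test(line));
console.log(failures.join("\n"));
```

One run, then grep the file as many times as you like, instead of three runs with tail 10.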
Although if you've ever used Claude's search tool, you'll be unsurprised that the team knows nothing about indexing.
How a company, whose primary product is text-based chat, doesn't allow users to easily perform text search on said chat is beyond comprehension.
What a strange comment for them to make. Why wouldn't I expect CC to work well with those languages? What languages would I associate it with? Python and JavaScript?
I still say if this happens to you with AI tooling, that's both a failure on you and your org for giving a developer prod credentials that could nuke production resources. I don't think I've worked in a place that gave me this level of blind access.
Even now my prompt says the AI must verify the path of the files it intends to edit, and edit only one file at a time, and only after permission. I stop it from ignoring those rules at least once a day.
Alas, humans suck at constant vigilance, we're built to avoid it whenever possible, so a "reverse centaur" future of "do what the AI says but only if you see it's good" is going to suck.
Indeed it’s a good practice to use roles where supported (AWS has them) and explicitly switch when needed
If you use SSO and have an AWS config that Claude is allowed to see to get the correct role in the first place, it will just pick the role and plough on anyway.
I think the model we've got now is wrong: the harnesses should be OS-level sandboxed, and the agents should be running in harness-managed sandboxes.
Next: I've used AI agents like crazy; we even have linked MCP servers that let them query the dev database. I haven't seen one try deleting everything a single time. I haven't seen any agent try to do anything destructive. Ever. Perhaps it's just reflecting an outrageously bad engineer and nothing else.
the fishing: 1) install the official `skill-creator`; 2) use that with the above link to create `claude-md-improver`; 3) improve the skill by tasking Claude with researching the topic of `progressive-disclosure` in the official docs; 4) point the new skill at your CLAUDE.md file and accept the changes
A stock Unreal Engine project is several hundred gigs, consists of multiple solutions and multiple languages, and is something I would personally classify as large.
Without some kind of indexing it's very awkward to work with and very slow. To work with LLMs and Unreal projects we create a local index; that index file alone is 46GB.
Without distributed compilers and caches it can take multiple hours to compile the main solution per platform (usually PC, Linux, Xbox, PlayStation, Switch, and sometimes mobile).
So the codebase easily fits on local storage, so long as you don't count assets (those are several TB), and even more so for source assets (10s of TB), and that's per stream per large project.
Anyways, point is I disagree and think Unreal Engine is an example of a large codebase that fits locally.
Listen, I am a Rails developer, so a monolith doesn't scare me, and yet, there are limits. Why does it need to be a multi-terabyte monolith?
People usually mention git-lfs at this point, but that is always annoying to use in practice. There are also shallow clones and sparse checkouts, but those only mitigate the problem, since there is no way around cloning at least one revision completely with git.
You need a code dependency graph: https://github.com/roboticforce/remembrallmcp Ask "what breaks if I change this?"
Saves 98% of token usage. Saves 95% of tool calls.
Runs as an MCP server, works for 8 languages.
It just works, you need to try it.
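Whatever the linked tool does internally, the core idea is small: invert the import graph and walk it. A toy sketch with hypothetical file names:

```
// Toy import graph: each file maps to the files it imports.
const imports: Record<string, string[]> = {
  "api.ts": ["db.ts", "auth.ts"],
  "auth.ts": ["db.ts"],
  "ui.ts": ["api.ts"],
};

// "What breaks if I change target?" = everything that transitively imports it.
function impactedBy(target: string): Set<string> {
  const impacted = new Set<string>();
  let changed = true;
  while (changed) {
    changed = false;
    for (const [file, deps] of Object.entries(imports)) {
      if (impacted.has(file)) continue;
      if (deps.some((d) => d === target || impacted.has(d))) {
        impacted.add(file);
        changed = true;
      }
    }
  }
  return impacted;
}

console.log(impactedBy("db.ts")); // Set { "api.ts", "auth.ts", "ui.ts" }
```

Answering that question from a prebuilt graph instead of re-grepping the tree is where the claimed savings would come from.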
So, if I've read this post correctly, that means that CC is navigating my codebase today by sending lots of it up to a model, and building an understanding. Is that correct? Did I misunderstand it?
I kinda suspected there was more local inference going on somehow -- partly because the iteration times are fairly fast.
It turns out that, for a machine, find and grep are all that's required.
The article really does not align with the current sentiment. Everyone with a choice has mostly moved on to codex (ofc in this world all it takes is a model update/harness update to turn things around).
CC is great at a lot of things, but it repeatedly misses reading crucial parts of the code base, hallucinates about work that was done, and has a bunch of other issues.
They want you to feel like you're missing out. They want you to switch. Being boring is far more productive. Pin your versions. Stick to stable releases and avoid the nightlies.
The significant noise created by the 4.6-to-4.7 Opus transition has caused some to interpret it as signal. Excluding certain genuine bugs, the talk of quality falling dramatically was just noise; influencers doing influencing turned it into "signal". The reality was that, with strong planning and spec-driven development, the impact ranged from manageable to non-existent.
The vast majority of the people I know and work with have not switched off CC or their Max sub.
But I may not have paid enough to get the full, real experience with codex.
At work it's CC or sometimes codex; personally I don't see much difference at all, and most normies will notice none. The cultists have their opinions.
What bleeding? Anthropic wants as much of that "bleeding" as possible. The interaction data gathered from genuine human CC subscription usage of their models goes directly into their RL training; it's invaluable, and they are more than happy to lose money on the inference to get it. That data is what xAI was recently willing to pay $10b to Cursor to get.
They want you to use Claude Code. They hate other UI surfaces like OpenCode etc purely because they lose control over that data, so they're subsidizing the inference without getting what they actually want, the data (they still get some of it of course, but it's much less ergonomic for them. Those tools often abstract away the subagent calls, for example). OpenCode can collect that data themselves, so by allowing subscription there, Anthropic sees itself as subsidizing another org getting that data. Hard no.
And tools like OpenClaw are useless because they're mechanical and don't represent actual users interacting with the service - again, subsidizing but not getting the reward.
It's all very simple once you understand their motivations.
Ha!
As the product they deliver is greenfield and in the newest of domain spaces, there is a serious halo-effect to consider.
On a side note, at a company I know the devs are split between:

Copilot inside Visual Studio
- suspiciously cheap Opus quotas there
+ they read their code

pi coding agent
+ control all the things my way
- each their own way

Claude Code
+ it's magic
- you mean it did that to my prompt!
Btw the guy in charge of that stuff for Anthropic, Jack Clark, is the same guy who said GPT-2 was too dangerous to release. LMAO. That model could barely string a sentence together.
You are deep in an information bubble, mostly driven by hype-train influencers with magpie attention spans.
Are there any much more detailed walkthroughs of how it works, how it decides which tools and greps to use, and what the conversations actually look like?
In the UI you see just enough to know it’s doing something but you don’t really see the jumps it’s making offscreen.
I mean: if there were something you could add to the prompt to consistently increase performance, why isn't it in the system prompt already?
If it's all about clarifying a couple of local idiosyncrasies, shouldn't it be able to quickly get them by looking through the repo?
Does anyone have an example of a CLAUDE.md that really makes a difference for them?
In general, this article would have profited massively from examples of good applications of those patterns.
A specific example in another project is the testing/verification procedure. It's a wasm/WebGPU project and the test harness is fairly complex. There are scripts to handle it, but by default Claude will churn for a while trying to figure it out and sometimes just give up. The CLAUDE.md entry definitely saves a lot of tokens and speeds things up.
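For what it's worth, a sketch of the kind of entry that does this (script and path names are made up for illustration):

```
## Testing
- Never invoke the WebGPU tests directly; always use `scripts/run-tests.sh`.
- The harness writes its full output to `test-results/last-run.log`; read
  that file instead of re-running the suite.
- If the harness hangs for more than two minutes, kill it and check
  `test-results/` for a partial log before retrying.
```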
I think about this a lot. So far I think we are mostly just being gaslit. The idea that we can influence the AI to be better with a few encouraging words and role-playing seems absurd. Maybe there is some element of randomness introduced there or something. All these extra MD files don't seem to do nearly as much for results as people believe they do.
† swe with practical experience, a code wrangler if you will
A post like this should be providing people with some reassurance about Claude's ability to understand code at a large scale. It's mostly fluff.
Edit: so I did some googling to dig around for thoughts on LSP performance and integration. The author of Bun has a tweet saying that they're a big drag on performance for no real gain, and virtually all of the replies agree. Anyone else have any experience/thoughts?
My criticism is fair. This is not an engineering blog post, it's purely marketing.
try this instead: https://anthropic.com/engineering
replying to the question introduced with your edit at the root: MCP servers tend to inject too many tokens into the context that might not be relevant for the task at hand. ex: Sentry's MCP is useful if you are collecting context for a bug in production, but is hardly useful later when fixing the bug; at that point you'd probably want treesitter. Or, if you are working on a new feature, you might want to pull details from GitHub issues or Jira tickets.
the consensus seems to land on making the right tools available to the agent, and letting it pick which ones to use for the task. This typically means CLI tools like git, gh, linters, a CLI for your cloud/hosting provider, etc.
that is where skills come in: they bootstrap some context about a task the operator wants to complete. Skills use progressive disclosure: each Read() adds more context, but the agent controls what to load up and what to ignore. Skills can also ship scripts that facilitate actions relevant to the task.
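a rough sketch of that loading pattern, with hypothetical file names:

```
import { readFileSync } from "node:fs";

// Level 1: only the skill's short index is in context by default.
const skillIndex = readFileSync("skills/deploy/SKILL.md", "utf8");

// Level 2: detail files are read only if judged relevant to the task,
// so their token cost is paid on demand rather than up front.
function loadIfRelevant(task: string, topic: string, path: string): string {
  return task.includes(topic) ? readFileSync(path, "utf8") : "";
}

const task = "roll back the canary deploy";
const context = [
  skillIndex,
  loadIfRelevant(task, "rollback", "skills/deploy/rollback.md"),
  loadIfRelevant(task, "canary", "skills/deploy/canary.md"),
].join("\n");
```

in a real agent the relevance check is the model deciding to call Read(), not a string match, but the economics are the same.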
Meanwhile we are still waiting for these statements to come true:
https://eu.36kr.com/en/p/3648851352018565
https://www.businessinsider.com/anthropic-ceo-ai-90-percent-...
https://www.reddit.com/r/Anthropic/comments/1nemhxb/futurism...
https://medium.com/@coders.stop/dario-amodei-said-90-of-code...
https://www.youtube.com/shorts/0j1HqEEDThc
Accountability, anyone?
"AI will take over almost all the work of software engineers (SWEs) end - to - end in just 6 - 12 months!"
What you describe is >50% of the job of SWEs, even when they write all code by hand.
Are you saying that "for many start-ups" this isn't done by SWEs but by some other career type, or are you implying that it's just the code written (and first review) that's replaced by AI?
He did say a few months later, in an interview in India, that AI will eventually take over most SWE tasks.
---
My statement on startups is largely about automating coding by SWEs. My startup also uses AI to automate part of technical specifications and code review but I am not sure how widespread that is.
```
const sqlStatement = (!params.mostRecentOnly)
  ? {giant SQL statement}
  : {identical giant SQL statement + 'LIMIT 1' at the end}
```
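The version it never reaches for is a one-liner: build the query once and append the variant. A sketch, with a stub standing in for the giant SQL:

```
// One source of truth for the query body (stub standing in for the real SQL).
const baseQuery = "SELECT * FROM events ORDER BY created_at DESC";

function buildQuery(params: { mostRecentOnly: boolean }): string {
  return params.mostRecentOnly ? `${baseQuery} LIMIT 1` : baseQuery;
}
```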
AI never met a problem that can't be solved with more code. Need some data in a slightly different structure? Don't try to modify an existing endpoint, just build a new one! Need to access a field that's buried in a JSON object in the database? Just create a new column, but don't bother removing the field from the JSON object. The more sources of truth, the merrier! When it comes time to update, just write more code to update the field everywhere it lives!
Factor out the extra sources of truth you say? Good luck scanning the most verbose front-end you've ever seen to make sure nothing is looking at the source you want to remove. In the beginning of big projects, you have to be absolutely ruthless about keeping complexity down so it doesn't get out of control later. AI is terrible at keeping complexity down.
My goal is to halve the lines of code from what the vendor turned over to us. One baby step at a time.
UI components always presentational-only, logic abstracted modularly, etc...
The number of times I've seen Claude say "this test was failing already so is ignored" when it _wasn't_, despite me telling it to never do that, makes me doubt it.
Then again I don't want to pay for AI to give me the coding style of the worst I ever worked with either.
which startups? I'm genuinely curious
Companies will definitely expect devs to ship more with the same headcount; oftentimes they either won't hire juniors to train them up or will straight up do layoffs, sometimes with AI just being a convenient scapegoat. We kind of can't ignore that either: sure, a lot of those companies will be shooting themselves in the foot, but livelihoods will be impacted a bunch.