Sure, for 4/5 interactions, then it will ignore those completely :)
Try it for yourself: add an instruction to CLAUDE.md to always refer to you as Mr. bcherny and it will stop doing so very soon. Coincidentally, at that point it also loses track of all the other instructions.
(Similar guidance goes for writing tools & whatnot - give the LLM exactly and only what it needs back from a tool, don’t try to make it act like a deterministic program. Whether or not they’re capital-I intelligent, they’re pretty fucking stupid.)
I also use GitHub Copilot, which is just $10/mo. I have to use the official Copilot though; if I try to 'hack it' to work in Claude Code it burns through all the credits too fast.
I am having a LOT of great luck using Minimax M2 in Claude Code. It's very cheap and it works so well it's close to Sonnet in Claude Code. I use a tool called cc-switch to swap different models in and out of Claude Code.
(I just learned ChatGPT 5.2 Pro is $168/1mtok. Insanity.)
If Claude makes a yawn or similar, I know it’s parsed the files. It’s not been doing so the last week or so, except for once out of five times last night.
“You’re absolutely right! I see here you don’t want me to break every coding convention you have specified for me!”
I've used it pretty extensively over the year and never had issues with this.
If you hit autocompact during a chat, it's already too long. You should've exported the relevant bits to a markdown file and reset context already.
I think you may be observing context rot? How many back and forths are you into when you notice this?
The real semi-productive workflow is "write plans in markdown -> new chat -> implement a few things -> update plans -> new chat", etc.
I'm sure there are workarounds such as resetting the context, but the point is that good UX would mean such tricks are not needed.
Some things I found from my own interactions across multiple models (in addition to above):
- It's basically all about the importance of (3). You need a feedback loop (we all do), and the best way is for it to change things and see the effects (ideally also against a good baseline, like a test suite where it can roughly gauge how close or far it is from the goal). For assembly, a debugger/tracer works great (use batch mode or scripts, as models/tooling often choke on such interactive TUI I/O).
- If it keeps missing the mark, tell it to decorate the code with a file log recording all the info it needs to understand what's happening (see the sketch after this list). Its analysis of such logs normally zeroes in on the solution pretty quickly, especially for complex tasks.
- If it's really struggling, tell it to sketch out a full plan in pseudocode, explain why that will work, and analyze it for any gotchas. Then have it analyze the differences between the current implementation and the ideal it just worked out. This often helps get it unblocked.
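As a sketch of the file-log idea from the second bullet (Python; the decorator, log path, and output format are made up, not from any particular tool):
```
import functools
import json
import time

LOG_PATH = "debug_trace.log"  # hypothetical path; point the model at this file afterwards

def trace(fn):
    """Append each call's arguments, result, and timing to a log the LLM can analyze later."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        with open(LOG_PATH, "a") as f:
            f.write(json.dumps({
                "fn": fn.__name__,
                "args": repr(args),
                "kwargs": repr(kwargs),
                "result": repr(result),
                "ms": round((time.time() - start) * 1000, 2),
            }) + "\n")
        return result
    return wrapper
```
Then point the model at the log file and ask it to reconcile what it expected with what the trace actually shows.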
I couldn't agree more. And using Plan mode was a major breakthrough for me. Speaking of Plan Mode...
I was previously using it repeatedly in sessions (and was getting great results). The most recent major release introduced this bug where it keeps referring back to the first plan you made in a session even when you're planning something else (https://github.com/anthropics/claude-code/issues/12505).
I find this bug incredibly confusing. Am I using Plan Mode in a really strange way? Because for me this is a showstopper bug–my core workflow is broken. I assume I'm using Claude Code abnormally otherwise this bug would be a bigger issue.
So you either need to be very explicit about starting a NEW plan if you want to do more than one plan in a session, or close and start a new session between plans.
Hopefully this new feature will get less buggy. Previously the plan was only in context and not written to disk.
For example making a computer use agent… Made the plan, implementation was good, now I want to add a new tool for the agent, but I want to discuss best way to implement this tool first.
Clearing context means Claude forgets everything about what was just built.
Asking to discuss this new tool in plan mode makes Claude rewrite entire spec for some reason.
As workaround, I tell Claude “looks good, delete the plan” before doing anything. I liked the old way where once you exit plan mode the plan is done, and next plan mode is new plan with existing context.
Now it has a file it can refer to (call it "memory" to be fancy) without having to keep everything in context. The plan in the file survives over autocompact a lot better and it can just copy it to the project directory without rewriting it from memory.
I compared both with the same set of prompts, and Claude Code came across as a senior expert developer, while Jules... well, I don't know who would be that bad ;-)
Anyway, I also wanted to have persistent information, so I don't have to feed Claude Code the same stuff over and over again. I was looking for similar functionality as Claude projects. But that's not available for Claude Code Web.
So I asked Claude what would be a way of achieving pretty much the same thing as Projects, and it told me to put all the information I wanted to share in a file named .clinerules. Claude told me I should put that file in the root of my repository.
So please help me, is your recommendation the correct way of doing this, or did Claude give the correct answer?
Maybe you can clear that up by explaining the difference between the two files?
I feel like when I do plan mode (for CC and competing products), it seems good, but when I tell it to execute the output is not what we planned. I feel like I get slightly better results executing from a document in chunks (which of course necessitates building the iterative chunks into the plan).
Proliferating .md files need some attention though.
Yes, the executor only needs the next piece of the plan.
I tend to plan in an entirely different environment, which fits my workflow and has the added benefit of providing a clear boundary between the roles. I aim to spend far more time planning than executing. If I notice I'm getting more caught up in execution than I expected, that's a signal to revise the plan.
You can also use it in conjunction with planning mode—use the documents to pin everything down at a high-to-medium level, then break off chunks and pass those into planning mode for fine-grained code-level planning and a final checking over before implementation.
https://gist.github.com/a-c-m/f4cead5ca125d2eaad073dfd71efbc...
That moves stuff that required manual clarification back into the claude.md (or a useful subset you pick). It does a much better job of authoring claude.md than I do.
> I add to my team’s CLAUDE.md multiple times a week.
How big is that file now? How big is too big?
I am currently working on a new slash command /investigate <service> that runs triage for an active or past incident. I've had Claude write tools to interact with all of our partner services (AWS, JIRA, CI/CD pipelines, GitLab, Datadog), and now when an incident occurs it can quickly put together an early analysis of the incident: finding the right people to involve (not just owners but the people who last touched the service) and potential root causes, including service-dependency investigations.
I am putting this through its paces now but early results are VERY good!
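For anyone curious what such a command looks like on disk, here's a rough sketch, assuming the usual custom-command layout (a markdown file under .claude/commands/ whose name becomes the command, with $ARGUMENTS substituted at call time). The steps and integrations are illustrative, not the author's actual command:
```
# .claude/commands/investigate.md (illustrative)
Triage an incident for the service: $ARGUMENTS

1. Pull recent alerts and dashboards for the service (Datadog).
2. List recent deploys and merged changes (GitLab, CI/CD) and who authored them.
3. Check open JIRA tickets and past incidents referencing the service.
4. Summarize likely root causes, impacted dependencies, and the people to involve.
```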
Ours is maybe half that size. We remove from it with every model release since smarter models need less hand-holding.
You can also break up your CLAUDE.md into smaller files, link CLAUDE.mds, or lazy load them only when Claude works in nested dirs.
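For instance, a split might look something like this (paths invented for illustration); the idea is that the root CLAUDE.md stays small and the nested ones only come into play when Claude is working inside those directories:
```
CLAUDE.md                # global conventions, kept short
services/
  billing/
    CLAUDE.md            # billing-specific rules
  search/
    CLAUDE.md            # search-specific rules
```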
And thank you for your work!! I focus all of my energy on helping families stay safe online, I make educational content and educational products (including software). Claude Code has helped me amplify my efforts and I’m able to help many more families and children as a result. The downstream effects of your work on Claude Code are awesome! I’ve been in IT since 1995 and your tools are the most powerful tools I’ve ever used, by far.
This is the meat of it:
## Code Style (See JULIA_STYLE.md for details)
- Always use explicit `return` statements
- Use Float32 for all numeric computations
- Annotate function return types with `::`
- All `using` statements go in Main.jl only
- Use `error()` not empty returns on failure
- Functions >20 lines need docstrings
## Do's and Don'ts
- Check for existing implementations first
- Prefer editing existing files
- Don't add comments unless requested
- Don't add imports outside Main.jl
- Don't create documentation unless requested
Since Opus 4.0 this has been enough to get it to write code that generally follows our style, even in Julia, which is a fairly niche language.
If you wouldn't mind answering a question for me: it's one of the main things that has kept me from adding Claude in VS Code.
I have a custom 'code style' system prompt that I want claude to use, and I have been able to add it when using claude in browser -
```
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.

Trust the context you're given. Don't defend against problems the human didn't ask you to solve.
```
How can I add it as a system prompt (or if its called something else) in vscode so LLMs adhere to it?
One other feature of CLAUDE.md I’ve found useful is imports: prepending @ to a file name will force it to be imported into context. Otherwise, whether a file is read and loaded into context depends on tool use and planning by the agent (even with explicit instructions like “read file.txt”). Of course this means you have to be judicious with imports.
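For example (file names invented), a couple of lines like these in CLAUDE.md pull those files into context up front, rather than hoping the agent decides to read them:
```
See @docs/style.md for coding conventions.
Architecture overview: @docs/architecture.md
```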
Now I can do it with Claude within minutes, while watching my TV shows on the second monitor and get directly to the good bits, the actual "business logic" of whatever I'm building.
And as for old, I'm 47. I've been programming since I got my first C64 in 1985.
Same here, 47 and got my start programming on a Commodore 64. What’s up, brother?
Context switching between AI-assisted coding and "oops, my tool is refusing to function, guess I'll stop using it" is often worse for productivity than never using the AI to begin with.
You can do this with any agentic harness, just plain prompting and "LLM management skills". I don't have Claude Code at work, but all this applies to Codex and GH Copilot agents as well.
And agreed, Opus 4.5 is next level.
My current understanding is that it’s for demos and toy projects
Don't get me wrong, AI is at least as game-changing for programming as StackOverflow and Google were back in the day. I use it every day, and it's saved me hours of work for certain specific tasks [2]. But it's simply not the massive 10x force multiplier that some might lead you to believe.
I'll start believing when maintainers of complex, actively developed, and widely used open-source projects (e.g. ffmpeg, curl, openssh, sqlite) start raving about a massive uptick in positive contributions, pointing to a concrete influx of high-quality AI-assisted commits.
[0] https://mikelovesrobots.substack.com/p/wheres-the-shovelware...
There is. We had to basically create a new category for them on /r/golang because there was a quite distinct step change near the beginning of this year where suddenly over half the posts to the subreddit were "I asked my AI to put something together, here's a repo with 4 commits, 3000 lines of code, and an AI-generated README.md. It compiles and I may have even used it once or twice." It toned down a bit but it's still half-a-dozen posts a day like that on average.
Some of them are at least useful in principle. Some of them are the same sorts of things you'd see twice a month, only now we can see them twice a week if not twice a day. The problem wasn't necessarily the utility or the lack thereof, it was simply the flood of them. It completely disturbed the balance of the subreddit.
To the extent that you haven't heard about these, I'd observe that the world already had more apps than you could possibly have ever heard about and the bottleneck was already marketing rather than production. AIs have presumably not successfully done much about helping people market their creations.
There was a GitHub PR on the OCaml project where someone crafted a large feature (Apple Silicon debugging support). The PR was rejected because nobody wanted to read it; it was too long. It seems to me that society is not ready for the volume of output generated this way, which may explain the lack of big visible change so far. But I already see people deploying tiny apps made by Claude in a day.
It's gonna be weird...
Context: This news story https://news.ycombinator.com/item?id=44180533
Or could it be that, after the growth and build-out, we are in maintenance mode and need fewer people?
Just food for thought
In two years, 3/4 of us won't be needed anymore.
People think they'll have jobs maintaining AI output, but I don't see how maintaining is that much harder than creating for an LLM able to digest requirements and a codebase and iterate until working code runs.
Back then, we put all the source code into the AI to create things, then we manually put files into context; now it looks for the files it needs on its own. I think we can do even better by letting the AI create file and API documentation and only read the file when really needed, selecting just the APIs and documentation it needs. And I bet more is possible, including skills and MCP on top.
So not only are LLMs getting better, but so is the software using them.
I see it as a competent software developer but one that doesn't know the code base.
I will break down the tasks to the same size as if I was implementing it. But instead of doing it myself, I roughly describe the task on a technical level (and add relevant classes to the context) and it will ask me clarifying questions. After 2-3 rounds the plan usually looks good and I let it implement the task.
This method works exceptionally well and usually I don't have to change anything.
For me this method allows me to focus on the architecture and overall structure and delegate the plumbing to Copilot.
It is usually faster than if I had to implement it and the code is of good quality.
The game changer for me was plan mode. Before it, with agent mode it was hit or miss because it forced me to one shot the prompt or get inaccurate results.
I know what you mean, but the thing I find Windsurf (which we moved to from Copilot) most useful for (apart from writing OpenAPI spec files) is asking it questions about the codebase. Just random minutiae that I could find by grepping or following the code, but that would take me more than the 30s-1m it takes the tool. For reference, this is a monorepo of a bit over 1M LoC (and 800k YAML files, because, did I mention I hate API specs?), so not really a small code base either.
> I will break down the tasks to the same size as if I was implementing it. But instead of doing it myself, I roughly describe the task on a technical level (and add relevant classes to the context) and it will ask me clarifying questions. After 2-3 rounds the plan usually looks good and I let it implement the task.
Here I disagree, sort of. I almost never ask it to do complex tasks, the most time consuming and hardest part is not actually typing out the code, describing it to an AI takes me almost as much time as implementing for most things. One thing I did find very useful is the supertab feature of windsurf, which, at a high level, looks at the changes you started making and starts suggesting the next change. And it's not only limited to repetitive things (like . in vi), if you start adding a parameter to a function, it starts adding it to the docs, to the functions you need below, and starts implementing it.
> For me this method allows me to focus on the architecture and overall structure and delegate the plumbing to Copilot.
Yeah, a coworker said this best, I give it the boring work, I keep the fun stuff for myself.
I described my workflow that has been a game changer for me, hoping it might be useful to another person because I have struggled to use LLMs for more than a Google replacement.
As an example, one task of the feature was to add metrics for observability when the new action was executed. Another when it failed.
My prompt: Create a new metric "foo.bar" in MyMetrics when MyService.action was successful and "foo.bar.failed" when it failed.
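For illustration, roughly the shape of the change such a prompt produces. This is a hedged Python sketch: MyMetrics, MyService.action, and the metric names come from the prompt above, while the increment() API and everything else is assumed.
```
from collections import Counter

class MyMetrics:
    """Toy stand-in for a real metrics client."""
    def __init__(self):
        self.counts = Counter()

    def increment(self, name: str) -> None:
        self.counts[name] += 1

class MyService:
    def __init__(self, metrics: MyMetrics):
        self.metrics = metrics

    def action(self) -> str:
        try:
            result = self._perform()  # placeholder for the existing behavior
        except Exception:
            self.metrics.increment("foo.bar.failed")  # record the failure
            raise
        self.metrics.increment("foo.bar")  # record the success
        return result

    def _perform(self) -> str:
        return "ok"
```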
I review the plan and let it implement it.
As you can see it's a small task and after it is done I review the changes and commit them. Rinse and repeat.
I think the biggest issue is that people try to one shot big features or applications. But it is much more efficient to me to treat Copilot as a smart pair programming partner. There you also think about and implement one task after the other.
Here's an already out of date and unfinished blog post about it: https://williamcotton.com/articles/introducing-web-pipe
Here's a simple todo app: https://github.com/williamcotton/webpipe/blob/webpipe-2.0/to...
Check out the BDD tests in there, I'm quite proud of the grammar.
Here's my blog: https://github.com/williamcotton/williamcotton.com/blob/mast...
It's got an LSP as well with various validators, jump to definitions, code lens and of course syntax highlighting.
I've yet to take screenshots, make animated GIFs of the LSP in action or update the docs, sorry about that!
A good portion of the code has racked up some tech debt, but hey, it's an experiment. I just wanted to write my own DSL for my own blog.
The app is definitely still a bit rough around the edges but it was developed in breakneck speed over the last few months - I've probably seen an overall 5x acceleration over pre-agentic development speed.
I have a React application where the testing situation is FUBAR: we are stuck on an old version of React where tests that really run React, like Enzyme's, are unworkable because the test framework can never know when React is done rendering. Working with Junie, I developed a style of true unit tests for class components (still got 'em) that tests tricky methods in isolation. I have a test file which is well documented, explaining the situation around tests, and I ask "Can we make some tests for A like the tests in B.test.js, how would you do that?" and if I like the plan I say "make it so!" and it does... Frankly, I would not be writing tests if I didn't have that help. It would also be possible to mock useState() and company, and I might do that someday... It doesn't bother me so much that the tests are too tightly coupled, because I can tell Junie to fix or replace the tests if I run into trouble.
For me the key things are: (1) understanding from a project management perspective how to cut out little tasks and questions, (2) understanding enough coding to know if it is on the right track (my non-technical boss has tried vibe coding and gets nowhere), (3) accepting that it works sometimes and sometimes it doesn't, and (4) recognizing context poisoning -- sometimes you ask it to do something and it gets it 95% right and you can tell it to fix the last bit and it is golden, other times it argues or goes in circles or introduces bugs faster than it fixes them and as quickly as you can you recognize that is going on and start a new session and mix up your approach.
These navbars are similar but not the same, both have a pager but they have other things, like one has some drop downs and the other has a text input. Styled "the same" means the line around the search box looks the same as the lines around the numbers in the pager, and Junie got that immediately.
In the end the patch touched css classes in three lines of one file and added a css rule -- it had the caveat that one of the css classes involved will probably go away when the board finally agrees to make a visual change we've been talking about for most of a year but I left a comment in the first navbar warning about that.
There are plenty of times I ask Junie to try to consolidate multiple components or classes into one and it does that too as directed.
You don't just YOLO it. You do extensive planning when features are complex, and you review output carefully.
The thing is, if the agent isn't getting it to the point where you feel like you might need to drop down and edit manually, agents are now good enough to do those same "manual edits" with nearly 100% reliability if you are specific enough about what you want to do. Instead of "build me x, y, z", you can tell it to rename variables, restructure functions, write specific tests, move files around, and so on.
So the question isn't so much whether to use an agent or edit code manually—it's what level of detail you work at with the agent. There are still times where it's easier to do things manually, but you never really need to.
And it makes sense. For most coding problems the challenge isn’t writing code. Once you know what to write typing the code is a drop in the bucket. AI is still very useful, but if you really wanna go fast you have to give up on your understanding. I’ve yet to see this work well outside of blog posts, tweets, board room discussions etc.
The few times I've done that, the agent eventually faced a problem/bug it couldn't solve and I had to go and read the entire codebase myself.
Then I found several subtle bugs (like writing private keys to disk even though there was an explicit instruction not to). Eventually I ended up refactoring most of it.
It does have value on coming up with boilerplate code that I then tweak.
Which might be fine if you're doing proof-of-concept or low-risk code, but it can also bite you hard when there is a bug actively bleeding money and not a single person or AI agent in the house knows how anything works.
Calling this snake oil is like the horse-carriage drivers who were against cars.
Understanding of the code in these situations is more important than the code/feature existing.
I think the reality is a lot of code out there doesn’t need to be good, so many people benefit from agents etc.
Agents make mistakes which need to be corrected, but they also point out edge cases you haven’t thought of.
This is negligence, it's your job to understand the system you're building.
We've been unfucking architecture done like that for a month, after the dev who had a hallucination session with their AI left.
This concerns me because fighting tooling is not a positive thing. It’s very negative and indicates how immature everything is.
Often the challenge is that users aren't interacting with Claude Code about their rules file. If Claude Code doesn't seem to be working with you, ask it why it ignored a rule. It often provides very useful feedback for adjusting the rules so they're no longer violated.
Another piece of advice I can give is to clear your context window often! Early on I was letting the context window auto-compact, but this is bad! Your model is at its freshest and "smartest" when it has a fresh context window.
> @AGENTS.md
What a joke. Claude regularly ignores the file. It's a toss-up: we were playing a game at work to guess which item it would forget first: running tests, the formatter, the linter, etc. This is despite items saying you ABSOLUTELY MUST, you HAVE TO, and so on.
I have cancelled my Claude Max subscription. At least Codex doesn’t tell me that broken tests are unrelated to its changes or complain that fixing 50 tests is too much work.
This drives up price faster than quality though. Also increases latency.
They also recently lowered the price for Opus 4.5, so it is only 1.67x the price of Sonnet, instead of 5x for Opus 4.
I used to spend $200+ an hour on a single developer. I'm quite sure that benevolence was a factor when they submitted an invoice, since there was no real transparency about whether I was being overbilled or whether the developer was acting in my best interest rather than their own.
I'll never forget the one contractor who told me he took a whole 40 hours to do something he could have done in less, specifically because I had allocated that as an upper-bound weekly budget for him.
Do you ever feel bad for basically robbing these poor people blind? They're clearly losing so much money by giving you $1800 in FREE tokens every month. Their business can't be profitable like this, but thankfully they're doing it out of the goodness of their hearts.
I update my CLAUDE.md all the time and notice the effects.
Why all the snark?
If you're continually finding that it's being forgotten, maybe you're not starting fresh sessions often enough.
You can learn how to use it, or you can put it down if you think it doesn't bring you any benefit.
So are animals, but we've used dogs and falcons and truffle hunting pigs as tools for thousands of years.
Non-deterministic tools are still tools, they just take a bunch more work to figure out.
https://simonwillison.net/2025/Dec/10/html-tools/ is the 37th post in my series about this: https://simonwillison.net/series/using-llms/
https://simonwillison.net/2025/Mar/11/using-llms-for-code/ is probably still my most useful of those.
I know you absolutely hate being told you're holding them wrong... but you're holding them wrong.
They're not nearly as unpredictable as you appear to think they are.
One of us is misleading people here, and I don't think it's me.
Firstly, I am not the one with an LLM-influencer side gig. Secondly, no, sorry, please don't move the goalposts. You did not answer my main argument, which is: how does a "tool" which constantly changes its behaviour deserve to be called a tool at all? If a tailor had scissors which sometimes cut the fabric just a bit, and sometimes completely differently, every time they used them, would you tell the tailor he is not using them right too? Thirdly, you are now contradicting yourself. First you said we need to live with the fact that they are unpredictable. Now you are sugarcoating it into being "a bit unpredictable", or "not nearly as unpredictable". I am not sure if you are doing this intentionally or if you really want to believe in the "magic", but either way you are ignoring the ground tenets of how this technology works. I'd be fine if they used it to generate cheap holiday novels or erotica, but clearly four years of experimenting with the crap machines to write code has created a huge pushback in the community. We don't need the proverbial scissors which cut our fabric differently each time!
Let's go with blast furnaces. They're definitely tools. They change over time - a team might constantly run one for twenty years but still need to monitor and adjust how they use it as the furnace itself changes behavior due to wear and tear (I think they call this "drift".)
The same is true of plenty of other tools - pottery kilns, cast iron pans, knife sharpening stones. Expert tool users frequently use tools that change over time and need to be monitored and adjusted.
I do think dogs, horses, and other animal tools remain an excellent example here as well. They're unpredictable and you have to constantly adapt to their latest behaviors.
I agree that LLMs are unpredictable in that they are non-deterministic by nature. I also think that this is something you can learn to account for as you build experience.
I just fed this prompt to Claude Code:
Add to_text() and to_markdown() features to justhtml.html - for the whole document or for CSS selectors against it
Consult a fresh clone of the justhtml Python library (in /tmp) if you need to
It did exactly what I expected it would do, based on my hundreds of previous similar interactions with that tool: https://github.com/simonw/tools/pull/162
I wrote about another solid case study this morning: https://simonwillison.net/2025/Dec/14/justhtml/
I genuinely don't understand how you can look at all of this evidence and still conclude that they aren't useful for people who learn how to use them.
Now let's make the analogy more accurate: let's imagine the blast furnace often ignores the operator controls and just does what it "wants" instead. Additionally, there are no gauges and there is no telemetry you can trust (it might have some, but the furnace will occasionally falsify them, and you won't know when it's doing that).
Let's also imagine that the blast furnace changes behavior minute-to-minute (usually in the middle of the process) between useful output, useless output (requires scrapping), and counterproductive output (requires rework which exceeds the productivity gains of using the blast furnace to begin with).
Furthermore, the only way to tell which one of those 3 options you got, is to manually inspect every detail of every piece of every output. If you don't do this, the output might leak secrets (or worse) and bankrupt your company.
Finally, the operator would be charged for usage regardless of how often the furnace actually worked. At least this part of the analogy already fits.
What a weird blast furnace! Would anyone try to use this tool in such a scenario? Not most experienced metalworkers. Maybe a few people with money to burn. In particular, those who sing the highest praises of such a tool would likely be ignorant of all these pitfalls, or have a vested interest in the tool selling.
Absolutely wrong. If this blast furnace would cost a fraction of other blast furnaces, and would allow you to produce certain metals that were too expensive to produce previously (even with high error rate), almost everyone would use it.
Which is exactly what we're seeing right now.
Yes, you have to distinguish marketing message vs real value. But in terms of bang for buck, Claude Code is an absolute blast (pun intended)!
Totally incorrect: as we already mentioned, this blast furnace actually costs just as much as every other blast furnace to run all the time (which they do). The difference is only in the outputs, which I described in my post and now repeat below, with emphasis this time.
Let's also imagine that the blast furnace changes behavior minute-to-minute (usually in the middle of the process) between useful output, useless output (requires scrapping), and counterproductive output ——>(requires rework which exceeds the productivity gains of using the blast furnace to begin with)<——
Does this describe any currently-operating blast furnaces you are aware of? Like I said, probably not, for good reason.
I couldn't agree more.
I did not say that. I said that most metalworkers familiar with all the downsides (only 1 of which you are referring to here) would avoid using such an unpredictable, uncontrollable, uneconomical blast furnace entirely.
A regular blast furnace requires the user to be careful. A blast furnace which randomly does whatever it wants from minute to minute, producing bad output more often than good, including bad output that costs more to fix than the furnace cost to run, more than any cost savings, with no way to tell or meaningfully control it, is pretty useless.
Saying "be careful" using a machine with no effective observability or predictability or controls is a silly misnomer, when no amount of care will bestow the machine with them.
What other tools work this way, and are in widespread use? You mentioned horses, for example: What do you think usually happens to a deranged, rabid, syphilitic working horse which cannot effectively perform any job with any degree of reliability, and which often unpredictably acts out in dangerous and damaging ways? Is it usually kept on the job and 'run carefully'? Of course not.
Wow, was that a shark just then?
Dogs learn their jobs way faster, more consistently and more expressively than any AI tool.
Trivially, dogs understand "good dog" and "bad dog" for example.
Reinforcement learning with AI tooling clearly seems not to work.
That doesn't match my experience with dogs or LLMs at all.
They fully understand their limitations. Users of accessibility technology are extremely good at understanding the precise capabilities of the tools they use - which reminds me that screenreaders themselves are a great example of unreliable tools due to the shockingly bad web apps that exist today.
I've also discussed the analogy to service dogs with them, which they found very apt given how easily their assistive tool could be distracted by a nearby steak.
The one thing people who use assistive technology do not appreciate is being told that they shouldn't try a technology out themselves because it's unreliable and hence unsafe for them to use!