It gave me an impressive plan of attack, including a reasonable way to determine which code it could safely modify. I told it to start with just a few files and let me review; its changes looked good. So I told it to proceed with the rest of the code.
It made hundreds of changes, as expected (big code base). And most of them were correct! Except the places where it decided to do things like put its "const x = useMemo(...)" call after some piece of code that used the value of "x", meaning I now had a bunch of undefined variable references. There were some other missteps too.
I tried to convince it to fix the places where it had messed up, but it quickly started wanting to make larger structural changes (extracting code into helper functions, etc.) rather than just moving the offending code a few lines higher in the source file. Eventually I gave up trying to steer it and, with the help of another dev on my team, fixed up all the broken code by hand.
It probably still saved time compared to making all the changes myself. But it was way more frustrating.
I have heard from people who regularly push a session through multiple compactions. I don’t think this is a good idea. I virtually never do this — when I see context getting up to even 100k, I start making sure I have enough written to disk to type /new, pipe it the diff so far, and just say “keep going.” I learned recently that even essentials like the CLAUDE.md part of the prompt get diluted through compactions. You can write a hook to re-insert it but it's not done by default.
This fresh context thing is a big reason subagents might work where a single agent fails. It’s not just about parallelism: each subagent starts with a fresh context, and the parent agent only sees the result of whatever the subagent does — its own context also remains clean.
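A toy sketch of that idea in Python, in case it helps. Everything here (the `Agent` class, `run_subagent`, `noisy_search`) is made up for illustration and is not any real agent API; "context" is just a list of messages. The point is only that the parent ends up holding the subagent's result, not the fifty intermediate steps that produced it.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Agent:
    """A toy agent whose 'context' is just the list of messages it has seen."""
    name: str
    context: List[str] = field(default_factory=list)

    def see(self, message: str) -> None:
        self.context.append(message)


def run_subagent(task: str, work: Callable[[Agent], str]) -> str:
    """Run `work` in a brand-new agent and return only its final result."""
    sub = Agent(name=f"sub:{task}")  # fresh context: nothing inherited from the parent
    sub.see(task)
    return work(sub)                 # intermediate messages stay inside `sub`


def noisy_search(agent: Agent) -> str:
    # Hypothetical work: many intermediate steps fill up the subagent's context.
    for i in range(50):
        agent.see(f"read file_{i}.ts")
    return "3 call sites need updating"


parent = Agent(name="parent")
parent.see("plan: find call sites, then update them")
parent.see(run_subagent("find call sites", noisy_search))

print(len(parent.context))  # 2
```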
as a cheapass, being able to pass off the simple work to cheaper $ per token agents is also just great. I've got a handful of tasks I can happily delegate to a haiku agent, and anything requiring a bit of reasoning goes to sonnet.
Feel like opus is almost a cheatcode when i do get stuck, i just bust out a full opus workflow instead and it just destroys everything i was struggling with usually. like playing on easy mode.
as cool as this stuff is, kinda still wish i was just grandfathered into the plan with no weekly limit and only the 5 hour window limits, i'd just be happily hammering opus blissfully.
This is the true power of agent teams: https://code.claude.com/docs/en/agent-teams
You maintain very low context usage in the main thread (just orchestration and planning details), while each individual team member remains responsible for its own context. That lets you churn through millions of output tokens in a fraction of the time.
There's probably a parallel with the CMSes and frameworks of the 2000s (e.g. WordPress or Ruby on Rails). They massively improved productivity, but as a junior developer you could get pretty stuck if something broke or you needed to implement an unconventional feature. I guess it must feel a bit similar for non-developers using tools like Claude Code today.
Things have changed. The models have reached a level of coherence where they can be left to make the right decisions autonomously. Opus 4.6 is in a class of its own now.
The problem wasn't that it lost track of which changes it needed to make, so I don't think checking items off a todo list would have helped. I believe it did actually change all the places in the code it should have. It just made the wrong changes sometimes.
But also, the claim I was responding to was, "I start with a PRD, ask for a step-by-step plan, and just execute on each step at a time." If I have to tell it how to organize its work and how to keep track of its progress and how to execute all the smaller chunks of work, then I may get good results, but the tool isn't as magical (for me, anyway) as it seems to be for some other people.
> Sometimes ideas are dumb, but checking and guiding step by step helps it ship working things in hours.
which matches my experience exactly. I consider it to be about as magical as the parent comment is claiming, but I wouldn’t call it totally automatic.
Definitely not ideal, but sure helps.
You need to converge on the requirements.
To echo what the parent comment said, it's almost frustrating how effective it can be at certain tasks that I wouldn't ever have the patience for. At my job recently I needed to prototype calling some Python code via WASM using the Rust wasmtime engine. Once I had set up the code structure to have the bytes for the WASM component, the arguments I wanted to pass to the function, and the WIT describing the interface for the function, it was able to fill in all of the boilerplate needed so that the function calls worked properly within a minute or two on the first try. Reading through all the documentation and figuring out exactly which half dozen assorted things I had to import and hook up together in the correct order would probably have taken me an hour at minimum.
I don't have any particular insight on whether or not these tools will become even more powerful over time, and I still have fairly strong concerns about how AI tools will affect society (both in terms of how they're used and the amount of energy used to produce them in the first place), but given how much the tech industry tends to prioritize productivity over social concerns, I have to assume that my future employment is going to be heavily impacted by my willingness to adopt and use these tools. I can't deny at this point that having it as an option would make me more productive than if I refuse to use it, regardless of my personal opinions on it.
Just today I asked Claude using opus 4.6 to build out a test harness for a new dynamic database diff tool. Everything seemed to be fine but it built a test suite for an existing diff tool. It set everything up in the new directory, but it was actually testing code and logic from a preexisting directory despite the plan being correct before I told it to execute.
I started over and wrote out a few skeleton functions myself, then asked it to write tests for those to test for some new functionality. My plan was to then ask it to add that functionality using the tests as guardrails.
Well the tests didn’t actually call any of the functions under test. They just directly implemented the logic I asked for in the tests.
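For anyone who hasn't run into this failure mode, here is roughly what it looks like, as a minimal Python sketch. `changed_keys` is a hypothetical stand-in for the function under test, not the actual diff tool from this story. The first test re-derives the answer inline and never touches the real function, so it passes no matter how broken `changed_keys` is; the second one actually exercises it.

```python
def changed_keys(old: dict, new: dict) -> set:
    """Hypothetical function under test: keys whose values differ, or that
    exist on only one side."""
    return {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}


def test_reimplements_logic() -> None:
    # Anti-pattern: the logic is rewritten inside the test, so this passes
    # regardless of what changed_keys() does (it is never called).
    old, new = {"a": 1, "b": 2}, {"a": 1, "b": 3}
    diff = {k for k in old if old[k] != new[k]}
    assert diff == {"b"}


def test_calls_function_under_test() -> None:
    # The real function is called and compared to a hand-written expectation.
    old, new = {"a": 1, "b": 2}, {"a": 1, "b": 3, "c": 4}
    assert changed_keys(old, new) == {"b", "c"}
```

A quick review heuristic: grep each test body for the name of the function it claims to cover.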
After $50 and 2 hours I finally got something working only to realize that instead of creating a new pg database to test against, it found a dev database I had lying around and started adding tables to it.
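One cheap guard against that kind of thing, sketched in Python under the assumption that throwaway test databases follow a naming convention. The `test_` prefix and both helper names are my own invention, and the actual create/drop calls against Postgres are left out on purpose:

```python
import uuid

TEST_DB_PREFIX = "test_"  # assumed naming convention for throwaway databases


def new_test_db_name() -> str:
    """Generate a unique throwaway database name for one test run."""
    return f"{TEST_DB_PREFIX}{uuid.uuid4().hex[:12]}"


def assert_safe_to_modify(db_name: str) -> None:
    """Refuse to create or drop tables in anything that is not a test database."""
    if not db_name.startswith(TEST_DB_PREFIX):
        raise RuntimeError(f"refusing to modify non-test database {db_name!r}")


assert_safe_to_modify(new_test_db_name())  # passes silently
try:
    assert_safe_to_modify("dev")           # hypothetical leftover dev database
except RuntimeError as err:
    print(err)
```

Calling the guard at the top of the test setup means a wrong connection string fails loudly instead of quietly adding tables to whatever database happens to be reachable.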
When I managed to fix that, it decided that it needed to rebuild multiple docker components before each test and tear them down after each one.
After about 4 hours and $75, I managed to get something working that was probably more code than I would have written in 4 hours, but I think it was probably worse than what I would have come up with on my own. And I really have no idea if it works because the day was over and I didn’t have the energy left to review it all.
We’ve recently been tasked at work with spending more money on Claude (not being more productive; the metric is literally spending more money) and everyone is struggling to do anything like what the posts on HN say they are doing. So far no one in my org at a very large tech company has managed to do anything very impressive with Claude other than bringing down prod 2 days ago.
Yes I’m using planning mode and clearing context and being specific with requirements and starting new sessions, and every other piece of advice I’ve read.
I’ve had much more luck using opus 4.6 in vs studio to make more targeted changes, explain things, debug etc… Claude seems too hard to wrangle and it isn’t good enough for you to be operating that far removed from the code.
- I ask for something highly general and claude explores a bit and responds.
- We go back and forth a bit on precisely what I'm asking for. Maybe I correct it a few times and maybe it has a few ideas I didn't know about/think of.
- It writes some kind of plan to a markdown file. In a fresh session I tell a new instance to execute the plan.
- After it's done, I skim the broad strokes of the code and point out any code/architectural smells.
- I ask it to review its own work and then critique that review, etc. We write tests.
Perhaps that sounds like a lot but typically this process takes around 30-45 minutes of intermittent focus and the result will be several thousand lines of pretty good, working code.
You can minimize these problems with TLC but ultimately it just will keep fucking up.
This has made my planning / research phase so much better.
I’m not seeing anyone at work either out of hundreds of devs who is regularly cranking out several thousand lines of pretty good working code in 30-45 minutes.
What’s an example of something you built today like this?
Huh? The max plan is $200/month. How are you spending $75 in 4 hrs?
I’ve had plenty of success with greenfield projects myself but using the copilot agent and opus 4.5 and 4.6. I completely vibecoded a small game for my 4 year old in 2 hours. It’s probably 20% of the way to being production ready if I wanted to release it, but it works and he loves it.
And yes people have had success with very simple prototypes and demos at work.
I think it's the big investors' extremely powerful incentives manifesting in the form of internet comments. The pace of improvement peaked at GPT-4. There is value in autocomplete-as-a-service, and the "harnesses" like Codex take it a lot farther. But the people who are blown away by these new releases either don't spend a lot of time writing code, or are being paid to be blown away. This is not a hockey stick curve. It's a log curve.
Bigger context windows are a welcome addition. And stuff like JSON inputs is nice too. But these things aren't gonna like, take your SWE job, if you're any good. It's just like, a nice substitute for the Google -> Stack Overflow -> Copy/Paste workflow.
The second you throw a novel constraint into the mix things fall apart. But most devs don't even know about novel constraints let alone work with them. So they don't see these limitations.
Ask an LLM to not allocate? To not acquire locks? To ensure reentrancy safety? It'll fail - it isn't trained on how to do that. Ask it to "rank" software by some metric? It ends up just spitting out "community consensus" because domain expertise won't be highly represented in its training set.
I love having an LLM to automate the boring work, to do the "subpar" stuff, but they have routinely failed at doing anything I consider to be within my core competency. Just yesterday I used Opus 4.6 to test it out. I checked out an old version of a codebase that was built in a way that is totally inappropriate for security. I asked it to evaluate the system. It did far better than older models but it still completely failed in this task, radically underestimating the severity of its findings, and giving false justifications. Why? For the very obvious reason that it can't be trained to do that work.
Careful, or you're going to get slapped by the stupid astroturfing rule... but you're correct. Also there's the sunk cost fallacy, post purchase rationalization, choice supportive bias, hell look at r/MyBoyfriendIsAI... some people get very attached to these bots, they're like their work buddies or pets, so you don't even need to pay them, they'll glaze the crap out of it themselves.
GPT 5.4 on codex cli has been much more reliable for me lately. I used to have opus write and codex review, I now do the opposite (I actually have codex write and both review in parallel).
So on the latest models for my use case gpt > opus but these change all the time.
Edit: also the harness is shit. Claude code has been slow, weird and a resource hog. Refuses to read now standardized .agents dirs so I need symlink gymnastics. Hides as much info as it can… Codex cli is working much better lately.
Kinda funny how you don't actually need to use coercion if you put in the engineering work to build a product that's competitive on its own technical merits...
If you're not using AI you are cooked. You just don't realize it yet.
Truth. But not just “using”.
Because here’s where this ship has already landed: humans will not write code, humans will not review code.
I see mostly rage against this idea, but it is already here. Resistance is futile. There will be no “hand crafted software” shops. You have at most 3-4 years left if you think this is your job.
People should still understand the code because sometimes the AI solution really is wrong and I have to shove my hand in its guts and force it to use my solution or even explain the reasoning.
People should be studying architecture. Cause now I can orchestrate stuff that used to take teams and that I would have thrown away as a non-viable idea. Now I can just do it. But no, you will still be reviewing code.
All programming is like this to some extent, but Claude's 80/20 behavior is so much more extreme. It can almost build anything in 15-30 minutes, but after those 15-30 minutes are up, it's only "almost built". Then you need to spend hours, days, maybe even weeks getting past the "almost".
Big part of why everyone seems to be vibe coding apps, but almost nobody seems to be shipping anything.
I also thought it was OPUS 4.5 (also tested a lot with 4.6) and then in February switched to only using auto mode in the coding IDEs. They do not use OPUS (most of the time), and I’m ending up with a similar result after a very rough learning curve.
Now switching back to OPUS I notice that I get more out of it, but it’s no longer a huge difference. In a lot of cases OPUS is actually in the way after learning to prompt more effectively with cheaper models.
The big difference now is that I’m just paying $60-90 a month for 40-50hrs of weekly usage… while I was inching towards $1000 with OPUS. I chose these auto modes because they don’t dig into usage based pricing or throttling, which is a pretty sweet deal.
Is it Baader-Meinhof or is everyone on HN suddenly using obscure acronyms?
[0] https://en.wikipedia.org/wiki/Software_requirements_specific...
[1] https://news.ycombinator.com/item?id=47323316 who the hell knows that version of "RSI"?
1000% agree. It's also easy to talk to it about something you're not sure it said and derive a better, more elegant solution with simple questioning.
Gemini 3.1 also gives me these vibes.
Super simple problem:
I had a ZMK keyboard layout definition that I wanted it to convert to QMK for a different keyboard that had one key less, so it just had to trim one outer key. It took like 45 minutes of back and forth to get it right - I could have done it in 30 min manually tops, including looking up docs for everything.
Capability isn't the impressive part; it's the tenacity/endurance.
It was about a problem with calculation around filling a topographical water basin with sedimentation where calculation is discrete (e.g. turn based), and the edge case where both water and sediments would overflow the basin. To make the matter simple: the facts were A, B, C, and it oscillated between explanation 1, which refuted C, explanation 2, which refuted A, and explanation 3, which refuted B.
I'll give it to opus training stability that my 3 tries using it all consistently got into this loop, so I decided to directly order it to do a brute force solution that avoided (but didn't solve) this problem.
I did feel like, with a human, there's no way those 3 loops would still be happening by the second time. Or at least with the majority of us. But there is just no way to get through to opus 4.6.
Horizontal parallelising of tasks doesn't really require any modern tech.
But I agree that Opus 4.6 with 1M context window is really good at lots of routine programming tasks.
Spent an hour or so unraveling the mess. My feelings are growing more and more conflicted about these tools. They are here to stay obviously.
I’m honestly uncertain about the junior engineers I’m working with who are more productive than they might be otherwise, but are gaining zero (or very little) experience. It’s like the future is a world where the entire programming sphere is dominated by the clueless non technical management that we’ve all had to deal with in small proportion a time or two.
Well, (economic) progress means being able to do more with less. A Fordian-style conveyor belt factory can churn out cars with relatively unskilled labour.
Economising on human capital is economising on a scarce input.
We had these kinds of shifts before. Compare also how planes used to have a pilot, copilot and flight engineer. We don't have that anymore, but it used to be a place for people to learn. But pilot education has adapted.
Or check how spreadsheet software has removed a lot of the worst rote work in finance. That change happened perhaps in the 1980s. Finance has adapted.
> Opus helped me brick my RPi CM4 today. It glibly apologized for telling to use an e instead of a 6 in a boot loader sequence.
Yes, these things do best when they have a (simulated) environment they can make mistakes in and that can give them clear and fast feedback.
This always felt like a reason to throw it at coding. With its rigid syntax you'll know quickly and cheaply if what was written passes an absolute minimal level of quality.
Sounds like it is.
That being said it's the only use case for me. I won't subscribe to something that I can't use with a third-party harness.
Not sure if this means I should get a more interesting job or if we are all going to be at the mercy of UBI eventually.
RIP widespread middle class. It was a good 80-year run.
We know this to be true with a reasonable degree of certainty
> likely a society
This one, not so much. We could potentially have pretty vibrant societies even if everyone is not ultra rich, not going on international vacations, not having access to buy things from the other end of the world subsidized by economies of scale.
You likely wouldn't need money at all in that future, for example. What does the money really mean when everyone is guaranteed to have all the basics covered? Is money really helping to store value created via labor when there is no labor? And is money providing price discovery when the costs of resources and manufacturing are moving towards zero?
If labor is replaced with tech, and I think that's a big if, I don't see any outcome other than a totalitarian dystopia that will fail much like the Soviet Union.
Sure I'm talking the future so it's speculative, but I'd love to hear a scenario where it works well sustainably and doesn't turn into a totalitarian dystopia.
I had Opus 4.6 tell me I was "seeing things wrong" when I tried to have it correct some graphical issues. It got stuck in a loop of re-introducing the same bug every hour or so in an attempt to fix the issue.
I'm not disagreeing with your experience, but in my experience it is largely the same as what I had with Opus 4.5 / Codex / etc.
It started by insisting I was repeatedly making a typo and still would not budge even after I started copy/pasting the full terminal history of what I was entering and the unabridged output, and eventually pivoted to darkly insinuating I was tampering with my shell environment as if I was trying to mislead it or something.
Ultimately it turned out that it forgot it was supposed to be applying the fixes to the actual server instead of the local dev environment, and had earlier in the conversation switched from editing directly over SSH to pushing/pulling the local repo to the remote due to diffs getting mangled.
I'm late to the party and I'm just getting started with Anthropic models. I have been finding Sonnet decent enough, but it seems to have trouble naming variables correctly (it's not just that most names are poor and undescriptive; sometimes the names are wrong or confusing), and it sometimes unnecessarily declares and re-declares variables, or encodes and decodes a value rather than using the one that's already there. Is Opus better at this?
Also shout out to beads - I highly recommend you pair it with beads from yegge: opus can lay out a large project with beads, and keep track of what to do next and churn through the list beautifully with a little help.
The amount of genuine fuck-ups Codex finds makes me skeptical of people who are placing a lot of trust in Claude alone.
Not even close. There are still tons of architectural design issues that I'd find it completely useless at, tons of subtle issues it won't notice.
I never run agents by themselves; every single edit they do is approved by me. And, I've lost track of the innumerable times I've had to step in and redirect them (including Opus) to an objectively better approach. I probably should keep a log of all that, for the sake of posterity.
I'll grant you that for basic implementation of a detailed and well-specced design, it is capable.
There's probably more examples, but to me AGI must move beyond the above issues. Though frankly the context window might be more a symptom of a poor harness than anything; still, it illustrates my general issue with them being considered AGI as it stands today.
Claude 4.6 is getting crazy good though, i'll give you that.
1. Click in the bar at the top of the page that says ycombinator.com
2. type this in: youtube.com
3. press enter
4. There will be a box at the top that says "search", click that
5. type in "tips and tricks for agentic coding"
6. press enter
7. a list of videos should appear, watch them all
The shift I've noticed: 1M context makes "load the whole codebase once, run many agents" viable, whereas before you were constantly re-chunking and losing context. The per-task cost goes up but the time-to-correct-output drops significantly.
The harder problem for most teams is routing — knowing which tasks actually need Opus at 1M vs. Sonnet at 200k. Opus 4.6 at 1M is overkill for 80% of coding tasks. The ROI only works if you're being intentional about when to use it.
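To make "being intentional" a bit more concrete, here is a minimal routing sketch in Python. The 200k and 1M limits come from the numbers above; the `Task` fields, the tier labels (`"sonnet-200k"`, `"opus-1m"`), and the rule itself are assumptions for illustration, not real API model identifiers or anyone's production policy.

```python
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    est_context_tokens: int     # rough estimate of how much code must be loaded
    needs_deep_reasoning: bool  # a human (or a cheap classifier) decides this


SONNET_LIMIT = 200_000   # context sizes as quoted in the comment above
OPUS_LIMIT = 1_000_000


def route(task: Task) -> str:
    """Pick the cheapest tier that plausibly fits the task."""
    if task.est_context_tokens > OPUS_LIMIT:
        raise ValueError("task does not fit in any context window; split it up")
    if task.needs_deep_reasoning or task.est_context_tokens > SONNET_LIMIT:
        return "opus-1m"
    return "sonnet-200k"


print(route(Task("rename a prop across a few files", 20_000, False)))        # sonnet-200k
print(route(Task("untangle a cross-module race condition", 600_000, True)))  # opus-1m
```

The hard part in practice is filling in `needs_deep_reasoning` honestly; the routing function itself is trivial once that call has been made.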