You probably just don't have the hang of it yet. It's very good, but it's not a mind reader. If you want something specific, articulate it as exactly as you can ("I want a test harness for <specific_tool>, which you can find <here>"). You need to spell out that you want tests that assert on observable outcomes and state rather than internal structure, that use real objects instead of mocks, that use property-based testing for invariants, etc. It's a feedback loop between yourself and the agent that you have to develop a bit before you start seeing "magic" results. A typical session for me looks like:

- I ask for something highly general and claude explores a bit and responds.

- We go back and forth a bit on precisely what I'm asking for. Maybe I correct it a few times and maybe it has a few ideas I didn't know about/think of.

- It writes some kind of plan to a markdown file. In a fresh session I tell a new instance to execute the plan.

- After it's done, I skim the broad strokes of the code and point out any code/architectural smells.

- I ask it to review its own work and then critique that review, etc. We write tests.

Perhaps that sounds like a lot but typically this process takes around 30-45 minutes of intermittent focus and the result will be several thousand lines of pretty good, working code.
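To make the testing style above concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `Inventory` is a hypothetical stand-in for <specific_tool>, and the property check is hand-rolled with the standard library's `random` module rather than a dedicated property-based testing library. The tests only go through the public methods, use a real object, and check one invariant over many random operation sequences.

```python
import random


class Inventory:
    """Hypothetical class under test; stands in for <specific_tool>."""

    def __init__(self):
        self._items = {}  # internal detail the tests never touch

    def add(self, name, qty):
        if qty <= 0:
            raise ValueError("qty must be positive")
        self._items[name] = self._items.get(name, 0) + qty

    def remove(self, name, qty):
        """Remove up to qty units and return how many were actually removed."""
        taken = min(qty, self._items.get(name, 0))
        if taken:
            self._items[name] -= taken
        return taken

    def count(self, name):
        return self._items.get(name, 0)


def test_remove_reports_what_was_taken():
    inv = Inventory()  # real object, no mock
    inv.add("bolt", 3)
    assert inv.remove("bolt", 5) == 3  # observable return value
    assert inv.count("bolt") == 0      # observable state via the public API


def test_count_matches_model(trials=200, seed=0):
    """Invariant: count() equals adds minus removals and is never negative."""
    rng = random.Random(seed)  # seeded so failures are reproducible
    for _ in range(trials):
        inv, expected = Inventory(), 0
        for _ in range(rng.randint(0, 30)):
            qty = rng.randint(1, 10)
            if rng.random() < 0.5:
                inv.add("bolt", qty)
                expected += qty
            else:
                expected -= inv.remove("bolt", qty)
            assert inv.count("bolt") == expected >= 0
```

Neither test reads `_items`, so the storage could change from a dict to something else without touching the tests.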

reply
I absolutely have the hang of Claude and I still find that it makes those ridiculous mistakes, like replicating logic into a test rather than testing a function directly, talking to a local pg that was stale/running, etc. I have a ton of skills and pre-written prompts for testing practices but, over longer contexts, it will forget and do these things, or get confused, etc.

You can minimize these problems with TLC but ultimately it just will keep fucking up.
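For anyone who hasn't hit the "replicating logic into a test" mistake, a rough sketch of what it looks like, in Python. `apply_discount` is a hypothetical function made up for this example, not anything from a real codebase.

```python
def apply_discount(total_cents, percent):
    """Hypothetical function under test: integer cents, rounds the discount down."""
    return total_cents - (total_cents * percent) // 100


def test_replicates_logic():
    # Anti-pattern: the test re-implements the formula and never calls
    # apply_discount, so it keeps passing even if the real function breaks.
    total, percent = 1999, 15
    discounted = total - (total * percent) // 100
    assert discounted == 1700


def test_calls_function_directly():
    # Exercises the real function against values worked out by hand.
    assert apply_discount(1999, 15) == 1700
    assert apply_discount(1000, 0) == 1000
    assert apply_discount(1000, 100) == 0
```

Both tests are green today, but only the second one would go red if someone changed `apply_discount`.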

reply
My favorite is when you need to rebuild/restart outside of claude and it will "fix the bug" and argue with you about whether or not you actually rebuilt and restarted whatever it is you're working on. It would rather call you a liar than realize it didn't do anything.
reply
This is a pretty annoying problem -- I solve it by explicitly asking Claude to always run the right build command after each batch of modifications, etc.
reply
"That's an old run, rebuild and the new version will work" lol
reply
With the back and forth refining I find it very useful to tell Claude to 'ask questions when uncertain' and/or to 'suggest a few options on how to solve this and let me choose / discuss'

This has made my planning / research phase so much better.

reply
Yes pretty much my workflow. I also keep all my task.md files around as part of the repo, and they get filled up with work details as the agent closes the gates. At the end of each one I update the project memory file, this ensures I can always resume any task in a few tokens (memory file + task file == full info to work on it).
reply
Pretty good workflow. But you need to change the order of the tests and have it write the tests first. (TDD)
reply
I mean I’ve been using AI close to 4 years now and I’ve been using agents off and on for over a year now. What you’re describing is exactly what I’m doing.

I’m also not seeing anyone at work, out of hundreds of devs, who is regularly cranking out several thousand lines of pretty good working code in 30-45 minutes.

What’s an example of something you built today like this?

reply
> After about 4 hours and $75

Huh? The max plan is $200/month. How are you spending $75 in 4 hrs?

reply
Curious what language and stack. And have people at your company had marginally more success with greenfield projects like prototypes? I guess that’s what you’re describing, though it sounds like it’s a directory in a monorepo maybe?
reply
This was in Go, but my org also uses TypeScript and Elixir.

I’ve had plenty of success with greenfield projects myself, but that was using the Copilot agent with Opus 4.5 and 4.6. I completely vibecoded a small game for my 4 year old in 2 hours. It’s probably 20% of the way to being production ready if I wanted to release it, but it works and he loves it.

And yes people have had success with very simple prototypes and demos at work.

reply
Try https://github.com/gsd-build/get-shit-done. It's been a game changer for me.
reply
Similar experience. I use these AI tools on a daily basis. I have tons of examples like yours. In one recent instance I explicitly told it in the prompt to not use memcpy, and it used memcpy anyway, and generated a 30-line diff after thinking for 20 minutes. In that amount of time I created a 10-line diff that didn't use memcpy.

I think it's the big investors' extremely powerful incentives manifesting in the form of internet comments. The pace of improvement peaked at GPT-4. There is value in autocomplete-as-a-service, and the "harnesses" like Codex take it a lot farther. But the people who are blown away by these new releases either don't spend a lot of time writing code, or are being paid to be blown away. This is not a hockey stick curve. It's a log curve.

Bigger context windows are a welcome addition. And stuff like JSON inputs is nice too. But these things aren't gonna like, take your SWE job, if you're any good. It's just like, a nice substitute for the Google -> Stack Overflow -> Copy/Paste workflow.

reply
Most devs aren't very good. That's the reality, it's what we've all known for a long time. AI is trained on their code, and so these "subpar" devs are blown away when they see the AI generate boring, subpar code.

The second you throw a novel constraint into the mix things fall apart. But most devs don't even know about novel constraints let alone work with them. So they don't see these limitations.

Ask an LLM to not allocate? To not acquire locks? To ensure reentrancy safety? It'll fail - it isn't trained on how to do that. Ask it to "rank" software by some metric? It ends up just spitting out "community consensus" because domain expertise won't be highly represented in its training set.

I love having an LLM to automate the boring work, to do the "subpar" stuff, but they have routinely failed at doing anything I consider to be within my core competency. Just yesterday I used Opus 4.6 to test it out. I checked out an old version of a codebase that was built in a way that is totally inappropriate for security. I asked it to evaluate the system. It did far better than older models but it still completely failed in this task, radically underestimating the severity of its findings, and giving false justifications. Why? For the very obvious reason that it can't be trained to do that work.

reply
> people who are blown away by these new releases either don't spend a lot of time writing code, or are being paid to be blown away

Careful, or you're going to get slapped by the stupid astroturfing rule... but you're correct. Also there's the sunk cost fallacy, post-purchase rationalization, choice-supportive bias, hell look at r/MyBoyfriendIsAI... some people get very attached to these bots, they're like their work buddies or pets, so you don't even need to pay them, they'll glaze the crap out of it themselves.

reply