I think this is the main point where many people’s work differs. Most of my work I know roughly what needs changing and how things are structured but I jump between codebases often enough that I can’t always remember the exact classes/functions where changes are needed. But I can vaguely gesture at those specific changes that need to be made and have the AI find the places that need changing and then I can review the result.
I rarely get the luxury of working in a single codebase for a long enough period of time to get so familiar with it that I can jump to particular functions without much thought. That means AI is usually a better starting point than me fumbling around trying to find what I think exists but I don’t know where it is.
I'm thinking about how to solve the problem and how to express it in the programming language such that it is easy to maintain. Getting someone/something else to do that doesn't help me.
But different strokes for different folks, I suppose.
And I have the AI deal with "knowing how to do it" as well. Often it's slower to have it do enough research to know how to do it, but my time is more expensive than Claude's time, and so as long as I'm not sitting around waiting it's a net win.
Gemini running an benchmark- everything ran smoothly for an hour. But on verification it had hallucinated the model used for judging, invalidating the whole run.
Another task used Opus and I manually specified the model to use. It still used the wrong model.
This type of hallucination has happened to me at least 4-5 times in the past fortnight using opus 4.6 and gemini-3.1-pro. GLM-5 does not seem to hallucinate so much.
So if you are not actively monitoring your agent and making the corrections, you need something else that is.
Also instead of just prompting, having it write a quick summary of exactly what it will do where the AI writes a plan including class names branch names file locations specific tests etc. is helpful before I hit go, since the code outline is smaller and quicker to correct.
That takes more wall clock time per agent, but gets better results, so fewer redo steps.
Hopefully there is some of lint process to catch my human hallucinations and typos.
This is exactly the sort of future I'm afraid of. Where the people who are ostensibly hired to know how stuff works, out source that understanding to their LLMs. If you don't know how the system works while building, what are you going to when it breaks? Continue to throw your LLM at it? At what point do you just outsource your entire brain?
I’m not sure it’s really true in practice yet, but that would certainly be the claim.
Because, after they're done/have finished executing, I guess you still have to "check" their output, integrate their results into the bigger project they're (supposedly) part of etc, and for me the context-switching required to do all that is mentally taxing. But maybe this only happens because my brain is not young enough, that's why I'm asking.
I think trying agents to do larger tasks was always very hit or miss, up to about the end of last year.
In the past couple of months I have found them to have gotten a lot better (and I'm not the only one).
My experience with what coding assistants are good for shifted from:
smart autocomplete -> targeted changes/additions -> full engineering
To answer your question, I’ve tried both Claude code and Antigravity in the last 2 weeks and I’m still finding them struggling. AG with Gemini regularly gets stuck on simple issues and loops until I run out of requests, and Claude still just regularly goes on wild tangents not actually solving the problem.
I think it can (and is) shifting very rapidly. Everyone is different, and I’m sure models are better at different types of work (or styles of working), but it doesn’t take much to make it too frustrating to use. Which also means it doesn’t take much to make it super useful.
Opus 4.6 has been out for less than a month. If it was a big shift surely we'd see a massive difference over 4.5 which was november. I think this proves the point, you're not seeing seisimic shifts every 3 months and you're not even clear about which model was the fix.
> I think it can (and is) shifting very rapidly.
Shifting, maybe. But shuffling deck chairs every 3 months.
Especially good to navigate the code if you're unfamiliar with it (the code). If you have known the code for good, you'll find it's usually faster to debug and code by yourself.
Opus 4.6 with claude code vscode extension
No. The parent comment said I needed a new model, which I've tried. Being told "just try something else aswell" kind of proves the point.
Perfect example. You mean the C compiler that literally failed to compile a hello world [0] (which was given in it's readme)?
> What do you consider simple issues?
Hallucinating APIs for well documented libraries/interfaces, ignoring explicit instructions for how to do things, and making very simple logic errors in 30-100 line scripts.
As an example, I asked Claude code to help me with a Roblox game last weekend, and specifically asked it to "create a shop GUI for <X> which scales with the UI, and opens when you press E next to the character". It proceeded to create a GUI with absolute sizings, get stuck on an API hallucination for handling input, and also, when I got it unstuck, it didn't actually work.
[0] https://github.com/anthropics/claudes-c-compiler/issues/1
But the most important thing is that they were reverse engineering gcc by using it as an oracle. And it had gcc and thousands of other c compilers in its training set.
So if you are a large corporation looking to copy GPL code so that you can use it without worrying about the license, and the project you want to copy is a text transformer with a rigorously defined set of inputs and outputs, have at it.
Pretty recently (a couple weeks ago). I give agentic workflows a go every couple of weeks or so.
I should say, I don't find them abysmal, but I tend to work in codebases where I understand them, and the patterns really well. The use cases I've tried so far, do sort of work, just not yet at least, faster than I'm able to actual write the code myself.
> smart autocomplete -> targeted changes/additions -> full engineering
Define "full engineering". Because if you say "full engineering" I would expect the agent to get some expected product output details as input and produce all by itself the right implementation for the context (i.e. company) it lives in.
I.e. the point where the agent writes all the code and you just verify.
> I hear a lot of people saying the same, but similarly a bunch of people I respect saying they barely write code anymore. It feels a little tricky to square these up sometimes.
It squares up just fine.
You ever read a blog post or comment and think "Yeah, this is definitely AI generated"? If you can recognise it, would you accept a blog post, reviewed by you, for your own blog/site?
I won't; I'll think "eww" and rewrite.
The developers with good AI experiences don't get the same "eww" feeling when reading AI-generated code. The developers with poor AI experiences get that "eww" feeling all the time when reviewing AI code and decide not to accept the code.
Well, that's my theory anyway.
I do this with code too.
In my experience, this heavily depends on the task, and there's a massive chasm between tasks where it's a good and bad fit. I can definitely imagine people working only on one side of this chasm and being perplexed by the other side.
1) Having review loops between agents (spawn separate "reviewer" agents) and clear tests / eval criteria improved results quite a bit for me. 2) Reviewing manually and giving instructions for improvements is necessary to have code I can own
I’ve yet to see these things do well on anything but trivial boilerplate.
The benefit is I can keep some things ticking over while I’m in meetings, to be honest.
I don’t think you have to square them because those sentiments are coming from different people. They are also coming from people at different points along the adoption curve. If you are struggling and you see other people struggling at the beginning of the adoption curve it can be quite difficult difficult to understand someone who is further along and does not appear to be struggling.
I think a lot of folks who have struggled with these tools do so because both critics and boosters create unrealistic expectations.
What I recommend is you keep trying. This is a new skill set. It is a different skill set. Which other skills that existed in the past remain necessary is not known.