This sounds like arguing you can use these models to beat a game of whack-a-mole if you just know all the unknown unknowns and prompt it correctly about them.
This is an assertion that is impossible to prove or disprove.
I rarely have blocks of "flow time" to do focused work. With LLMs I can keep progressing in parallel, and when I do get a block of time where I can actually dive deep, it's review and guidance again - focus on the high-impact stuff instead of the noise.
I don't think I'm any faster with this than my theoretical speed. LLMs spend a lot of time rebuilding context between steps; I have a feeling the current generation of agents is terrible at maintaining context for larger tasks. I also suspect the advertised model context length is quite a lie - they might support working with 100k tokens, but agents keep reloading stuff into context because old material gets ignored.
In practice I can get more done because I can get into the flow, and back onto the task, a lot faster. We'll see how this pans out long term, but in my current role I don't think there are alternatives; my performance would be shit otherwise.
No, but they can take "notes" and can load those notes into context. That does work, but it's of course not as easy as it is with humans.
It is all about cleaning up and maintaining a tidy context.
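A minimal sketch of what "notes" can mean in practice, assuming a generic completion API - `call_llm`, the file name, and the `NOTES:` convention are all hypothetical stand-ins, not any particular agent's actual mechanism:

```python
from pathlib import Path

# Hypothetical convention: a persistent scratchpad file carried between runs.
NOTES = Path("agent-notes.md")

def run_step(task: str, call_llm) -> str:
    """One agent step; call_llm is a stand-in for whatever model API you use."""
    notes = NOTES.read_text() if NOTES.exists() else ""
    answer = call_llm(
        "Notes from previous runs:\n" + notes +
        "\n\nTask: " + task +
        "\n\nAfter answering, emit an updated 'NOTES:' section to carry forward."
    )
    reply, _, new_notes = answer.partition("NOTES:")
    if new_notes:
        NOTES.write_text(new_notes.strip())  # prune and persist the tidy context
    return reply.strip()
```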
This is a joke, right? There are complex systems that exist today that were built exclusively via AI. Is that not obvious?
The existence of such complex systems IS proof. I don't understand how people walk around claiming there's no proof? Really?
It is impossible to prove or disprove because if everything DOES NOT work fine, you can always say that the prompts were bad, the agent was misconfigured, the model was old, etc. And if it DOES work, then all of the above was done correctly - but without any decent definition of what correct means.
If a program works, it means it's correct. If we know it's correct, we must have a definition of what correct means; otherwise how could we classify anything as "correct" or "incorrect"? Then we can look at the prompts that produced it, and those constitute a "correct" way of prompting the LLM.
Let's say I accept that you, and you alone, have the deep majiks required to use this tool correctly, when major platform devs so far could not. What makes this tool useful? Worth billions of dollars and environment-ruining levels of energy?
I'd say the only real uses for these tools to date have been mass surveillance and, occasionally, semi-useful boilerplate.
It doesn't, that's ego-preserving cope. Saying that this stuff doesn't work for "damn well near every professional" because it doesn't work for you is like a thief saying "everybody else steals, why are you picking on me?" It's not true; it's something you believe to protect your own self-image.
Point me towards something complex that LLMs have contributed to significantly, without massive oversight, where they didn't fuck things up. I'll happily eat my words, with just a single example.
Then on Sunday I woke up and had Claude bang out half a dozen projects, each using this GUI library. First, a script that simply offers to loop a video when the end is reached. Then it updated several of my old scripts that just print text without any graphical formatting. Then, more adventurous: a playlist visualizer with drag-to-reorder support. Another that gives a nice little control overlay for TTS reading of normal media subtitles. Another that lets people select clips from whatever they're watching, reorder them, and write out an edit decision list - maybe I'll turn that one into a complete NLE today when I get home from work.
Reading every line of code? Why? The shit works. If I notice a bug, I go back to Claude and demand a "thoughtful and well reasoned" fix, without even caring what the fix will be so long as it works.
The concepts and building blocks used for all of this are shit I've learned myself the hard way, but to do it all myself would take weeks, and I would certainly take many shortcuts, like skipping animations and only implementing the bare minimum. The reason I could make that stuff work fast is because I already broadly knew the problem space. I've probably read the mpv manpage a thousand times before, so when the agent says it's going to bind to shift+wheel for horizontal scrolling, I can tell it no, mpv has WHEEL_LEFT and WHEEL_RIGHT, use those. I can tell it to pump its brakes and stop planning to load a PNG overlay, because mpv will only load raw pixel data that way. I can tell it that dragging UI elements without simultaneously dragging the whole window certainly must be possible, because the first-party OSC supports it, so it should go read that mess of code and figure it out, which it dutifully does. If you know the problem space, you can get a whole lot done very fast, in a way that demonstrably works. Does it have bugs? I'd eat a hat if it doesn't. They'll get fixed if/when I find them. I'm not worried about it. Reading every line of code is for people writing airliner autopilots, not cheeky little desktop programs.
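For flavor, here's roughly what that wheel-binding correction looks like - a sketch assuming the python-mpv bindings (the original scripts may well use mpv's Lua API instead); the file name and the `seek` actions are just illustrative:

```python
import mpv  # python-mpv bindings, assumed installed

player = mpv.MPV(input_default_bindings=True, input_vo_keyboard=True, osc=True)
player.play("clip.mkv")  # hypothetical file

# mpv names horizontal scroll directly; no shift+wheel modifier needed.
player.command("keybind", "WHEEL_LEFT", "seek -5")
player.command("keybind", "WHEEL_RIGHT", "seek 5")

player.wait_for_playback()
```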
I think it's fair to say that you can get a long way with Claude very quickly if you're an individual or part of a very small team working on a greenfield project. Certainly at project sizes up to around 100k lines of code, it's pretty great.
But I've been working at startups off and on since 2024.
My last "big" job was with a company that had a codebase well into the millions of lines of code. And whilst I keep in contact with a bunch of the team there, and I know they do use Claude and other similar tools, I don't get the vibe it's having quite the same impact. And these are very talented engineers, so I don't think it's a skill either.
I think it's entirely possible that Claude is a great tool for bootstrapping and/or for solo devs or very small teams, but becomes considerably less effective when scaled across very large codebases, multiple teams, etc.
For me, on that last point, the jury is out. Hopefully the company I'm working with now grows to a point where that becomes a problem I need to worry about but, in the meantime, Claude is doing great for us.
The skill part is real — giving the agent the right context, breaking tasks into the right size, knowing when to intervene. Most people aren't doing that well and their results reflect it.
But the latent bug problem isn't really a skill issue. It's a property of how these systems work: the agent optimises for making the current test pass, not for building something that stays correct as requirements change. Round 1 decisions get baked in as assumptions that round 3 never questions — and no amount of better prompting fixes that.
The fix isn't better prompting. It's treating agent-generated code with the same scepticism you'd apply to code from a contractor who won't be around to maintain it — more tests, explicit invariants, and not letting the agent touch the architecture without a human reviewing the design first.
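One concrete form an "explicit invariant" can take is a property-based test - a minimal sketch using pytest and Hypothesis, where `apply_discount` and its contract are hypothetical stand-ins for agent-written code:

```python
# An invariant test survives refactors even when individual example tests don't.
from decimal import Decimal
from hypothesis import given, strategies as st

def apply_discount(price: Decimal, pct: int) -> Decimal:  # imagine the agent wrote this
    return price * (Decimal(100 - pct) / 100)

@given(price=st.decimals(min_value=0, max_value=10_000, places=2),
       pct=st.integers(min_value=0, max_value=100))
def test_discount_never_increases_price(price, pct):
    # Invariant: a discount can never raise the price or make it negative,
    # no matter how round 3 reworks round 1's assumptions.
    result = apply_discount(price, pct)
    assert Decimal("0") <= result <= price
```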
> The vibes are not enough. Define what correct means. Then measure.
And how do you define correct feedback? By checking whether the output is correct?
1. Agent context with platform/system idiosyncrasies and how to access tools - this is actually kept pretty minimal - plus a line directing it to the plan document.
2. A plan document describing how to make changes to the repo and the work that needs to be done. This is a living document, pruned by the orchestrating agent. Included in it is a directive, written by you, to update the document after every run. Also here is a guide on the benchmarking, regression, and unit tests that need to be performed every time.
2a. When an agent has a code change, it is analyzed by a council of subagents, each focused on a different area - for example security, maintainability, system architecture, business domain expertise. I encourage these to be adversarial, "red team". We sit in the core loop until the code changes pass the council (a sketch of this loop follows the list).
2b. Additional subagents to create documentation, build architecture diagrams, etc.
2c. A suggested workflow is created for independently invoking tests, subagents, etc.
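A minimal sketch of the council loop from 2a - `run_agent`, the role names, and the PASS/FAIL protocol are hypothetical stand-ins for however you actually invoke subagents:

```python
# Council review loop: the change only lands once every reviewer signs off.
COUNCIL = ["security", "maintainability", "system architect", "business domain expert"]

def council_approves(diff: str, run_agent) -> bool:
    """Every adversarial reviewer must approve before the change lands."""
    for role in COUNCIL:
        verdict = run_agent(
            role=f"adversarial red-team {role} reviewer",
            prompt=f"Review this change. Reply PASS or FAIL with reasons:\n{diff}",
        )
        if not verdict.startswith("PASS"):
            return False
    return True

def core_loop(task: str, run_agent) -> str:
    while True:
        diff = run_agent(role="implementer", prompt=task)
        if council_approves(diff, run_agent):
            return diff
        task += "\nAddress the council's objections and try again."
```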
It's really amazing, we've crossed a threshold, and I don't know what that means for our jobs.
> Another AI agent. This one is awesome, though, and very secure.
It isn't secure. It took me less than three minutes to find a vulnerability. Start engaging with your own code; it isn't as good as you think it is.
Edit: out of curiosity I had Kimi "red team" it; it found the main critical vulnerability I did, and several others:
Severity - Count - Categories
Critical - 2 - SQL Injection, Path Traversal
High - 4 - SSRF, Auth Bypass, Privilege Escalation, Secret Exposure
Medium - 3 - DoS, Information Disclosure, Injection
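To make the Critical row concrete, this is the textbook shape of the SQL-injection class - a hedged sketch with a hypothetical table and query, not the project's actual code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")

user_id = "1 OR 1=1"  # attacker-controlled input

# The injectable pattern the scan flags: input is spliced into the SQL itself.
rows = conn.execute(f"SELECT * FROM users WHERE id = {user_id}").fetchall()  # dumps every row

# Parameterized version: the driver treats the value as data, never as SQL.
rows = conn.execute("SELECT * FROM users WHERE id = ?", (user_id,)).fetchall()
```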
You need to sit down and really think about what people who do know what they're doing are saying. You're going to get yourself into deep trouble with this. I'm not a security specialist - I take a recreational interest in security - and LLMs are by no means expert. A human with skill and intent would, I'd gamble, be able to fuck your shit up in a major way.