This sounds like arguing you can use these models to beat a game of whack-a-mole if you just know all the unknown unknowns and prompt it correctly about them.
This is an assertion that is impossible to prove or disprove.
I rarely have blocks of "flow time" to do focused work. With LLMs I can keep progressing in parallel, and when I do get a block of time where I can actually dive deep, it's review and guidance again - focus on the high-impact stuff instead of the noise.
I don't think I'm any faster with this than my theoretical speed. LLMs spend a lot of time rebuilding context between steps; I have a feeling the current generation of agents is terrible at maintaining context for larger tasks. I also suspect the advertised model context length is quite a lie - they might support working with 100k tokens, but agents keep reloading stuff into context because old material gets ignored.
In practice I can get more done because I can get into the flow, and back onto the task, a lot faster. We'll see how this pans out long term, but in my current role I don't think there are alternatives; my performance would be shit otherwise.
No, but they can take "notes" and can load those notes into context. That does work, but it's of course not as easy as it is with humans.
It is all about cleaning up and maintaining a tidy context.
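A minimal sketch of what "notes" can mean in practice, assuming a generic completion API - `call_llm`, the file name, and the `NOTES:` convention are all hypothetical stand-ins, not any particular agent's actual mechanism:

```python
from pathlib import Path

# Hypothetical convention: a persistent scratchpad file carried between runs.
NOTES = Path("agent-notes.md")

def run_step(task: str, call_llm) -> str:
    """One agent step; call_llm is a stand-in for whatever model API you use."""
    notes = NOTES.read_text() if NOTES.exists() else ""
    answer = call_llm(
        "Notes from previous runs:\n" + notes +
        "\n\nTask: " + task +
        "\n\nAfter answering, emit an updated 'NOTES:' section to carry forward."
    )
    reply, _, new_notes = answer.partition("NOTES:")
    if new_notes:
        NOTES.write_text(new_notes.strip())  # prune and persist the tidy context
    return reply.strip()
```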
This is a joke, right? There are complex systems that exist today that were built exclusively via AI. Is that not obvious?
The existence of such complex systems IS proof. I don't understand how people walk around claiming there's no proof? Really?
It is impossible to prove or disprove because if everything DOES NOT work fine, you can always say that the prompts were bad, the agent was misconfigured, the model was old, etc. And if it DOES work, then all of the above was done correctly - but without any decent definition of what correct means.
If a program works, it means it's correct. If we know it's correct, we must have a definition of what correct means; otherwise how could we classify anything as "correct" or "incorrect"? Then we can look at the prompts that produced it, and those constitute a "correct" way of prompting the LLM.
Let's say I accept that you, and you alone, have the deep majiks required to use this tool correctly, when major platform devs so far could not. What makes this tool useful? Worth billions of dollars and environment-ruining levels of energy?
I'd say the only real uses for these tools to date have been mass surveillance and, occasionally, semi-useful boilerplate.
It doesn't, that's ego-preserving cope. Saying that this stuff doesn't work for "damn well near every professional" because it doesn't work for you is like a thief saying "everybody else steals, why are you picking on me?" It's not true; it's something you believe to protect your own self-image.
Point me towards something complex that LLMs have contributed to significantly, without massive oversight, where they didn't fuck things up. I'll happily eat my words, with just a single example.
Then on Sunday I woke up and had Claude bang out half a dozen projects, each using this GUI library. First, a script that simply offers to loop a video when the end is reached. Then it updated several of my old scripts that just print text without any graphical formatting. Then, more adventurous: a playlist visualizer with drag-to-reorder support. Another that gives a nice little control overlay for TTS reading of normal media subtitles. Another that lets people select clips from whatever they're watching, reorder them, and write out an edit decision list - maybe I'll turn that one into a complete NLE today when I get home from work.
Reading every line of code? Why? The shit works. If I notice a bug, I go back to Claude and demand a "thoughtful and well reasoned" fix, without even caring what the fix will be so long as it works.
The concepts and building blocks used for all of this are shit I've learned myself the hard way, but to do it all myself would take weeks, and I would certainly take many shortcuts, like skipping animations and only implementing the bare minimum. The reason I could make that stuff work fast is because I already broadly knew the problem space. I've probably read the mpv manpage a thousand times before, so when the agent says it's going to bind to shift+wheel for horizontal scrolling, I can tell it no, mpv has WHEEL_LEFT and WHEEL_RIGHT, use those. I can tell it to pump its brakes and stop planning to load a PNG overlay, because mpv will only load raw pixel data that way. I can tell it that dragging UI elements without simultaneously dragging the whole window certainly must be possible, because the first-party OSC supports it, so it should go read that mess of code and figure it out, which it dutifully does. If you know the problem space, you can get a whole lot done very fast, in a way that demonstrably works. Does it have bugs? I'd eat a hat if it doesn't. They'll get fixed if/when I find them. I'm not worried about it. Reading every line of code is for people writing airliner autopilots, not cheeky little desktop programs.
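For flavor, here's roughly what that wheel-binding correction looks like - a sketch assuming the python-mpv bindings (the original scripts may well use mpv's Lua API instead); the file name and the `seek` actions are just illustrative:

```python
import mpv  # python-mpv bindings, assumed installed

player = mpv.MPV(input_default_bindings=True, input_vo_keyboard=True, osc=True)
player.play("clip.mkv")  # hypothetical file

# mpv names horizontal scroll directly; no shift+wheel modifier needed.
player.command("keybind", "WHEEL_LEFT", "seek -5")
player.command("keybind", "WHEEL_RIGHT", "seek 5")

player.wait_for_playback()
```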
I think it's fair to say that you can get a long way with Claude very quickly if you're an individual or part of a very small team working on a greenfield project. Certainly at project sizes up to around 100k lines of code, it's pretty great.
But I've been working at startups off and on since 2024.
My last "big" job was with a company that had a codebase well into the millions of lines of code. And whilst I keep in contact with a bunch of the team there, and I know they do use Claude and other similar tools, I don't get the vibe it's having quite the same impact. And these are very talented engineers, so I don't think it's a skill either.
I think it's entirely possible that Claude is a great tool for bootstrapping and/or for solo devs or very small teams, but becomes considerably less effective when scaled across very large codebases, multiple teams, etc.
For me, on that last point, the jury is out. Hopefully the company I'm working with now grows to a point where that becomes a problem I need to worry about but, in the meantime, Claude is doing great for us.
The skill part is real — giving the agent the right context, breaking tasks into the right size, knowing when to intervene. Most people aren't doing that well and their results reflect it.
But the latent bug problem isn't really a skill issue. It's a property of how these systems work: the agent optimises for making the current test pass, not for building something that stays correct as requirements change. Round 1 decisions get baked in as assumptions that round 3 never questions — and no amount of better prompting fixes that.
The fix isn't better prompting. It's treating agent-generated code with the same scepticism you'd apply to code from a contractor who won't be around to maintain it — more tests, explicit invariants, and not letting the agent touch the architecture without a human reviewing the design first.
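One concrete form an "explicit invariant" can take is a property-based test - a minimal sketch using pytest and Hypothesis, where `apply_discount` and its contract are hypothetical stand-ins for agent-written code:

```python
# An invariant test survives refactors even when individual example tests don't.
from decimal import Decimal
from hypothesis import given, strategies as st

def apply_discount(price: Decimal, pct: int) -> Decimal:  # imagine the agent wrote this
    return price * (Decimal(100 - pct) / 100)

@given(price=st.decimals(min_value=0, max_value=10_000, places=2),
       pct=st.integers(min_value=0, max_value=100))
def test_discount_never_increases_price(price, pct):
    # Invariant: a discount can never raise the price or make it negative,
    # no matter how round 3 reworks round 1's assumptions.
    result = apply_discount(price, pct)
    assert Decimal("0") <= result <= price
```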
> The vibes are not enough. Define what correct means. Then measure.
And how do you define correct feedback? By checking whether the output is correct?
1. Agent context with platform/system idiosyncrasies and how to access tools - this is actually kept pretty minimal - plus a line directing it to the plan document.
2. A plan document describing how to make changes to the repo and the work that needs to be done. This is a living document, pruned by the orchestrating agent. Included in it is a directive, written by you, to update the document after every run. Also here is a guide on the benchmarking, regression, and unit tests that need to be performed every time.
2a. When an agent has a code change, it is analyzed by a council of subagents, each focused on a different area - for example security, maintainability, system architecture, business domain expertise. I encourage these to be adversarial, "red team". We sit in the core loop until the code changes pass the council (a sketch of this loop follows the list).
2b. Additional subagents to create documentation, build architecture diagrams, etc.
2c. A suggested workflow is created for independently invoking tests, subagents, etc.
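A minimal sketch of the council loop from 2a - `run_agent`, the role names, and the PASS/FAIL protocol are hypothetical stand-ins for however you actually invoke subagents:

```python
# Council review loop: the change only lands once every reviewer signs off.
COUNCIL = ["security", "maintainability", "system architect", "business domain expert"]

def council_approves(diff: str, run_agent) -> bool:
    """Every adversarial reviewer must approve before the change lands."""
    for role in COUNCIL:
        verdict = run_agent(
            role=f"adversarial red-team {role} reviewer",
            prompt=f"Review this change. Reply PASS or FAIL with reasons:\n{diff}",
        )
        if not verdict.startswith("PASS"):
            return False
    return True

def core_loop(task: str, run_agent) -> str:
    while True:
        diff = run_agent(role="implementer", prompt=task)
        if council_approves(diff, run_agent):
            return diff
        task += "\nAddress the council's objections and try again."
```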
It's really amazing, we've crossed a threshold, and I don't know what that means for our jobs.
> Another AI agent. This one is awesome, though, and very secure.
It isn't secure. It took me less than three minutes to find a vulnerability. Start engaging with your own code; it isn't as good as you think it is.
Edit: out of curiosity I had Kimi "red team" it; it found the main critical vulnerability I did, and several others:
Severity - Count - Categories
Critical - 2 - SQL Injection, Path Traversal
High - 4 - SSRF, Auth Bypass, Privilege Escalation, Secret Exposure
Medium - 3 - DoS, Information Disclosure, Injection
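To make the Critical row concrete, this is the textbook shape of the SQL-injection class - a hedged sketch with a hypothetical table and query, not the project's actual code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")

user_id = "1 OR 1=1"  # attacker-controlled input

# The injectable pattern the scan flags: input is spliced into the SQL itself.
rows = conn.execute(f"SELECT * FROM users WHERE id = {user_id}").fetchall()  # dumps every row

# Parameterized version: the driver treats the value as data, never as SQL.
rows = conn.execute("SELECT * FROM users WHERE id = ?", (user_id,)).fetchall()
```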
You need to sit down and really think about what people who do know what they're doing are saying. You're going to get yourself into deep trouble with this. I'm not a security specialist - I take a recreational interest in security - and LLMs are by no means expert. A human with skill and intent would, I'd gamble, be able to fuck your shit up in a major way.