The more expensive and less sexy option is to actually make testing easier (both programmatically and manually), write more tests and more levels of tests, and spend time reducing code complexity. The problem, I think, is people don't get promoted for preventing issues.
The key to making this scalable is to make as few parts as possible critical, and make the potential bad outcomes as benign as possible. (This lets you go to a lower rating in whatever safety standard applies to your industry.) You still need tests for the less critical parts though, while downtime is better than injury, if you want to sell future machines to your customers you need to have a good track record. At least if you don't want to compete on cost.
This is a good lesson for anyone I think. Definitely something I’m going to think more about. Thanks for sharing!
If you told someone "I don't trust you, run all code by me first" it wouldn't go well. If you tell them "everyone's code gets reviewed" they're ok with it.
they do - but only after a company has been burned hard. They also can be promoted for their area being enough better that everyone notices.
still the best way to a promotion is write a major bug that you can come in at the last moment and be the hero for fixing.
Two years afterward, we got hit with ransomware. And obviously "I told you so" isn't a productive discussion topic at that point.
cleaning up structural issues across a couple orgs is a senior => principal promo ive seen a couple of times
Unchecked, AI models output code that is as buggy as it is inefficient. In smaller green field contexts, it's not so bad, but in a large code base, it's performs much worse as it will not have access to the bigger picture.
In my experience, you should be spending something like 5-15X the time the model takes to implement a feature on reviewing and making it fix its errors and inefficiencies. If you do that (with an expert's eye), the changes will usually have a high quality and will be correct and good.
If you do not do that due dilligence, the model will produce a staggering amount of low quality code, at a rate that is probably something like 100x what a human could output in a similar timespan. Unchecked, it's like having a small army of the most eager junior devs you can find going completely fucking ape in the codebase.
What do the relatively hands-off "it can do whole features at a time" coding systems need to function without taking up a shitload of time in reviews? Great automated test coverage, and extensive specs.
I think we're going to find there's very little time-savings to be had for most real-world software projects from heavy application of LLMs, because the time will just go into tests that wouldn't otherwise have been written, and much more detailed specs that otherwise never would have been generated. I guess the bright-side take of this is that we may end up with better-tested and better-specified software? Though so very much of the industry is used to skipping those parts, and especially the less-capable (so far as software goes) orgs that really need the help and the relative amateurs and non-software-professionals that some hope will be able to become extremely productive with these tools, that I'm not sure we'll manage to drag processes & practices to where they need to be to get the most out of LLM coding tools anyway. Especially if the benefit to companies is "you will have better tests for... about the same amount of software as you'd have written without LLMs".
We may end up stuck at "it's very-aggressive autocomplete" as far as LLMs' useful role in them, for most projects, indefinitely.
On the plus side for "AI" companies, low-code solutions are still big business even though they usually fail to deliver the benefits the buyer hopes for, so there's likely a good deal of money to be made selling companies LLM solutions that end up not really being all that great.
Code is the most precise specification we have for interfacing with computers.
So I expect over time we will see genuine performance improvements, but Amdahl's law dictates it won't be as much as some people and ceo's are expecting.
Writing tests to ensure a program is correct is the same problem as writing a correct program.
Evaluating conformance is a different category of concern from ensuring correctness. Tests are about conformance not correctness.
Ensuring correct programs is like cleaning in the sense that you can only push dirt around, you can't get rid of it.
You can push uncertainty around and but you can't eliminate it.
This is the point of Gödel's theorem. Shannon's information theory observes similar aspects for fidelity in communication.
As Douglas Adams noted: ultimately you've got to know where your towel is.
One thing I hope we'll all collectively learn from this is how grossly incompetent the elite managerial class has become. They're destroying society because they don't know what to do outside of copying each other.
It has to end.
For fairly straightforward changes it's probably a wash, but ironically enough it's often the trickier jobs where they can be beneficial as it will provide an ansatz that can be refined. It's also very good at tedious chores.
People seem to gloss over this... As a CEO if people don't function like this I'd be awake at night sweating.
Which results the software engineering issue I’m not seeing addressed by the hype: bugs cost tens to hundreds of times their coding cost to resolve if they require internal or external communication to address. Even if everyone has been 10x’ed, the math still strongly favours not making mistakes in the first place.
An LLM workflow that yields 10x an engineer but psychopathically lies and sabotages client facing processes/resources once a quarter is likely a NNPP (net negative producing programmer), once opportunity and volatility costs are factored in.
The math depends on importance of the software. A mistake in a typical CRUD enterprise app with 100 users has zero impact on anything. You will fix it when you have time, the important thing is that the app was delivered in a week a year ago and was solving some problem ever since. It has already made enormous profit if you compare it with today’s (yesterday’s ?) manual development that would take half a year and cost millions.
A mistake in a nuclear reactor control code would be a total different thing. Whatever time savings you made on coding are irrelevant if it allowed for a critical bug to slip through.
Between the two extremes you thus have a whole spectrum of tasks that either benefit or lose from applying coding with LLMs. And there are also more axes than this low to high failure cost, which also affect the math. For example, even non-important but large app will likely soon degrade into unmanageable state if developed with too little human intervention and you will be forced to start from scratch loosing a lot of time.
We as an industry have been able to offload a lot of “how” via deterministic systems built by humans with expert understanding. LLMs give you the illusion of this.
1. I spoke to sales to find out about the customer
2. I read every line of the contract (SOW)
3. I did the initial requirements gathering over a couple of days with the client - or maybe up to 3 weeks
3. I designed every single bit of AWS architecture and code
4. I did the design review with the client
5. I led the customer acceptance testing
> We as an industry have been able to offload a lot of “how” via deterministic systems built by humans with expert understanding. LLMs
I assure you the mid level developers or god forbid foreign contractors were not “experts” with 30 years of coding experience and at the time 8 years of pre LLM AWS experience. It’s been well over a decade - ironically before LLMs - that my responsibility was only for code I wrote with my own two hands
I’m not saying trusting cheap devs is a good idea either. I do think cheap devs are actually at risk here.
I didn’t blindly trust the Salesforce consultants either. I also didn’t verify every line of oSql (not a typo) they wrote.
I disagree, in the sense that an engineer who knows how to work with LLMs can produce code which only needs light review.
* Work in small increments
* Explicitly instruct the LLM to make minimal changes
* Think through possible failure modes
* Build in error-checking and validation for those failure modes
* Write tests which exercise all paths
This is a means to produce "viable" code using an LLM without close review. However, to your point, engineers able to execute this plan are likely to be pretty experienced, so it may not be economically viable.
The gains are especially notable when working in unfamiliar domains. I can glance over code and know "if this compiles and the tests succeed, it will work", even if I didn't have the knowledge to write it myself.
... Errr... Yeah, that's not a great approach, unless you are defining 'work' extremely vaguely.
I still make an effort to understand the generated code. If there’s a section I don’t get, I ask the LLM to explain it.
Most of the time it’s just API conventions and idioms I’m not yet familiar with. I have strong enough fundamentals that I generally know what I’m trying to accomplish and how it’s supposed to work and how to achieve it securely.
For example, I was writing some backend code that I knew needed a nonce check but I didn’t know what the conventions were for the framework. So I asked the LLM to add a nonce check, then scanned the docs for the code it generated.
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
>When developers are allowed to use AI tools, they take 19% longer to complete issues—a significant slowdown that goes against developer beliefs and expert forecasts. This gap between perception and reality is striking: developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.
If we're being honest with ourselves, it's not making devs work faster. It at best frees their time up so they feel more productive.
I'd like to think that I have this under control because the methodology of working in small increments helps me to recognize when I've gotten stuck in an eddy, but I'll have to watch out for it.
I still maintain that the LLM is saving me time overall. Besides helping in unfamiliar domains, it's also faster than me at leaf-node tasks like writing unit tests.
Yes, code produced this way will have bugs, especially of the "unknown unknown" variety — but so would the code that I would have written by hand.
I think a bigger factor contributing to unforeseen bugs is whether the LLM's code is statistically likely to be correct:
* Is this a domain that the LLM has trained on a lot? (i.e. lots of React code out there, not much in your home-grown DSL)
* Is the codebase itself easy to understand, written with best practices, and adhering to popular conventions? Code which is hard for humans to understand is also hard for an LLM to understand.
It introduces unnecessary indirection, additional abstractions, fails to re-use code. Humans do this too, but AI models can introduce this type of architectural rot much faster (because it's so fast), and humans usually notice when things start to go off the rails, whereas an AI model will just keep piling on bad code.
---
applyTo: '**'
---
By default:
Make the smallest possible change.
Do not refactor existing code unless I explicitly ask.
Under this, Claude Opus at least produces pretty reliable code with my methodology even under surprisingly challenging circumstances, and recent ChatGPTs weren't bad either (though I'm no longer using them). Less powerful LLMs struggle, though.But I would never do the same for Azure.
"Seniors will do expert review" will slowly collapse.
No one cares about handcrafted artisanal code as long as it meets both functional and non functional requirements. The minute geeks get over themselves thinking they are some type of artists, the happier they will be.
I’ve had a job that requires coding for 30 years and before ther I was hobbyist and I’ve worked for from everything from 60 person startups to BigTech.
For my last two projects (consulting) and my current project, while I led the project, got the requirements, designed the architecture from an empty AWS account (yes using IAC) and delivered it. I didn’t look at a line of code. I verified the functional and non functional requirements, wrote the hand off documentation etc.
The customer is happy, my company is happy, and I bet you not a single person will ever look at a line of code I wrote. If they do get a developer to take it over, the developer will be grateful for my detailed AGENTS.md file.
We know from experimentation that agents will change anything that isn’t nailed down. No natural language spec or test suite has ever come close to fully describing all observable behaviors of a non-trivial system.
This means that if no one is reviewing the code, agents adding features will change observable behaviors.
This gets exposed to users as churn, jank, and broken work flows.
2. Assuming that techniques that work with human developers that have severely impaired judgement but are massively faster at producing code is a bad idea.
3. There’s no way you have enough experience with maintaining code written in this way to confidently hand wave away concerns.
So many people on HN are so insulted that the people who put money in our bank accounts and in some cases stock in our brokerage accounts ever cared about their bespoke clean code, GOF patterns and they never did. LLM just made it more apparent.
It’s always been dumb for PR to be focused on for loops vs while loops instead of focusing on whether functional and non functional requirements are met
Speak for yourself. I don't hire people like you.
Even in late 2023 with the shit show of the current market, I had no issues having multiple offers within three weeks just by reaching out to my network and companies looking for people with my set of skills.
You sound like a bozo, I can sniff it through my screen.
Guess what? I also stopped caring how registers are used and counting clock cycles in my assembly language code like it’s the 80s and I’m still programming on a 1Mhz 65C02
But do you look at any of the AI output? Or is it just "it works, ship it"?
What I checked.
1. The bash shell scripts I had it write as my integration test suite
2. To make sure it wasn’t loading the files into Postgres the naive way -loading the file from S3 and doing bulk inserts instead of using the AWS extension that lets it load directly from S3. It’s the differ xe between taking 20 minutes and 20 seconds.
3. I had strict concurrency and failure recovery requirements. I made sure it was done the right way.
4. Various security, logging, log retention requirements
What I didn’t look at - a line of the code for the web admin site. I used AWS Cognito for authentication and checked to make sure that unauthorized users couldn’t use the website. Even that didn’t require looking at the code - I had automated tests that tested all of the endpoints.
I've witnessed human developers produce incredibly convoluted, slow "ETL pipelines" that took 10+ minutes to load single digit megabytes of data. It could've been reduced to a shell script that called psql \copy.
Hell, often it feels slower/worse. Foreign code is easily confusing at first, which slows you down - and bad code quickly gets bewildering and sends you down paths of clarifications that waste time.
Then often it blows up in production. Makes me almost want to blanket reject PRs for being too difficult to understand. Hand written code almost has an aversion to complexity, you'd search around for existing examples, libraries, reusable components, or just a simpler idea before building something crazy complex. While with AI you can spit out your first idea quickly no matter how complex or flawed the original concept was.
It's actually often harder to fix something sloppy than to write it from scratch. To fix it, you need to hold in your head both the original, the new solution, and calculate the difference, which can be very confusing. The original solution can also anchor your thinking to some approach to the problem, which you wouldn't have if you solve it from scratch.
If AI is a productivity boost and juniors are going to generate 10x the PRs, do you need 10x the seniors (expensive) or 1/10th the juniors (cost save).
A reminder that in many situations, pure code velocity was never the limiting factor.
Re: idiot prooofing I think this is a natural evolution as companies get larger they try to limit their downside & manage for the median rather than having a growth mindset in hiring/firing/performance.
I suspect that isn't the goal.
Review by more senior people shifts accountability from the Junior to a Senior, and reframes the problem from "Oh dear, the junior broke everything because they didn't know any better" to "Ah, that Senior is underperforming because they approved code that broke everything."
Especially in a big co like Amazon, most senior engineers are box drawers, meeting goers, gatekeepers, vision setters, org lubricants, VP's trustees, glorified product managers, and etc. They don't necessarily know more context than the more junior engineers, and they most likely will review slowly while uncovering fewer issues.
Whether or not these productivity gains are realized is another question, but spreadsheet based decision makers are going to try.
Also - the definition of Senior will change, and a lot of current Seniors will not transition, while plenty of Juniors that put in a lot of time using code agents will transition.
But will they? I'm not at all convinced that babysitting an AI churning out volumes of code you don't understand will help you acquire the knowledge to understand and debug it.
American corporate culture has decided that training costs are someone else’s problem. Since every corporation acts this way it means all training costs have been pushed onto the labor market. Combine that with the past few decades of “oops, looks like you picked the wrong career that took years of learning and/or 10 to 100s of thousands of dollars to acquire but we’ve obsoleted that field” and new entrants into the labor market are just choosing not to join.
Take trucking for example. For the past decade I’ve heard logistics companies bemoan the lack of CDL holders, while simultaneously gleefully talk about how the moment self driving is figured out they are going to replace all of them.
We’re going to be outpaced by countries like China at some point because we’re doing the industrial equivalent of eating our seed corn and there is seemingly no will to slow that trend down, much less reverse it.
I know I'm probably coming across as a lunatic lately on HN but I really do think we're on the path towards violence thanks to AI
You just cannot destroy this many people's livelihoods without backlash. It's leading nowhere good
But a handful of people are getting stupidly rich/richer so they'll never stop
If you look at the luddite rebellion they weren't actually against industrial technology like looms. They were against being told they weren't needed anymore and thrown to the wolves because of the machines.
The rich have forgotten they are made of meat and/or are planning on returning to feudalism ala Yarvin, Thiel, Musk, and co's politics.
I guess that makes me a modern luddite then
A software engineer luddite
A techno-luddite if you will
Maybe I have a new username
Maybe I don't have the correct mental model for how the typical junior engineer thinks though. I never wanted to bug senior people and make demands on their time if I could help it.
With a layout of 4 juniors, 5 intermediates, and 0-1 senior per team, putting all the changes through senior engineer review means you mostly wont be able to get CRs approved.
I guess it could result in forcing everyone who's sandbagging as intermediate instead of going to senior to have to get promoted?
My manager has been urging us to truly vibe code, just yesterday saying that "language is irrelevant because we've reached the point where it works - so you don't need to see it." This article is a godsend; I'll take this flawed silver bullet any day of the week.
I’m probably not going to review a random website built by someone except for usability, requirements and security.
I also said senior review is valuable, but I'm not 100% sure if you're implying I didn't.
The other problem is that the type of errors LLMs make are different than juniors. There are huge sections of genuinely good code. So the senior gets "review fatigue" because so much looks good they just start rubber stamping.
I use an automated pipeline to generate code (including terraform, risking infrastructure nukes), and I am the senior reviewer. But I have gates that do a whole range of checks, both deterministic and stochastic, before it ever gets to me. Easy things are pushed back to the LLM for it to autofix. I only see things where my eyes can actually make a difference.
Amazon's instinct is right (add a gate), but the implementation is wrong (make it human). Automated checks first, humans for what's left.
1. They can assess whether the use of AI is appropriate without looking in detail. E.g. if the AI changed 1000 lines of code to fix a minor bug, or changed code that is essential for security.
2. To discourage AI use, because of the added friction.
I hear “x tool doesn’t really work well” and then I immediately ask: “does someone know how to use it well?” The answer “yes” is infrequent. Even a yes is often a maybe.
The problem is pervasive in my world (insurance). Number-producing features need to work in a UX and product sense but also produce the right numbers, and within range of expectations. Just checking the UX does what it’s supposed to do is one job, and checking the numbers an entirely separate task.
I don’t many folks that do both well.
I would actually say having at least 2 people on any given work item should probably be the norm at Amazon's size if you also want to churn through people as Amazon does and also want quality.
Doing code reviews are not as highly valued in terms of incentives to the employees and it blocks them working on things they would get more compensation for.
We need smart people at every layer. If leadership isn't in that category, it spreads to all layers.
I don't know how we defeat capitalism to incentivize smart leadership. It's fundamentally opposed to market forces.
So you're saying that peer reviews are a waste of time and only idiots would use/propose them?
To partially clarify: "Idiot proof" is a broad concept that here refers specifically to abstraction layers, more or less (e.g. a UI framework is a little "idiot proof"; a WYSIWYG builder is more "idiot proof"). With AI, it's complicated, but bad leadership is over-interpreting the "idiot proof" aspects of it. It's a phrase, not an insult to users of these tools.