Except this was not the case, it had of course hallucinated what the regulation actually required (I know this because the code in question had already been reviewed by human counsel). This is (supposedly) the most bleeding-edge model available.
We use a lot of genAI to help us write code, but there is no way in the mid-term we could ever rely on these tools to actually build compliant financial products. We'd have to be totally mad. Yes, lots of Fintech companies are using these agents to accelerate, but anyone who's using them to actually ship product without a human actually digging into it is opening themselves up to a world of risk.
My guess is the model makes the same mistakes as the programmers: taking 'rules' literally, unaware of sectoral joint understanding, validated interpretations and habits. (btw. this is often on the non-tech side also a difference between regulatory and legal. The former are much more result oriented while the latter are primarily risk averse.
My experience as IT in modern banks was the opposite. The legal department were absolute assholes when it came to software features. And I'm talking absolutely 100% ok features, like paying your bills from the banking application.
The least fun, trigger happy, cover their buts people I've ever seen.
Like all they could ever say was NO. I guess they were heavily incentivized to just say NO to everything.
I think adherence to regulation and compliance is nothing to do with whether you're a SWE, a risk officer, or C-level, and everything to do with your own principles, ethics, professional attitude, and pragmatism.
IME this is less the fault of IT and more so bad auditors that won't consider, or just don't understand, what compensating controls are. If it doesn't meet their little checklist exactly, they fail the audit.
This is such a nonsensical claim. If a company is asking someone from IT to read the regulations and implement them, then obviously you’re going to get something that conforms to the written specification they were provided.
But a company that does that is basically delegating both compliance and legal functions to IT. No sane company does that.
As an enterprise architect, these are all part of the meetings you have with compliance when you are working on major projects. I have had the privilege of working with some excellent compliance officers, and they are the opposite of the nay-saying caricature that is often painted of them. I found these people to be extremely creative and helpful, working together towards solutions rather than stalling or nixing viable progress.
It doesn't feel like we're living in the same world of regulation that existed prior to DOGE.
I'm not implying anything else. I used your own "literal" wording to refer to the "more strict than yours" interpretation.
I suppose I should have used scare quotes around "literal".
Company politics, feudal wars, fiefdom protections, backstabbing and outright sabotaging, now there's a daily occurrence and many minions are cannon fodder in those skirmishes, but they usually stay clear of regulatory issues minefields.
If the company you work for actually had such a no-fault culture, I doubt you'd be criticizing programmers so aggressively for being sticklers, but would instead be trying to understand and account for the systemic factors (including human factors) behind their behavior.
I don't see why developers should be in trouble. Developers don't make unilateral decisions on non-trivial compliance matters. A finding of non-compliance at a financial institution would typically be the result of an investigation, a disagreement with the regulator or a court ruling. It would come years after the organisation as a whole decided to adopt the interpretation in question.
Engineers are not shielded by their implementer role if they participate in illegal activity. James Robert Liang was a rank-and-file engineer for Volkswagen and he got jailed for his role the VW emissions scandal[1].
No matter how much an enterprise architect or compliance officer promises "it'll be fine" to the developer, the developer needs documented CYA. An enlightened organization would perhaps find ways to expedite that CYA documentation rather than demonizing programmers as a class.
[1] https://apnews.com/general-news-988ea2ae45694b37b320e68cefe3...
Then the rules should enumerate all the ways. From your posts, you come across as if programmers don't know what they are doing which is insulting to those who work in mission critical industries like aviation where a programmer could be criminally charged if he/she didn't implement the specs STRICTLY.
Is neither what I said nor believe.
There’s a reason it’s called “judgement”
My point was simply that it's easy to scoff at someone else being careful if it's their neck and not yours.
I've seen accidental non-compliance. I've seen what I would call negligent compliance, where a company attempted to be compliant but didn't meet full, correct compliance (one example I've seen is that a company assigned resources to compliance and forgot to increase resources as workload increased, causing them to be increasingly behind on compliance work), but I've never seen a company that just decided to pretend to be compliant knowing that they were not.
Security, GDPR, backups, build pipelines, disaster recovery, most of it will be faked, half-heartedly done once or ignored entirely.
Then there's the more abstract things like scalability, idempotency when integrating with external APIs, error recovery, accessibility, UX, etc.
Almost always that sort of stuff will have been entirely ignored, or there will be a fig leaf over a real mess of misunderstood standards or manual intervention steps.
Startup developers usually have to be generalists as they often wear many hats, so things that need deeper domain knowledge get done to a bare minimum.
But really that particular issue could have been solved by literally just telling it in a markdown file or instructions something like "verify all facts or compliance requirements with web search and include citations in responses".
“Verify all facts and compliance requirements” leaves enormous holes even if you assume the LLM has a concept of facts and requirements (it does not).
What facts? What requirements? For what industry? For what subset of that industry? For what country or countries that you will be doing business in? Are these current “facts” and “requirements” or is the LLM referencing a dusty article from 1992 for which the subject matter has been radically overhauled?
In my job I regularly see small but incredibly important mistakes like this lead to major issues. Some of those are human driven but increasingly the defense of the person responsible has turned into “Claude said it was fine though!”
Additionally, using a specific tool does not suddenly give the model common sense enough to say “this piece of information doesn’t answer the question of whether this solution fits in this specific industry at this time in this place”.
I remember hearing that 10 years ago about self-driving.
We need a lot more basic research into LLMs and also a lot cheaper hardware.
The current batch of LLMs will turn a lot of fields upside down, but not to the tune of $3tn or whatever crazy amounts are being invested right now.
IME people would benefit greatly from the process, albeit tedious and time-consuming, of testing out the same prompt sequence/session with the exact same model multiple times. It becomes clear extremely quickly how capable but unreliable and inconsistent a model can be even when given the same context. If you have ever completed a long, complicated task with an agent and then lost the session and tried doing the same thing again from scratch you may have had the experience of seeing the subtle changes that come up in the model's thinking which lead it to accept or reject certain paths and ignore or incorporate prompt instructions like the one you've provided.
We have long historical experience and innate tools for detecting and mitigating errors made by humans. If we can't apply those to automation, then even fewer total mistakes may end up being a worse outcome.
But the most reasonable take, which I'm happy to see reflected in so many comments in this thread, is… use both.
Do an AI pass, and have humans verify, and vice versa. Let the humans drive the AI. Then the unique shortcomings of each party can be covered by the other's strengths.
It might beat an underresourced human review, on time, efficiency, cost metrics. But on the metric of accuracy, throwing unlimited humans at a problem will still beat throwing unlimited AI at it
You can do that, sure. But doing so negates any improvements in speed the LLM brought. And at that point, you may as well just do it yourself to begin with.
I use GenAI tools when coding a lot, but I do not vibe code. I go through everything it generated, and we iterate. And yes, it doesn't save me a lot of time. But what it does do is free up mental capacity in a similar manner. But instead of syntax, it's more complicated patterns. Maybe I don't remember how to stitch something together, but i know it can be done. Instead of spending the time to look it up and then code it, I just tell it to do it for me.
Or are current AIs too similar for that to be fruitful?
regulation questions. even the simple ones, AI gets all the time wrong. it wasn't Mythos, but other models like opus.
I can adjust the view on this topic if/when we get access to mythos.
Genuine question: your top coder seems to be producing the most error-free code from your perspective, has the deepest knowledge of the architecture and codebase, and is faster on the trigger than the others.
But your top coder has proven and verifiable dementia, where they will confidently assume the existence of apis and code that do not exist, mix up the purpose of others and forget other things, and you can't predict when and how they will introduce errors into the system or the severity of such errors.
Are you really comfortable letting this person with dementia generate most of your codebase in the airline and health industry?
I also hope you have an iron-clad agreement that prevents the model provider from doing silent updates because all your evidence of correctness you collected thus far goes out the window in that case.
Another genuine question:
You have witnessed a human coder and the AI you're using make the same important mistake. Assuming you do not have the time and resources to retrain, fine tume, and test your frontier model:
Who would you trust not to make the same mistake multiple times in the future after you have warned them that their job depends on it, the AI or the human?
The parent is implying they would prefer an AI when working in the airline and health industry because it makes less errors. Read the comment again.
They have not said, "Hey, I work in the airline and health industry and I'd love to use AI for a couple of the bullshit IT UIs we have as long as we can put guardrails on the AI to stay in its lane."
I asked a yes or no question. The guardrails you can put to mitigate errors are the same guardrails pre-AI for the humans (tests, regressions, reviews). If you were wary of employing a top lead engineer with verifiable dementia prior to AI for a mission critical system, logic implies you should think twice giving that much responsibility to an AI as well.
> The hallucination thing I think is mostly overblown
Can you predict when and how the SOTA model will hallucinate? Yes or no. Can you predict the severity impact of that error beforehand? Yes or no.
>from speaking to colleagues it seems to vary wildly depending on which model and harness you are using
You have partially answered my question it would seem.
No, but the same can be said for your colleagues. You might call what the LLM does hallucinations, I'd call them mistakes. I think we have totally forgotten that humans make them all the time and are confidently wrong too.
Your original question, doesn't really get to the bottom of the point I'm trying to make, and I don't really feel it fairly represents the issue we are talking about here. They are not the same things.
Also, if a human does this, you can replace them and get a human who will not do it. The default for an LLM is to generate plausible-looking text that may or may not be completely incoherent. That is not the default for a human. Again, if you find that your colleague consistently fabricates APIs, you can hire someone who isn't crazy instead, but you cannot do the same with LLMs.
That's absolutely false. My collegues don't routinely and confidently invent apis that are not there, or spectacularly and repeatedly misunderstand the purpose of certain functions or exhibit extreme forgetfullness. Especially when I've warned them. Hallucinations and confabulations in otherwise healthy individuals are mental disorders. When I ask them why they made an certain kind of error, I can expect to get a reasonable answer. No one has uttered the phrase "Bob hallucinated again while writing those tests" when the Bob in question is a human.
Calling hallucinations simply mistakes does not seem to me to be a healthy way to reason about LLMs. I can ask a collegue how well they can program in Ada and adjust my expectations on productivity and bug rates. I can't ask an LLM how well they can code in Ada (just a throwaway example), or even how much of Ada was in its training data. I have to actually spend money and spend time code reviewing before I can even formulate any expectations at all.
Well too bad, the problem is that they also produce things much faster than humans so errors will compound quicker.
The problem is that sucks, even if all software engineers keep their jobs and salaries, the floor is still pulled out from under us. Imagine if a surgeons job was to supervise robot surgeons from a remote computer, or a woodworker just signs off on work before the machines do all the cutting and assembly. Sure they still have important jobs in their field but the soul & humanity of their skill is gone.
I would love to be able to say I pay the same amount of attention and am just as diligent and communicate as clearly with an agent, but it wouldn't be honest: I scan agent PRs for obvious mistakes or misinterpretation of what they've implemented.
With human colleagues I usually know them and their style, their way of working, so have a better idea what to look for. You also have a genuine return on providing feedback that helps coworkers learn and improve, whereas with agents, all the feedback you write is gone when the thing gets merged (unless your org has some kind of shared memory for its agents).
I don't have the answer for what the future looks like, but I suspect agent-type-1 reviews agent-type-2 is actually where we'll end up.
For me, AIs have actually made the job more soulful, not less. For one thing, it lets me use the part of my mind that is good at human language, not just the part of my mind that is good at software. This makes the job feel a bit less one-dimensional in terms of what parts of me are engaged while doing it. For another, I find it liberating to no longer have to think much about boilerplate code or to spend time roaming around the Internet looking up documentation of various language syntax and API details, the vast majority of which are arbitrary rather than being based on any kind of mathematical beauty. For me it makes the job more soulful that I can think of the job on a higher level instead of having to spend effort on arbitrary and tedious details.
Of course there is still the question of "will the job even exist in a few years, at least for more than a relatively small number of people?". But that's a separate question. For now at least, I am finding that for me AIs have brought a lot more soul and humanity to the job than it ever had before.
However, if I were just having to do things for the man, I might have a rather different take on all this.
Does the woodworker who shape using a handsaw use less "soul" than the one who uses a machine?
Does the musician who use a DAW and VSTs instead of analogue tape recorders create music with less "soul"?
Does the painter who buys acryllic paint instead of synthesizing their own dye from plants use less "soul"?
As technological innovation progresses, the barrier to creation falls. The process of creating something is not to be conflated with the final piece of art itself.
This isn't like the step from hand saws to power saws, and it's disingenuous to pretend like it is. This is what the startup machine has been doing to every industry... finding... "inefficiencies" and "optimizing" them.
Compare to this to prompting an LLM: “Generate a third person where game with a view from above where you can steal cars, shoot at people, run from the police, etc.” Anybody with access to the tool can do this, and the results are just another uninspiring GTA clone that you would imagine.
The latter is more like a carpenter ordering their “work” from alibaba then it is like using a skill saw.
It's when a woodworker, musician or painter completely outsources their work and just marks what's wrong, sending those parts back. Yes, the final art piece might be the same, but the artist definitely uses less of their "soul".
When industrialization hit, we definitely lost a ton of craftsmanship and craftsman, but a standard Ikea chair is less likely to wobble than the average chair at a much better price (for a random example). Yes, we traded artistry for convenience, but what we really did was bifurcate our needs between "some place stable to sit" from "a beautiful chair for my home". Most people wanted the former more than the latter, and the same applies to software.
If we split the roles into buckets, many woodworkers disappeared, some became artisans, some became designers for industrially-produced products, and some catered to Luddites for a long transitional period. Despite Anthropic's claims, SWEs won't disappear in a year but over a generation or two, no matter how good LLMs become.
Obviously software is much more complicated and integrated into other elements of business, which in a way makes it more vulnerable to AI taking over and in another way will be at the mercy of larger shifts to how businesses organize human roles and responsibilities. What we call "taste" comes down to "intent" - what the hell does a company do? What should it be doing and how should it operate? These will be the only questions that matter and the one thing LLMs can't replace since they will always choose the most default path. So I think human's roles will be to inject intent/taste at different levels of abstraction throughout an organization.
In addition the incentives are misaligned - the "artisan" made chair (in the past) wasn't likely made for aesthetic reasons - it was made to last long term and function. And if it wobbled or had any problems the original woodworker was probably around to fix it.
And this is fine. Developing new software with a really smart intern is the same, you, as an expert, need to bring your experience/expertise on the table to have everything right. Because experience needs time.
Did it do the correct job once you put the regulations doc(s) in the context?
Here's an example of what we will continue to see with folks fully immersed in gen AI psychosis:
"The creator of claude code said that he no longer writes code for about 6 months and now has Claude doing all his work now. He also said recently that he no longer prompts Claude and now has it running in loops and it is self-improving itself and performing better than a human!"
If the code produced by the LLM is perfect, the LLM takes the credit. But when a disaster happens, you cannot blame the LLM and it then falls on the human who did it.
I don't think SWEs heavily vibe-coding with LLMs realize the risk in not understanding what the code the LLM being produced is doing even after generating tests (lol). We will see more of this too. [0]
[0] https://sketch.dev/blog/our-first-outage-from-llm-written-co...
Are people on HN still typing out functions by hand one character at a time?
It would be like a developer in 2020 claiming that he only writes assembly because compilers can’t be trusted. No one is taking that person seriously. If you chose a career in tech you made a decision to work in one of the fastest moving fields in human history. Now it’s time to get over it, learn the new tools and adapt.
Well I use tab completion, of course. And I copy-paste snippets from LLM more often than from SO now. But otherwise not much has changed in my career in the last 5 years. Is this different for you?
I'm not fundamentally opposed to code generation, and I use LLMs for some taks, but I don't see myself vibecoding whole pages of production code. I vibecoded a throwaway note-taking app for myself though.
If the AI is producing what you tell it to, why are you needed?
No, thank you. I have used the new tools, determined that they aren't helpful to me, and set them aside as I would with any other bad tool. I don't feel the need to let hype take the steering wheel.
Exactly. You are free to use openclaw or a coding agent to build a competing bank, hedge-fund, hospital or even a new airliner because the previous ones were built by humans. Surely an AI can do it better by itself.
So why haven't you done it yet?
Yes, me. Yes, I tried LLMs for what I am doing and will try again in few months. No, there was no noticeable or clear improvement over doing it manually.
Yes, I am using some LLMs for some purposes but Claude Code had slight improvement, if any, not worth introducing proprietary dependency.
I work at a big tech company and I don't know a single person that still hand writes code. Most people haven't hand written code for at least half a year now.
I do wonder what sort of bug is making its rounds on HN that people here find this so shocking and unbelievable.
Because we can actually see the disjointed slop that Anthropic produces. And when issues happen, they can't fix them for weeks on end because no one understands what code does anymore, and all of their "hard problems causing issues" they blog about are literally "if we had actual engineers this wouldn't even be an issue to begin with". Like this bullshit they had in spring: https://www.anthropic.com/engineering/april-23-postmortem
> It would be like a developer in 2020 claiming that he only writes assembly because compilers can’t be trusted.
LLMs are not compilers. For a few very obvious reasons I'll leave as an exercise to figure out
The original Mythos release used ASan to filter false-positives so it was able to maintain a good FPR, but when Mythos moves into domains that don't have a readily available oracle to help filter hits, the result is a deluge of false bullshit.
"Make it better" with no additional or reasonable previous explanation of what better might mean.
"AI will figure it out" not for pattern extraction, but for a full blown analysis with equally generic prompt all confidently stated by an executive telling people working it how it works
So the question remains if non-programmers will adapt, the LLMs will accept wider range of input styles, or .. its just another abstraction layer for devs to use.
I've observed this in the wild where someone is iterating with an LLM and giving it only negative feedback. For example responding to edits with "don't make it blue" rather than "keep the existing button shape, and change the color back to green".
The LLM doesn't really come back the way a human would and say "so what color do you want?".. it just, guesses. Now abstract that to more complex tasks.
you take a spec and create tests, every little thing
you use another ai to verify these tests against the spec
you review the tests vs the spec (at one point human review)
you put the tests off limits to change / wall them.
you let the ai write the software that fulfills the tests.
there will be some gaps where you repeat the cycle above
if the tests fulfill the spec, the code will fulfill the spec
A spec detailed enough and unambiguous enough to be translated into machine execution deterministically is called code.
Unlike a compiler, AI can build with a spec that is not detailed enough or unambiguous enough: It does so by filling in the gaps with educated guesses.
This is safe if and only if you take the time to later read the output, understand what its guesses were, and judge wether they were acceptable. No AI can do this for you because the truth lies in your original intentions, which it does not have access to.
The jury is out there on how reliable and time consuming this is vs writing the code yourself; it is not immediately obvious that is faster or requires a smaller cognitive load.
As for whether or not LLMs can write unit tests. The answer is yes.
> The system shall have behavior identical to that expressed by the system created by the following source code. [add some stuff about environment to taste]
Particularly as tokenmaxxing has ended and people are being charged more economic prices. If the pricing 5-10x the way Uber,etc did on the path to profitability.. even more so.
other than there are "internal micro feedback loops" during development?
Doing the above doesn't actually make the model smarter, so, if it couldn't get to correct code with fewer steps, then the light you see at the end of the tunnel is an oncoming train.
The only way to test this is to test it out, in real life. Sometimes people see results, sometimes people don't. Note that yes, I am including the entire iteration process - even after iterating, people still don't see results with AI.
I have had both positive and negative experiences with AI, over multi-week projects. But apparently on hackernews, anything positive about AI is proof that AI is superhuman and taking over, and all follies about AI are lies by stupid humans who secretly have psychological dispositions to fear AI. Sometimes the AI genuinely isn't good enough. Are we not allowed to say that now? We might not know why, but it's just the truth.
The other solution is to formally analyze the entire space of possible actions the agent can take a priori. Then yes, you can definitively say whether or not the principle breaks or not. Can you, though? Can you give a formal specification for the space of possible actions for AI and show that your loop never breaks, or breaks less than humans, or any other sensible criteria? If not, then you can't just give an abstract principle and start making inferences from that.
Did it find any real potential issue, optimization/simplification opportunities, or sparked any thought-provoking discussion within your organization?
Or was it purely a net negative experience?
You're the only one coming away thinking there was a net negative experience.
The only thought-ptovoking discussion should be "why the hell do we have this stochastic parrot anywhere near out codebase"
A system which will just randomly decide to give the legal team reasons to not back you up is:
* A system whose output will get brought up in lawsuits and make legal's job harder.
* A system that will make the dev team perpetually chase its tail while it oscillates between the several different valid interpretations of the rules.
Not saying that is the situation, I don’t know. But if “one error is too many” is your point of view… do you think the humans in these orgs are 100% perfect 100% of the time?
How many gaps have humans not caught?
> But if “one error is too many” is your point of view
Yes, in regulated industries "one error is too many" is the only right approach.
Yes, humans also make errors, and there you have a range of options: from tracing and finding the causes of the error (and tightening processes) to literally jailing those responsible. Your hallucination machine will happily "identify" 17 gaps, and create 34 more. And no, there are no processes to make it better. The "make no mistakes" incantation will happily be ignored for obvious reasons, regardless of how many forms of it you throw at it.
I love using AI tools as casinos. It's epic in helping to forge ideas and kickstart thought processes. You basically have the entirety of world knowledge at your fingertips to have a pint with.
> the code in question had already been reviewed by human counsel
The conversations had already been had and the product made compliant. Mythos just pulled new rules out of its ass and of course the product wasn't compliant with those. So they do a fire drill and find that to be the case at great expense.
Yeah you can frame it as "more checking is always better" if you wanted but that's just the same old "other people's resources are valueless" slight of hand we see on everything. It probably was mostly wasteful work.
So, in this case, the LLM's behavior was equivalent to the behavior of the resistance during WWII.
I think that book should be required reading for all engineering students.
But how are you so sure your colleagues are not more "expert" than you? Prior LLMs there was room for very good engineers and mediocre engineers to work together in 99% of the companies out there. With LLMs, only the "best" engineers will survive, because nobody needs mediocre engineers anymore.
This being HN, I imagine every engineer reading this thinks they are in top the 10-5% of their company/city/country, and therefore they think they are not "mediocre" engineers that can get affected by the introduction of LLMs. Statistically, they are probably wrong. So, it's all about ego. Chances are you are not a rockstar and LLMs will eventually take over your job.
As usual, the only winners here are corporations and executives. Most of us are the last monkeys in the chain, and so we'll get screwed.
But there are certainly 0.1x engineers
All of this to say that it's not just experience that makes one a better engineer.
This is giving too much credit to LLM. I think LLMs are great and it is incredibly useful both in personal and professional settings. However, it exist on a separate plane than human workers in the tools category.
Sooner or later, people will find out that LLMs only overlaps with existing human hierarchy (e.g. junior dev X%, senior dev Y%, etc), but almost never 100%. If it was 100% to a certain position, you are probably using the humans wrong to begin with there - since humans have one of the most priced thing that I don't see an single ounce out of LLMs: initiative
I don't think this is true.
A good engineer doesn't have infinite throughput. In my opinion the best engineers should be constantly bottlenecked because they solve difficult problems. They don't have time for grunt work. Every company needs less than perfect engineers, AI assisted or not.
LLMs are going to show that there's a huge divide in "engineers" between people who love "coding" and people who like "engineering".
The group of people kicking and screaming the most are the people who love code and don't want to see their coding go away.
These are typically the build vs buy folks. "We can't use anything anyone else wrote, I can do it better..."
What do you think Staff level engineers do? They don't sit around coding all day.
Writing the code is just something you had to do in the past to get the job done.
What you get paid to do is "engineer". The two are related, but they are separate. Coding is a very small part of the average engineer's job (and almost none at staff level and above).
And yet the vast majority of engineers think that the world is going to end if they aren't spending most of their time "coding".
The majority of my time is an engineering manager has been teaching “engineers” how to actually do engineering with any kind of rigor
The number of engineers who have an absolutely no theoretical structural or system basis for what they’re doing is the vast vast majority
Famously a net loss for humanity.
But, besides coding skills (which some possess), the engineering, social, and business ones are close to non existent.
There was also another study I cannot find where 56% of engineering graduates struggled to write a fizz buzz.
I think people highly underestimate how long is the average developer, closed in their bubbles of mostly well established software teams that forget that for each of them there's 10 software consultants in southern Europe glueing APIs with trial and error on Java 8 monstrosities.
Dunno how much longer that is going to remain true for your specific employer - all the fintech companies I deal with personally have had some sort of AI account for their devs since last year.
Even places like jane street have employees posting blogs (one of which was on HN frontpage about 60m ago) saying they mostly direct agents.
How long do you think your specific employer is going to hold out?
The real question is about accountability and liability.
When a major data leak is going to happen, who will they sue or fire ? That is the value engineers provide. They understand, confirm, and take ownership.
You, the IC, the developer prompting the code extruder, are ultimately responsible for its outputted code and its behaviour.
You may feel pressured to push out thousands of lines of code a day. You may see those thousands of lines refactored several times over the lifespan of a merge request. You may be asked to do this continue this in the long term with all the mental fatigue that entails.
When it's too much for you to sustainably deal with and you turn to using LLMs to review the code, that will still, presumably, fall on you at the end of the day.
The output is your responsibility.
I'm not even certain that laziness gets them further along than it used to; I think it's that people have not had their overconfidence painfully corrected yet. Behaviors will re-align pretty fast when people realize that no, they're not going to get away with just pressing a button and saying everything is "good". That is happening right now.
This goes for serious incidents, disasters, outages and security breaches.
If there was an investigation and the answer was "a piece of software was vibe coded with AI" why would anyone trust the software vendor after that?
Even Solarwinds is still alive.
So that is starting to dig deeper than a plain mistake. I guess we will soon-ish witness the first AI slop trial going on, this will be interesting to follow
If a nurse does something incorrectly, they can lose their license. Ensuring that nurse will never be a nurse again. There is a very clear path of accountability and very clear ways to mitigate it.
For instance, if a nurse is drunk and you recognize there is a pattern of people showing up drunk, you institute drug tests and breathalyzers and move on.
While we probably won't have LLM's autonomously performing procedures, they are 100% parsing documentation, reading lab results, making suggestions, etc. And right now, the burden has been placed squarely on the clinicians themselves. It'll feed them them the data, ask if they approve/agree, and then essentially wash their hands of accountability. Let's say an LLM starts incorrectly reading lab results, how is that fixed/remedied? A prompt update? Additional safeguards? Adjusting the temperature? Changing a model?
This is a far different type of engineering that still feels pretty new. Granted, I'm still an amateur in this space (I use Claude Code a decent bit), but it feels really opaque to me right.
It's a race to get first-to-market for backend integrations/features. It's given rise to a culture of "move fast break things" where safety is only for some core features, but absolutely not for the constellation of other services we provide. Failure rates have increased almost a percentage point since Codegen/LLM adoption was mandated from up top.
You would think regulators would be on top of this, but our industry runs on all actors "self reporting" their outages. Most don't unless they can't hide it (>1h)
Yes
I give LLMs snippets of text messages exchanges with my wife and I can't believe how dumb the LLMs are of getting basic facts right let alone nuance.
I'm 100% not one those "LLMs are just stochastic parrots" people, but coding and coding-like activities are extraordinarily well fit for LLMs, but for things that there's less training data, LLMs probably do a lot worse
I understand the frustration of spending years nurturing a skill and then seeing its value decline.But this isn’t really an LLM problem. The same thing happened to factory workers, typists, draftsmen, and many others before. The technology changes, but the underlying issue is the economic system we live in, where the market can suddenly decide that something you’ve spent years mastering is worth much less than before.
LLMs are not creating that dynamic. They’re just accelerating it.
I am not sure but for complex cases it seems to me that the earlier sum of moderately long PR time + moderately long review time has been replaced by very short PR time + even longer review time. I am not sure if there's a net gain in these cases. Sometimes even if the code is functionally correct, it's verbose enough (e.g., too many intermediate functions) that I think they will impact future reviews.
I'd posit there's another layer. You have domain knowledge, certainly. But more valuable still is the wisdom to find more.
Anthropic and OpenAI can stick financial regulations in the training data all they want, but the AI systems will never learn to anticipate the future, or reach out to clients, partners, or regulators in complicated situations.
Citation needed. I don’t see any reason these systems shouldn’t be able to speculate; indeed some would say that’s all they do, even about the past.
Most surprising to me about the article was the desire for OP's company to use AI for design docs. I feel like AI-generated design docs are some of the worst -- basically treating English as a programming language. They aren't enjoyable to read, and they often miss the forest for the trees. A human written sketch explaining why we're here and what we're working towards is still meaningful and important. If you want code-level details of every decision and algorithm, we have code for that.
I have mixed feelings on whether these documents are useful LLM inputs. I did a project where I carefully paired with Claude Code on producing a specification that another model would actually implement. I'm not sure it saved me any time, and it was very un-fun. (I kind of blame Opus 4.7 xhigh for this. It ain't speedy.) I feel like I can nitpick code to get exactly what I want, but defining exactly what I want an auto-mode LLM to go and do, in English, is much more difficult. I don't think the PLAN.md I generated would have been useful for a human trying to understand the system (too verbose), and Claude Code still made its usual mistakes that I have reminded it a billion times not to make (t.Context() in tests, not context.Background()!), so I'm just not sure it was worth it. I would say I probably wouldn't do it again in the near future. A rough sketch to get humans on board and to get the high level details worked out, written by hand, and then pairing with the LLM on actually typing in the code seems the most productive to me. But I do try to go outside my comfort zone once in a while to test the edges of these tools. They are very impressive and are worth a lot of the hype. (I know I will never write a YAML file again. I hate it more than anything, and Claude is amazing at it. But I worry I wouldn't feel the same way if I hadn't already had 8 years of k8s experience.)
Love the metaphor. Planes are sophisticated machines capable of auto-piloting, but humans are still needed to ultimately pilot the beast.
The real reason we have pilots in commerical jets is that they are a failsafe because a mistake in your automated system can kill 100+ people and guess what? It still does from time to time even with a human piloting the "beast."
It is not impossible to have fully automated planes. We just cannot afford to make any mistakes otherwise people will die so we keep pilots as failsafes.
Nobody will die if your claude code hallucinates.
I like your comment, want to try to expand on it
Comment long but there is a TL;DR at the bottom
My theory is that there are 4 areas to domain knowledge worth taking about here - there may be more but I like 2*2 matrices
1) explicit internal requirements - core of how the how the app should work towards achieving your business objectives - code expresses what should be done and to a pretty large extent, why it should be done - from business unit requirements - we are building a tool to do “X”
2) implicit internal requirements - core of how the how the app should work towards achieving your business parameters and constraints
eg profit = selling price - ( total of costs )
- code expresses what should be done but really can’t express why. At best it is in the comments
eg if market is EU then tax = 30% (or some value for a table), AI can see what is being done but rationale is not explicit3) implicit internal requirements - core of how the how the app should work towards achieving your business constraints - code expresses what should be done but really can’t express why. At best it is in the comments
eg if item is “rocket” , shipping = $1m ( we only make rockets in Antarctica and shipping from there is $1m)
4) implicit external requirements - core of how the how the app should work towards achieving your business constraints - code expresses what should be done but really can’t express why. At best it is in the comments
eg if item is “rocket” , add a 3 month gating stage to get approval from government to sell the item and do not collect payment till gating approved - AI can see the code but has no idea why it has to be done
These come from partners, regulation, compliance, auditability etc
So, my theory
AI can be good at the explicit stuff trivially (1, 2) but cannot be good at the implicit stuff (3,4)
It might be able to figure out implicit stuff needs to be done but will probably not be able to figure out why it needs to be done and it will definitely not be able to definitively figure out edge cases for when to do it / not do it
As long as you focus on implicit stuff, you will be fine for a little bit
TL;DR - become good and keep being good at being the person who understands the implicit external drivers of software dev
A lot of companies are investing money on “ai factories” that are join to automate a lot of software development (that is, steer LLMs) on the basis of jira tickets (or linear/trello cards or whatever).
I've seen first hand what less experienced developers produce using the same models, your 90% accuracy suddenly drops to 50%...
We had a PoC in place to get fabric, it had like 500 hours allocated for what I did in a week with cowork, and my product is actually on secure vnet network with Azure identity security with both a test and a production environment delivering actual data.
Cowork even made the damn powerpoint slideshows for decision makers.
The single saving grace right now is that it apparently isn't easy for everyone to do this yet. But I didn't use a whole lot of my knowledge on software engineering to make any of it happen, not even the pandas and arrow code that moves the data behind the scenes. I mainly used my knowledge of NIS2 compliance and general data architecture in a step-by-step process. To me anyone with common sense should be able of doing this, and I really don't think I'm special... but then I teach other people AI at our company and they can barely get it to create a running program. Which is fine for now, but I have to work another 20ish years before I retire, and by then a lot of young people will have grown up with AI, and like I said, I'm not special. I think the only thing that differentes me is that I mash the buttons until it works but also have decades of security and compliance hammered into me.
I've been a developer for 25 years
I learnt a lot about the domain and how to effectively write programs for it: PCI compliance, double-entry ledgers, escrows, reconciliation, payment lifecycles, bank transfer idempotency, etc.
It was, then, obvious that I should focus my career on becoming an expert on that domain to stand out as a professional and differentiate myself in a field that showed signs of an increasing need for domain specialists."
He said "Last year, I got hired by a company in the finance workspace.".