We have long historical experience and innate tools for detecting and mitigating errors made by humans. If we can't apply those to automation, then even fewer total mistakes may end up being a worse outcome.
But the most reasonable take, which I'm happy to see reflected in so many comments in this thread, is… use both.
Do an AI pass, and have humans verify, and vice versa. Let the humans drive the AI. Then the unique shortcomings of each party can be covered by the other's strengths.
It might beat an underresourced human review on time, efficiency, and cost metrics. But on the metric of accuracy, throwing unlimited humans at a problem will still beat throwing unlimited AI at it.
You can do that, sure. But doing so negates any improvements in speed the LLM brought. And at that point, you may as well just do it yourself to begin with.
I use GenAI tools a lot when coding, but I do not vibe code. I go through everything it generated, and we iterate. And yes, it doesn't save me a lot of time. But what it does do is free up mental capacity in a similar manner. But instead of syntax, it's more complicated patterns. Maybe I don't remember how to stitch something together, but I know it can be done. Instead of spending the time to look it up and then code it, I just tell it to do it for me.
Or are current AIs too similar for that to be fruitful?
Regulation questions. Even the simple ones, AI gets wrong all the time. It wasn't Mythos, but other models like Opus.
I can adjust my view on this topic if/when we get access to Mythos.
Genuine question: your top coder seems to be producing the most error-free code from your perspective, has the deepest knowledge of the architecture and codebase, and is faster on the trigger than the others.
But your top coder has proven and verifiable dementia, where they will confidently assume the existence of APIs and code that do not exist, mix up the purpose of others and forget other things, and you can't predict when and how they will introduce errors into the system or the severity of such errors.
Are you really comfortable letting this person with dementia generate most of your codebase in the airline and health industries?
I also hope you have an iron-clad agreement that prevents the model provider from doing silent updates, because all the evidence of correctness you have collected thus far goes out the window in that case.
Another genuine question:
You have witnessed a human coder and the AI you're using make the same important mistake. Assuming you do not have the time and resources to retrain, fine-tune, and test your frontier model:
Who would you trust not to make the same mistake multiple times in the future after you have warned them that their job depends on it, the AI or the human?
The parent is implying they would prefer an AI when working in the airline and health industries because it makes fewer errors. Read the comment again.
They have not said, "Hey, I work in the airline and health industry and I'd love to use AI for a couple of the bullshit IT UIs we have as long as we can put guardrails on the AI to stay in its lane."
I asked a yes-or-no question. The guardrails you can put in place to mitigate AI errors are the same ones we had for humans pre-AI (tests, regressions, reviews). If you were wary of employing a top lead engineer with verifiable dementia prior to AI for a mission-critical system, logic implies you should think twice before giving that much responsibility to an AI as well.
> The hallucination thing I think is mostly overblown
Can you predict when and how the SOTA model will hallucinate? Yes or no. Can you predict the severity impact of that error beforehand? Yes or no.
> from speaking to colleagues it seems to vary wildly depending on which model and harness you are using
You have partially answered my question, it would seem.
No, but the same can be said for your colleagues. You might call what the LLM does hallucinations, I'd call them mistakes. I think we have totally forgotten that humans make them all the time and are confidently wrong too.
Your original question doesn't really get to the bottom of the point I'm trying to make, and I don't really feel it fairly represents the issue we are talking about here. They are not the same things.
Also, if a human does this, you can replace them and get a human who will not do it. The default for an LLM is to generate plausible-looking text that may or may not be completely incoherent. That is not the default for a human. Again, if you find that your colleague consistently fabricates APIs, you can hire someone who isn't crazy instead, but you cannot do the same with LLMs.
That's absolutely false. My colleagues don't routinely and confidently invent APIs that are not there, or spectacularly and repeatedly misunderstand the purpose of certain functions, or exhibit extreme forgetfulness. Especially when I've warned them. Hallucinations and confabulations in otherwise healthy individuals are mental disorders. When I ask them why they made a certain kind of error, I can expect to get a reasonable answer. No one has uttered the phrase "Bob hallucinated again while writing those tests" when the Bob in question is a human.
Calling hallucinations simply mistakes does not seem to me to be a healthy way to reason about LLMs. I can ask a colleague how well they can program in Ada and adjust my expectations on productivity and bug rates. I can't ask an LLM how well it can code in Ada (just a throwaway example), or even how much of Ada was in its training data. I have to actually spend money and spend time code reviewing before I can even formulate any expectations at all.
Well, too bad. The problem is that they also produce things much faster than humans, so errors will compound quicker.