They might make fewer mistakes, but those mistakes aren't evenly distributed. Their mistakes don't come from faulty logic; they come from gaps in the training data and how large of a span they have to bridge in the latent space. Just as they aren't smart like humans, they aren't stupid like humans. Don't mistake rate for quality.

by Terr_ 1 hour ago

Yeah, this starts to overlap with some autonomous vehicle stuff, where I like to say that the rate of errors is not the shape or distribution of errors.

We have long historical experience and innate tools for detecting and mitigating errors made by humans. If we can't apply those to automation, then even fewer total mistakes may end up being a worse outcome.

by csallen 4 hours ago

For some reason, tons of people seem to be in camps at both extremes. It's either "AI sucks don't trust it!" or "AI is so much better than humans!"

But the most reasonable take, which I'm happy to see reflected in so many comments in this thread, is… use both.

Do an AI pass, and have humans verify, and vice versa. Let the humans drive the AI. Then the unique shortcomings of each party can be covered by the other's strengths.

by hammock 4 hours ago

AI review is never going to beat a fully resourced human review.

It might beat an under-resourced human review on time, efficiency, and cost metrics. But on the metric of accuracy, throwing unlimited humans at a problem will still beat throwing unlimited AI at it.

by esafak 3 hours ago

That's an irrelevant comparison because cost is always a constraint, so there are not going to be unlimited AI or humans. The question is how to optimally combine them for a given cost.

by bigstrat2003 3 hours ago

> Do an AI pass, and have humans verify, and vice versa. Let the humans drive the AI.

You can do that, sure. But doing so negates any improvements in speed the LLM brought. And at that point, you may as well just do it yourself to begin with.

by jghn 3 hours ago

When Google showed up on the scene, I found I no longer needed to memorize basic syntax and other such things. If I couldn't remember on the fly, I'd just do a quick Google search and move on. This freed space in my mind to instead focus on bigger & better things.

I use GenAI tools when coding a lot, but I do not vibe code. I go through everything it generates, and we iterate. And yes, it doesn't save me a lot of time. But what it does do is free up mental capacity in a similar manner. But instead of syntax, it's more complicated patterns. Maybe I don't remember how to stitch something together, but I know it can be done. Instead of spending the time to look it up and then code it, I just tell it to do it for me.

by skillina 2 hours ago

Yeah, humans reviewing the AI review can only detect the false positives, where the LLM claims something is non-compliant and flags it for review/correction by a human or another agent. Human review can’t find the false negatives (true deficiencies not flagged) unless you do a full audit yourself to find whatever deficiencies the AI missed.
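
To make that asymmetry concrete, here is a minimal sketch in Python. The `Finding` class, the `review_flagged_only` function, and the four sample items are all made up for illustration; they are not from any real compliance tool. The sketch assumes the human inspects only what the AI flagged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    name: str
    truly_deficient: bool  # ground truth, which the reviewer does not know up front
    flagged_by_ai: bool    # whether the AI review reported it


def review_flagged_only(findings):
    """Simulate a human who inspects only the items the AI flagged."""
    inspected = [f for f in findings if f.flagged_by_ai]
    confirmed = [f.name for f in inspected if f.truly_deficient]
    false_positives = [f.name for f in inspected if not f.truly_deficient]
    # False negatives are never inspected, so this review cannot surface them.
    undetected = [
        f.name for f in findings if f.truly_deficient and not f.flagged_by_ai
    ]
    return confirmed, false_positives, undetected


# Hypothetical sample data covering all four combinations.
findings = [
    Finding("missing audit log", truly_deficient=True, flagged_by_ai=True),
    Finding("unusual but valid config", truly_deficient=False, flagged_by_ai=True),
    Finding("expired certificate", truly_deficient=True, flagged_by_ai=False),
    Finding("routine form", truly_deficient=False, flagged_by_ai=False),
]

confirmed, false_positives, undetected = review_flagged_only(findings)
print("confirmed:", confirmed)
print("false positives caught:", false_positives)
print("still undetected:", undetected)
```

The "expired certificate" item only shows up here because the simulation knows the ground truth. A real reviewer working from the flagged list alone would never see it without a full audit.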

by csallen 2 hours ago

I feel like you're missing the point that it's more thorough to use both. Speed isn't the only factor that matters.

by BurningFrog 2 hours ago

This makes sense, but a logical next step is to have one AI write code, and then have another AI, instead of humans, verify it.

Or are current AIs too similar for that to be fruitful?

by suttontom 1 hour ago

This is commonly known as "LLM-as-a-judge", and anecdotally, multiple people I know who write code using OpenRouter or multiple models say it's surprisingly effective. It's strange that there don't appear to have been any major papers on it since ~early 2025, which at this point is basically ancient history.

by criticalfault 3 hours ago

Not according to my experience.

Regulation questions, even the simple ones, AI gets wrong all the time. It wasn't Mythos, but other models like Opus.

I can adjust my view on this topic if/when we get access to Mythos.

by sillyfluke 4 hours ago

>I'd still place a bet that the SOA models make _far_ less mistakes than humans.

Genuine question: your top coder seems to be producing the most error-free code from your perspective, has the deepest knowledge of the architecture and codebase, and is faster on the trigger than the others.

But your top coder has proven and verifiable dementia, where they will confidently assume the existence of APIs and code that do not exist, mix up the purpose of others, and forget other things, and you can't predict when and how they will introduce errors into the system or the severity of such errors.

Are you really comfortable letting this person with dementia generate most of your codebase in the airline and health industry?

I also hope you have an iron-clad agreement that prevents the model provider from doing silent updates because all your evidence of correctness you collected thus far goes out the window in that case.

Another genuine question:

You have witnessed a human coder and the AI you're using make the same important mistake. Assume you do not have the time and resources to retrain, fine-tune, and test your frontier model:

Who would you trust not to make the same mistake multiple times in the future after you have warned them that their job depends on it, the AI or the human?

by deanc 4 hours ago

Your top coder has guard rails in place to prevent him autonomously going free - right? This is how you should approach agentic development with LLMs. Like it or not, we are the final bastion, the gatekeepers. The hallucination thing I think is mostly overblown and from speaking to colleagues it seems to vary wildly depending on which model and harness you are using - always go for SOTA. In the last 3 months I can count on one hand the times it's done something wrong, and that's primarily because I'm operating it with guard rails and giving it context.

by sillyfluke 3 hours ago

>Your top coder has guard rails in place to prevent him autonomously going free - right?

The parent is implying they would prefer an AI when working in the airline and health industry because it makes fewer errors. Read the comment again.

They have not said, "Hey, I work in the airline and health industry and I'd love to use AI for a couple of the bullshit IT UIs we have as long as we can put guardrails on the AI to stay in its lane."

I asked a yes or no question. The guardrails you can put in place to mitigate errors are the same guardrails used pre-AI for humans (tests, regressions, reviews). If you were wary of employing a top lead engineer with verifiable dementia prior to AI for a mission-critical system, logic implies you should think twice about giving that much responsibility to an AI as well.

> The hallucination thing I think is mostly overblown

Can you predict when and how the SOTA model will hallucinate? Yes or no. Can you predict the severity impact of that error beforehand? Yes or no.

>from speaking to colleagues it seems to vary wildly depending on which model and harness you are using

You have partially answered my question, it would seem.

by deanc 2 hours ago

> Can you predict when and how the SOTA model will hallucinate? Yes or no. Can you predict the severity impact of that error beforehand? Yes or no.

No, but the same can be said for your colleagues. You might call what the LLM does hallucinations, I'd call them mistakes. I think we have totally forgotten that humans make them all the time and are confidently wrong too.

Your original question doesn't really get to the bottom of the point I'm trying to make, and I don't really feel it fairly represents the issue we are talking about here. They are not the same things.

by vor_ 17 minutes ago

If a human were hallucinating and polluting a codebase with errors, they would be fired and possibly treated for dementia. Even worse, an LLM is trained to produce plausible-looking results, so its mistakes are harder to detect.

by suttontom 1 hour ago

This is such a tired, meaningless argument. I've never seen a human in 10 years of professional software engineering at a large company ever so confidently, consistently create and send out seemingly well-reasoned code that's as wrong as what SOTA models using CC or Codex do. If a human did this, they would be fired or perpetually remain a junior who no one wants to work with.

Also, if a human does this, you can replace them and get a human who will not do it. The default for an LLM is to generate plausible-looking text that may or may not be completely incoherent. That is not the default for a human. Again, if you find that your colleague consistently fabricates APIs, you can hire someone who isn't crazy instead, but you cannot do the same with LLMs.

by sillyfluke 2 hours ago

>No, but the same can be said for your colleagues.

That's absolutely false. My colleagues don't routinely and confidently invent APIs that are not there, or spectacularly and repeatedly misunderstand the purpose of certain functions, or exhibit extreme forgetfulness. Especially when I've warned them. Hallucinations and confabulations in otherwise healthy individuals are mental disorders. When I ask them why they made a certain kind of error, I can expect to get a reasonable answer. No one has uttered the phrase "Bob hallucinated again while writing those tests" when the Bob in question is a human.

by deanc 1 hour ago

Well, your experience doesn't align with mine. I have been using, and am part of an organisation that is extensively using, Claude with Opus for everything for about 3 months now and I am not experiencing the problems you describe. We'll have to agree to disagree here.

by sillyfluke 1 hour ago

That is fine. "Your experience may vary" is the crux of my argument, amusingly. You can't have just realized that people are having different experiences using AI, or even that the same person has different experiences when they change domains or technical contexts. There have been lots of comments scattered across this forum to that effect.

Calling hallucinations simply mistakes does not seem to me to be a healthy way to reason about LLMs. I can ask a colleague how well they can program in Ada and adjust my expectations on productivity and bug rates. I can't ask an LLM how well it can code in Ada (just a throwaway example), or even how much Ada was in its training data. I have to actually spend money and spend time code reviewing before I can even formulate any expectations at all.

by realusername 4 hours ago

> I'd still place a bet that the SOA models make _far_ less mistakes than humans.

Well too bad, the problem is that they also produce things much faster than humans so errors will compound quicker.

by porridgeraisin 4 hours ago

This stupid argument again. The number of mistakes _does not matter_. Get. This. In. Your. Head. The predictability of the _type_ of error is what matters. For LLMs and machine learning in general the error distribution is not what you would expect and it is not possible to predict either.