by aesthesia19 hours ago |

by ComputerGuru5 hours ago|

I’m not disagreeing with you but at the same time, models don’t “know” anything in that binary sense. I’m not trying to get into the weeds here, I genuinely mean that what you pass off as a simple explanation is actually incredibly nuanced. A fact appeared once in training data, a fact never appeared in the training data, a fact appeared ten times, a fact appeared a thousand times. Which does the model know? Facts aren’t stored as-is, they’re all broken down into their components and compressed in the weights. “Similar” facts that didn’t appear an overwhelming number of times get bundled together and eventually conflated. But then what is a similar fact? Which facts were entirely ablated vs which were bundled together with others, effectively poisoning the pool but also giving it inference strength? The model doesn’t know anything and can never know what it knows or doesn’t know.

by unshavedyak4 hours ago|

I often wonder how humans "know" things. I suspect (ignorant armchair) we have some ability to signal strength of those facts, via repetition. Without this layer of introspection I imagine LLMs can never avoid hallucination.

It obviously breaks down with humans too, given we so easily hallucinate and confuse things we "know". However I still suspect we're more reliable at probing information we've experienced vs not. Even in the case of poisoned knowledge, e.g. a crime scene accidentally implying information to a witness that the witness doesn't actually know, we still "know" that poisoned information via incorrect inference. I.e. we "experienced" it.

Wonder what architecture would allow for this style of information/weight probing for an LLM.

by in-silico18 hours ago|

Additionally, maybe it's easier for a model to realize that it doesn't know the answer when the question is easier.

If Opus gets all but the hardest questions right, it might have a higher hallucination rate because the questions it gets wrong are the questions where verification or hallucination detection is the most difficult.

by andix6 hours ago|

I guess you can test that on hypotheticals. Ask about things after the knowledge cut off that never happened. Or ask things that are genuinely unsolvable.

by reinitctxoffset15 hours ago|

Hallucination should be called "failure to ground".

Something about the cost model of US near frontier has the cattle prod out whenever a model is uncertain but thrashes on whether to search. Search flinch is roughly all hallucination.

I don't even wait for the model's turn, if there's a man page or Hoogle hit, stuff the last prefix cache cut point. You come out ahead.

by sudosysgen14 hours ago|

This is missing a common failure mode, which is information past the knowledge cutoff. If you need info past that time they'll fail no matter how big or small the model is, so the hallucination rate can matter independently of the knowledge base. If all use-cases had a uniform risk of falling out of support, this would be a valid argument, but since it's often the case that a datapoint is guaranteed to fall out of support, the absolute ability to recognize that is crucial.

by gymbeaux14 hours ago|

Those numbers are abysmal. Should we really be using LLMs to write our code? I have a theory- LLMs can spit out code that gets the job done and looks ok, maybe even great, but contains small “anomalies” that compound over time. An enterprise app developed entirely with LLM-happy devs might end up virtually unmaintainable.

I’m not sure how to explain it, but the more I see LLM-written code the more I feel it’s bad code doing a good job of masquerading as good code. I think this take will become less-hot in the next year or two when we see enterprise greenfield projects that were created entirely with LLM “assistance” go to prod. I think we’ll find that the code is difficult for humans to read, understand, debug, and extend- and I think the larger the codebase the harder it will be for LLMs to maintain. More opportunity for hallucination, larger context windows needed, more tokens bought and spent for smaller and smaller code changes. I think the more code an LLM writes for an app, the worse that codebase becomes.

by andybak10 hours ago|

I can't help but feel that people continually underestimate how bad human written code becomes over time. The exception is probably single-person passion projects or open source projects that maintain quality governance over time.

I strongly suspect most closed source code developed under commercial or internal pressure is pretty awful after a few years of development.

All LLM code has to do is suck less than existing code. And that's presuming the code quality doesn't improve as the models, the harnesses and our ways of working with them improve.

by embedding-shape9 hours ago|

Sucky human-written code is still based on human understanding, which can change over time, be readjusted or solidified. People implement something wrong once, then update their perspective, then in the future do it right.

LLMs don't have this benefit. You forget to add the correction to the system prompt, and the LLM will repeat the same mistake over and over, and worse than that, their mistakes aren't based on their understanding, it's basically random guesses.

Humans, even bad coders, still seem to have some sort of architecture in mind, even if it's spaghetti, whereas LLMs (obviously) don't think more than a few steps, and never about the full scope of what they're contributing to, and on purpose too, because you want the context to be as small as possible when you work with LLMs.

With LLMs you need to tread carefully between "What does the LLM need to know?" and "Can I skip passing this to the LLM this time?" while a human you can more or less dump them everything you sit on, and let them shift it through, and they'll mostly make it out OK.

by andybak6 hours ago|

> their mistakes aren't based on their understanding, it's basically random guesses

Whilst I don't claim any true "understanding", as that is a very loaded term, that doesn't mean it's just random guesses.

Anyone using recent LLM coding agents on a regular basis would probably agree that there's something going on that fits some non-anthropomorphizing, non-sentience-assigning definition of "understanding".

As for the point about improvement - I think that's an orthogonal issue to the overall code quality. With regard to human codebases - there's plenty of scenarios that negate the improvement of individuals. We're comparing organizations with LLMs - not individuals with LLMs and that makes a significant difference.

by 8note1 hours ago|

> while a human you can more or less dump them everything you sit on, and let them shift it through, and they'll mostly make it out OK

i dont see why software engineers are paid so well, and are so hard to hire?

just dump a bunch of requirements on a homeless person and itll just work out

by wizzledonker5 hours ago|

I think the real issue might be that how “good” the code is matters less than being able to form a mental model for what the human who wrote the code was “thinking”. If written by a machine, this contract is broken and we get more confused, even if our traditional methods of evaluating the code come out equal.

by layer83 hours ago|

That doesn’t help the developers who have high standards.

by andybak3 hours ago|

Yes. But that's not the point I'm addressing.

by xzenor9 hours ago|

And where do you think the LLM learned coding from?

But anyway, let the LLM verify the code to give advice on improvements but don't let it write code unverified. That's my opinion on it anyway.

by O5vYtytb9 hours ago|

I've been sent code from vendors that didn't even compile, long before LLMs were a thing. Most shops that aren't primarily software have really really terrible software.

by xvinci11 hours ago|

Not my observation. If you never look at the code and don't have basic guardrails in place (linters, architecture tests, some guidelines for best practices) - probably.

But as soon as you do minimal reviews and high-level corrections, applications turn out just fine.

Can there be bugs? Sure. That's the price of not reading or understanding every line. It should depend on the criticality of your software how much of these you tolerate and how much you don't (reviewing, understanding, testing everything 100% like you were used to if you had written it yourself will kill most if not all of your gained speed)

But I never got the impression of unmaintainability or unfixable bugs.

Actually the other way around: A really good cleanup pass, architectural changes, or bugfixes are seldom more than a few prompts and 2 hours away, provided your overall base is decent and you actually gave a fuck from the start.

by VBprogrammer11 hours ago|

> Can there be bugs? Sure. That's the price of not reading or understanding every line.

I've yet to come across a human developer whose output would meet this standard, despite writing every line.

In fact, having an LLM review our code is catching quite a few bugs before it reaches QA.

by ben_w10 hours ago|

Indeed, though I find the distribution is different.

The humans may skip unit tests and need reminding; the AI always writes unit tests once it's in AGENTS.md or whatever, but my experience* was that 5-10% of the time the LLM's attempt at a "test" would, instead of executing the code and examining the results, open the source code as a text file and run a regex to find/exclude certain substrings.

* At the start of this year, because Anthropic and OpenAI were both offering free trials. IDK how much things have changed since then, some things change fast in this domain, other things don't.
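
To make the failure mode above concrete, here is a minimal Python sketch of the difference between a "test" that pattern-matches source text and one that executes the code. Everything in it is hypothetical and made up for illustration: the `apply_discount` function, its deliberate bug, and both check functions. The source is held in a string rather than read from a file so the example is self-contained.

```python
import re

# Hypothetical module under test, held as text so the example is self-contained.
# It has a deliberate bug: it returns the discount, not the discounted price.
BUGGY_SOURCE = '''
def apply_discount(price, percent):
    return price * percent / 100
'''


def regex_style_check(source: str) -> bool:
    """Anti-pattern: 'test' by pattern-matching the source text."""
    return re.search(r"percent\s*/\s*100", source) is not None


def behavioural_check(func) -> bool:
    """Real test: call the function and compare the result."""
    return func(200, 10) == 180


# Load the function from the source text so it can actually be called.
namespace = {}
exec(BUGGY_SOURCE, namespace)
apply_discount = namespace["apply_discount"]

print("regex check passes:      ", regex_style_check(BUGGY_SOURCE))
print("behavioural check passes:", behavioural_check(apply_discount))
```

The regex-style check is satisfied because the expected substring is present, while the behavioural check catches the bug because it looks at what the function actually returns.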

by baq10 hours ago|

I’ve been piloting LLMs for the past six months non stop and we’re at the point where formally verified models generated as an intermediate step between spec and code are very good value.

Riding the exponential means you have to update priors more often.

by dezgeg9 hours ago|

I have seen some pre-AI over-mocked codebases where the "tests" were essentially that (but harder to read than regex would have been).

by realaleris14911 hours ago|

Take a look at a sufficiently old random internal repo which was not written with LLMs and compare.

My observation is that they are equally bad and hard to maintain or even more so than the new ones.

One thing I’ve noticed is that the LLM assisted ones have a lot more comments, which is nice but takes more time to read.

by otabdeveloper48 hours ago|

Yes, LLMs generate technical debt.

And they do it faster than any human developer.

by rienbdj9 hours ago|

I have a theory that LLM generated code in a highly modular style (simple data, pure functions) will be easier to “recover” by a human team when the LLM gets muddled. So Haskell, basically.

by realusername13 hours ago|

> code that gets the job done and looks ok, maybe even great, but contains small “anomalies” that compound over time

They clearly are only assistants for the moment, you can use them to do work ... but only if you could do the said work yourself alone in the first place.

by ben_w10 hours ago|

I would say "only if you can review said work yourself alone", rather than "do".

I'm an experienced developer, but I don't count myself as a web dev or a python dev; I can review the web and python stuff I get out of the AI (sometimes I need to ask the AI follow-up questions so I can find official documentation for what it did), but I can't write it.

by realusername7 hours ago|

I think you could eventually do it then, it would just take you longer.

by ben_w7 hours ago|

If "eventually" counts, I can say I have "run" a marathon (I have walked that distance in one session, or if you don't like that verb I can sum all the various occasions I've run and that sum almost certainly exceeded 42.2 km before I finished school).

But the difference I allude to here is more like how "book reviewer" is a different job than "book author": yes, if you can review a book, you can also write one. Eventually.

by csomar9 hours ago|

Easy fix: Code's basically free now, so just pipe your errors straight into an LLM and get instant patches. Sure, the patches themselves are broken too, but no worries! Just pipe those back in again. Code's disposable now, fresh code generated on every request.

On a more serious note, I think the problem will be the inability to handle/maintain the systems once they are too big and nobody has any idea what's inside of them or what they do.

by Espressosaurus4 hours ago|

Yeah, it’s so easy to generate code that you can do a whole codebase rewrite in a day.

Is this a good idea? Probably not—in the past we would only do that when the architecture was causing serious problems since it always has tons of behaviors that will accidentally not get carried forward, some of which are load bearing and will cause bugs.

Now we can do it in an afternoon and get the same long term bug behavior.

by Foobar856813 hours ago|

Have you worked with enterprise apps? The ones I have used for decades are hot garbage.

by IsTom12 hours ago|

Now imagine decades of LLM code. Extrapolating the rate of increase of LoC, the source code ain't gonna fit on hard drives anymore.

by grayhatter17 hours ago|

> Hallucination rate scores are a little tricky to interpret because they're conditional on the model not knowing the answer. That means they don't measure the probability of your encountering a hallucination in everyday use, since that also depends on the probability of the model not knowing the answer, as well as how well your distribution of tasks aligns with the distribution tested in the eval.

Do you have a cite for this?

If a human makes up some bullshit lie, I wouldn't accuse them of making it up only if they actually knew the correct answer. If you don't know, the only correct answer is I don't know. Any other answer is made up bullshit. Why is it only a hallucination if and only if the LLM contains the answer? If you make something up it's still wrong. It shouldn't matter if you could give the correct answer. You didn't, and invented some bullshit instead.

Follow up question, how can I apply this rule set to the next test I have to take? I'd love to be able to use "I didn't know" as the excuse for why I made something up.

edit:

> and it's not totally clear that this is the main metric that's worth tracking.

I don't know, the rate at which some model is willing to make up something feels useful. If the argument I see repeated on HN so much is that it's impossible to completely get rid of hallucinations; being able to choose a model that's less likely to invent some lie seems like a positive trait, no?

Either way, I'm happy to agree that a restrictive definition, where a lie doesn't count as a hallucination iff the model doesn't know the answer feels strictly, infinitely less useful than an exact error rate. What percentage of emitted tokens are misleading would be useful for me. Anyone know any group that's attempted to quantify the global error rate?

by aesthesia15 hours ago|

This isn't quite the point. When comparing two different models' hallucination rates, the denominator is different. The evaluation works more or less like this: for each question, the model has the option to answer or abstain, so there are three possible outcomes: the model answers and gets it right, the model answers and gets it wrong (hallucination), or the model abstains. The hallucination rate is (model answers wrong) / (model answers wrong or abstains). So if a model A has 50 correct answers, 20 incorrect answers, and 30 abstentions, its hallucination rate is 40%, while a model with 20 correct answers, 20 incorrect answers, and 60 abstentions has a hallucination rate of 25%, even though it hallucinated exactly the same number of times. This is why hallucination rate is incomplete as a metric: it says nothing about the accuracy rate.
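
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the metric as described above: wrong answers divided by wrong answers plus abstentions. The `EvalCounts` class and the function names are illustrative assumptions, not taken from any eval harness, and the counts are the two examples from this comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCounts:
    """Per-model outcome counts; the field names are illustrative."""
    correct: int
    incorrect: int
    abstained: int


def hallucination_rate(c: EvalCounts) -> float:
    """Wrong answers divided by every question the model did not get right."""
    not_correct = c.incorrect + c.abstained
    return c.incorrect / not_correct if not_correct else 0.0


def accuracy(c: EvalCounts) -> float:
    """Correct answers divided by all questions asked."""
    total = c.correct + c.incorrect + c.abstained
    return c.correct / total if total else 0.0


model_a = EvalCounts(correct=50, incorrect=20, abstained=30)
model_b = EvalCounts(correct=20, incorrect=20, abstained=60)

for name, counts in (("A", model_a), ("B", model_b)):
    print(name, f"hallucination={hallucination_rate(counts):.0%}",
          f"accuracy={accuracy(counts):.0%}")
```

Both models give 20 wrong answers, but the denominator differs (50 versus 80), which is why the two rates are hard to compare without also looking at accuracy.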

by grayhatter1 hours ago|

The way you define the evaluation criteria seems very problematic[^1].

I don't understand the point of describing it as 3 possible outcomes. I objected to it because the only reason I would do something like that would be to obscure the severity of the model defects. I'm sure I'm missing something, but the reason I suspect that's how it's done, is to [intentionally] obscure the actual meaningful metrics.

I would expect any engineer to evaluate any model using accuracy (error rate) and usefulness (definitive answer rate) as strictly independent metrics. Did it answer, and if it answered, did it emit incorrect or misleading information, and how many quantifiable bits of each?

The false negative rate (model confirmed to contain the requested output/information via other method but was unable to produce it for the given test) is significant, but giving a non-definitive answer is significantly different from a definitive and incorrect answer. Why would you want to group hallucinations?

Number/rate of useful answers (correct and incorrect) and error rate (given any answer how often will that answer be defective in some way).

To be clear, I'm differentiating hallucination rate from eagerness to answer, even though they're obviously linked, because I believe presenting 20 correct answers, 20 incorrect answers, and 60 abstentions as a hallucination rate of 25% is obviously malicious. If I give you 40 answers, 20 correct and 20 incorrect, the error rate is 50%, and if it refused an additional 60 times, its usefulness rate would be 40%... arguably 20% depending on how strict you choose to be about the definition of useful. The matrix we should be using is a 2x2: true positive, false positive, true negative, false negative. But being that honest might make the model look bad!

[^1]: just in case it's unclear, I'm using you exclusively rhetorically. I don't think you personally are being misleading, only that you're explaining how it's done... but that's the problem isn't it.
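
A rough Python sketch of the two independent metrics argued for above, using the same 20 correct / 20 incorrect / 60 abstained example. The function names are made up for illustration, and reading "usefulness" as either "gave any definitive answer" or "gave a correct answer" is an assumption about the two readings (40% vs 20%) mentioned in the comment.

```python
def error_rate(correct: int, incorrect: int) -> float:
    """Share of definitive answers that were wrong (abstentions excluded)."""
    answered = correct + incorrect
    return incorrect / answered if answered else 0.0


def answer_rate(correct: int, incorrect: int, abstained: int) -> float:
    """Share of all questions that got any definitive answer."""
    total = correct + incorrect + abstained
    return (correct + incorrect) / total if total else 0.0


def correct_rate(correct: int, incorrect: int, abstained: int) -> float:
    """Stricter reading of 'useful': share of all questions answered correctly."""
    total = correct + incorrect + abstained
    return correct / total if total else 0.0


correct, incorrect, abstained = 20, 20, 60

print(f"error rate:   {error_rate(correct, incorrect):.0%}")
print(f"answer rate:  {answer_rate(correct, incorrect, abstained):.0%}")
print(f"correct rate: {correct_rate(correct, incorrect, abstained):.0%}")
```

Keeping the error rate conditional on the model having answered, and reporting the answer rate separately, means neither number moves just because the model abstains more or less often.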

by jpalomaki9 hours ago|

As a human I also give wrong answers even if I know the right one. Sometimes I also give answers even when I don’t really know them.

When pushed, I then start thinking and realise my mistake. System 1 vs 2?

by big_paps3 hours ago|

I realized that people from India often show this kind of behaviour in my experience. They superconfidently give you a wrong answer and walk away, or even help by making things worse and then disappear shrugging... Are you from India?

by grayhatter5 hours ago|

That's weird, why do you do that?

When someone asks a question, if I don't know the answer; I say I don't know.

System 1 vs 2 doesn't really matter... I won't use an LLM that's willing to make up random shit. Equally I also won't work with a human who does that. Trust and confidence that a system will function correctly is an important quality, in both humans and genai.

by sgc16 hours ago|

Since models just output the most probable tokens and you can never accuse them of doing anything other than making it all up, I would like to see these tests run with a prompt that attempts to mitigate hallucination and finishes with something like: "Telling me that you don't have the relevant information or that the task is impossible is extremely useful to me and a valid answer", and see how much that changes the scoring - as well as the usefulness of the answers. There are so many skills like context7 that can be tweaked to improve these results as well.

In other words, you shouldn't choose the model that hallucinates the least without detailed prompting, since a well-crafted agents.md clause should go a long way to improving output, and almost certainly the top scoring order will be different. To the point that I don't find this type of raw comparison useful beyond maybe 'make sure you test that one with more explicit prompts'.

by grayhatter15 hours ago|

> In other words, you shouldn't choose the model that hallucinates the least without detailed prompting

"You're prompting it wrong" is quickly becoming the new "you're holding it wrong".

It's wild how willing software engineers are to blame the user when the actual problem is their own defective design.

Ideally we all, as an industry, will stop accepting this as a reasonable excuse for the demonstrated incompetence.

by ordersofmag7 hours ago|

It's not that you're prompting it wrong. It's that you're judging the output against a standard (human intelligence) that just isn't relevant--no matter how much we want it to be and no matter how much the fluency of the output tricks us into thinking there's a human-like mind behind it.

Now granted, if the boat salesmen were pushing hard on the idea that the boat would fly and even put little wings on the side and I bought the boat, I might get really angry when I found out that it didn't fly. And I might angrily storm into the salesroom yelling about how the design is defective. But if someone pointed out 'hey, it's a boat, perhaps you should stick to sailing around in it and stop getting your undies in a bundle about it not flying' the correct response is probably to take a closer look, ignore the salesmen, and cruise around the lake. LLMs are quite handy at some things and have some weird limits. Learn the limits, enjoy your time at sea.

by grayhatter5 hours ago|

> It's not that you're prompting it wrong. It's that you're judging the output against a standard (human intelligence) that just isn't relevant

It's not that you're holding it wrong, you're just wrong for expecting it to work the way it's described (able to one shot most problems these days).

by luuundonjk12 hours ago|

there is a difference between a human knowingly bullshitting and being confident because he misremembers something

by master-lincoln10 hours ago|

there is a difference in their intent, but not necessarily in the effect.
