by wolttam 7 hours ago

by an0malous 7 hours ago

> why are we concluding that bigger models and more data = more hallucination?

That’s not what your quotes said. They said bigger models = plateau in intelligence, nothing about more data or increased hallucinations

The relevant quote for what you’re talking about would be:

> It’s been proven that when a model is trained on large volumes of highly factual and non-theoretical data, it learns to always have an answer.

So there’s two separate claims: 1) bigger models have plateauing results 2) models trained on larger amounts of factual data have a higher hallucination rate

I’m pretty sure #1 is well known; I think OpenAI’s own research on scaling laws showed diminishing returns on parameter count and training data volume years ago. I don’t know what the support for #2 is besides the actual post contents.

by jmalicki 6 hours ago

I find these internet arguments talking about LLMs as if they are trained by reading the internet to be wild.

Yes, pretraining still exists. But for the past few years, pretraining by reading the internet is just the initial bootstrapping of LLM training. The RL training they get from bespoke training data, with very very different characteristics than what these armchair analyses claim, dominates these days.

by MattRogish 4 hours ago

I'd have to imagine there are wildly diminishing marginal returns to additional SFT/post-training passes.

There are a bounded number of (useful) derivations/combinations of Duff's device.

If Frontier Labs wish to reduce hallucinations on factual things, they will have to hire people (or the data providers will need to) to do fundamental research above and beyond what is available in extant literature and the web. I.e., if the LLMs want to lower precision error, they need to go out and actually find more expertise. If the Wikipedia page for Pompey lacks data, where are they going to get it from? How would they even _identify_ that the page has holes?

Yes, they can digitize more books but that is untrustworthy data - if there were enough eyeballs on a particular work, it would be on the internet. If it's not, they'd need to hire the experts themselves. They need expert reviewers in virtually every interesting topic, which fundamentally is an intractable problem, especially since things change all the time. Maybe even uninteresting topics, too?

I dunno, it doesn't seem to me "more data" is the magic bullet here. Yeah, it will "help" but we're already on the flat part of the S shaped curve.

My take from trying to understand this stuff is some sort of algorithmic improvement is necessary to get another step change in how well LLMs perform in this area. I could be wrong!

by jmalicki 4 hours ago

As a side gig, I write novel software that solves problems no existing software does, that existing LLMs have difficulty reproducing, purely for the purpose of existing as LLM training data.

There are journalists being hired to write Atlantic-worthy articles that exist only as LLM training data, because they're getting paid more than the Atlantic would pay them for it.

It's insane.

Yes, they are hiring the experts themselves. To create new knowledge above and beyond what's on the internet. To be locked away as LLM training data.

The largest characteristic of all of this new data is that it is targeted at LLMs' weak points.

It's not just more data, it's custom tutorials built for what LLMs struggle at.

by MattRogish 3 hours ago

I'm not saying they are not trying - I'm saying we're inventing new problems faster than any Lab can:

1) Identify the gaps

2) Determine how to fix them

3) Implement a fix (especially if that fix is: identify and find experts)

4) And judge the result

How do they know [person] is an expert in [some field]? How do they find that person? How many experts are necessary to give the right information? How do we evaluate the results, especially if it's novel?

You can find a lot of people who disagree on many topics, and those turtles go all the way down.

I'm not disagreeing that your work will help reduce hallucinations and improve model performance! It will.

I predict (I hope I'm wrong!) that we're going to hit some asymptote that is not at 0% hallucinations (and I would even put a substantial nonzero probability that "overall" hallucination rate bottoms out at some minimum and then slowly grows because we just can't keep up with the new garbage we throw at it).

by sroussey 2 hours ago

> How do they know [person] is an expert in [some field]? How do they find that person?

You just stumbled upon billion dollar businesses: Mercor, micro1, Scale AI, Surge AI, etc

by jmalicki 3 hours ago

> How do they know [person] is an expert in [some field]? How do they find that person?

They have a PhD from a top school, they are a licensed attorney, they are a licensed physician, a board certified cardiologist, etc.

They are constantly recruiting from these populations with well-paying side gigs.

> 4) And judge the result

That's what they pay the experts for. And to have experts review the other experts with peer review.

> You can find a lot of people who disagree on many topics, and those turtles go all the way down.

Which is why everything has to be well-calibrated and not just a hot take - a well reasoned opinion any expert would find fair.

No one really cares about hallucinations on point facts these days though; it is much more about complex reasoning tasks. Can they move the bar on the complexity of software LLMs do on their own? Can they get to a point where LLMs can begin to replace physicians? Financial advisors? Actuaries? etc.

by maxnevermind 33 minutes ago

That is informative. I suspected that is how models improve their performance on some convoluted "non-googleable" benchmarks like SimpleBench: they just got a taste of those questions from publicly available samples and then hired people to generate similar questions and provide answers for them.

I wonder if extracting those static reasoning chains makes sense given Rich Sutton's "The Bitter Lesson" and Geoffrey Hinton's "People should stop training radiologists now." I guess the participants won't stop as long as they make money. I'm not sure they do; so far it is more about the expectation of profitability, as I understand it.

by jmalicki 4 minutes ago

On one level, these training data give examples of specific static reasoning chains.

Given exposure to enough reasoning chains, with training data that is designed around adversarial reasoning and teaching models to reason, these types of training data might be key to teaching models to reason beyond what they could gather from static data.

by macleginn 2 hours ago

> No one really cares about hallucinations on point facts these days though; it is much more about complex reasoning tasks.

The boundary is pretty thin there though. E.g., Gemini recently told me that a certain paper claims that two frameworks are mathematically equivalent, while the paper shows the opposite, and yesterday Google's AI overview told me that no World Cup matches were scheduled for that day despite there being several of them. The model probably used complex reasoning to arrive at both (incorrect) answers, but superficially they look like basic errors of fact.

by jmalicki 2 hours ago

That is a great example of the kind of thing they're paying people to create as training data.

You write the prompt, and then write rubrics to judge the responses, and you found something the model failed at. Congratulations, you just earned $500, now do it again.

by giardini 2 hours ago

Ahhhh! the ever-present omniscient "they" of paranoia!

But be careful: they are watching you and they don't want you giving away their secrets!

by ayewo 4 hours ago

1. How did you land the side gig? Mercor or a lesser-known brand?

2. What criteria do such vendors typically require?

by jmalicki 4 hours ago

I've done Mercor and other brands - the contracts move around, since the labs want the vendors to know they're just vendors and have to compete with each other. It seemed to be roughly resume and interview similar to getting hired at a senior role at FAANG or adjacent.

by victorbjorklund 4 hours ago

What kind of programs? Can you give an example of the tasks?

by giardini 2 hours ago

jmalicki says many things, among them being

"As a side gig, I write novel software that solves problems no existing software does,"

and

"Yes, they are hiring the experts themselves. To create new knowledge above and beyond what's on the internet. To be locked away as LLM training data."

More likely you're joking and/or paranoid! 8-))

by nomel 1 hour ago

> I write novel software that solves problems no existing software does

This is actually really easy to do if you step out of web/gui/crud and into something where you won't find public code, most ever, because it's trade secret. For example, manufacturing.

by jmalicki 38 minutes ago

There is also an endless fountain of things you come across every day and think "oh, wouldn't this complex solution to this low priority problem be cool", but no one ever implements it because it's too complex and the problem is low priority.

Anyone writing software for long enough has a long list of these things in the back of their head that are great fodder for LLM training data.

by knollimar 35 seconds ago

I make a Trello board to direct spare tokens at when I'm bored now!

by jmalicki 2 hours ago

I wish our actual world wasn't an implausible scifi novel!

by jgalt212 2 hours ago

Outside of games and coding, generating enough valid examples and counter-examples to harness the power of RL is cost prohibitive.

by jmalicki 2 hours ago

Which is why rubrics as rewards are used.

by jgalt212 10 minutes ago

still cost prohibitive.

by mcphage 5 hours ago

Where do they get the bespoke training data from? And how much? I don’t really know anything about this.

by jmalicki 3 hours ago

> And how much?

Mercor, one of the larger vendors for contracting with experts to create bespoke data, says on their webpage they're paying $3M/day to their contractors for data.

So well into the billions of dollars a year for bespoke training data.

That's also ignoring the RLVR data labs can get from software - they can use the vibe coding sessions as training data as well without paying more.

They are just one of many.
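
For a rough sense of scale, the daily figure above can be annualized. This is only back-of-the-envelope arithmetic: it assumes the quoted $3M/day is sustained every calendar day, and it covers that one vendor only, so any total across vendors is not something this sketch can establish.

```python
# Back-of-the-envelope annualization of the daily payout figure quoted above.
DAILY_PAYOUT_USD = 3_000_000  # the $3M/day figure cited in the comment
DAYS_PER_YEAR = 365           # assumption: the rate holds every calendar day


def annualized(daily_usd: float, days: int = DAYS_PER_YEAR) -> float:
    """Return the yearly spend implied by a constant daily rate."""
    return daily_usd * days


one_vendor = annualized(DAILY_PAYOUT_USD)
print(f"One vendor: ${one_vendor / 1e9:.3f}B per year")
```

Under those assumptions a single vendor lands at a bit over a billion dollars a year; the multi-billion total depends on the other vendors in the market.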

by blovescoffee 5 hours ago

Companies like Mercor sell data from human experts

by trothamel 4 hours ago

Offhand, do you know what format that data is in? Is it a question and then a human answering that question? Mostly just curious as to what the training data consists of.

by jmalicki 4 hours ago

The most advanced training data is in the form of rubrics as rewards.

A human asks a question, then writes rubrics to judge the LLM's response, so rather than evaluating a specific response, those rubrics can live on as the LLM evolves and gives different answers. There are more complex variants as well, but that's the basic principle.
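
A minimal sketch of that basic principle, not the method from the linked paper: a rubric is a list of weighted criteria, and the reward is the weighted fraction of criteria a response meets. The `RubricItem` class, the `rubric_reward` function, and the string-matching checks are all hypothetical stand-ins; in practice the judge would presumably be an LLM or a human rather than a substring test.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    description: str              # what a good answer must do
    weight: float                 # relative importance (positive)
    check: Callable[[str], bool]  # hypothetical judge for this criterion


def rubric_reward(response: str, rubric: List[RubricItem]) -> float:
    """Weighted fraction of rubric items the response satisfies, in [0, 1]."""
    total = sum(item.weight for item in rubric)
    if total <= 0:
        raise ValueError("rubric needs a positive total weight")
    earned = sum(item.weight for item in rubric if item.check(response))
    return earned / total


# Toy rubric for the prompt "What is 17 * 23?"
rubric = [
    RubricItem("Gives the correct product", 3.0, lambda r: "391" in r),
    RubricItem("Shows an intermediate step", 1.0, lambda r: "17 * 20" in r),
]

# The same rubric scores whatever answer each model version produces.
answers = {
    "older model": "It is 381.",
    "newer model": "The answer is 391.",
    "newest model": "17 * 20 = 340 and 17 * 3 = 51, so 391.",
}
for name, answer in answers.items():
    print(name, rubric_reward(answer, rubric))
```

The point of the design is that the rubric is written once per prompt and stays reusable as responses change, whereas a graded reference answer only covers the one response it was written for.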

https://arxiv.org/abs/2507.17746

by dominotw 4 hours ago

Meta has reallocated a significant portion of their staff to generating this

by sroussey 2 hours ago

Meta also reportedly took a 49% nonvoting stake in Scale AI in June 2025 for about $14.3–$14.8 billion.

by dominotw 4 hours ago

let me take down armchair analysis with my armchair analysis

by themgt 5 hours ago

> That’s not what your quotes said. They said bigger models = plateau in intelligence, nothing about more data or increased hallucinations ... I’m pretty sure #1 is well known

Well known in a multiverse branch where Fable was a dud?

by an0malous 4 hours ago

No, well known in the current multiverse branch where we still occasionally use things like math and scientific analysis instead of people’s vibe checks and pelican SVGs.

Here’s the paper from OpenAI where Dario himself was a co-author: https://arxiv.org/pdf/2001.08361

> We have observed consistent scalings of language model log-likelihood loss with non-embedding parameter count N, dataset size D, and optimized training computation Cmin, as encapsulated in Equations (1.5) and (1.6). Conversely, we find very weak dependence on many architectural and optimization hyperparameters. Since scalings with N,D,Cmin are power-laws, there are diminishing returns with increasing scale.
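
To make "power-laws, therefore diminishing returns" concrete, here is a small sketch of a loss of the form L(N) = (N_c / N)^alpha. The constants are assumptions for illustration, roughly the parameter-count values I recall from the linked paper, so treat the printed numbers as indicative only. The function name `loss` and the chosen model sizes are mine.

```python
ALPHA_N = 0.076  # assumed exponent, approximately the paper's parameter-count value
N_C = 8.8e13     # assumed scale constant, same caveat


def loss(n_params: float, n_c: float = N_C, alpha: float = ALPHA_N) -> float:
    """Power-law loss L(N) = (N_c / N) ** alpha."""
    return (n_c / n_params) ** alpha


sizes = [1e8, 1e9, 1e10, 1e11]  # each step is a 10x larger model
losses = [loss(n) for n in sizes]
# Absolute improvement bought by each successive 10x increase in parameters.
gains = [a - b for a, b in zip(losses, losses[1:])]

for n, value in zip(sizes, losses):
    print(f"N = {n:.0e}  loss = {value:.3f}")
print("absolute gain per 10x:", [round(g, 3) for g in gains])
```

Every 10x step multiplies the loss by the same factor (10 to the power of minus alpha), so the absolute gain per step keeps shrinking while the cost of each step grows tenfold. Whether that counts as a "plateau" is the part people in this thread disagree on.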

by themgt 3 hours ago

> instead of people’s vibe checks and pelican SVGs.

Right, what happened is everyone went to Fable and asked it to make the very best bicycle pelican SVG, no mistakes. And Fable's bicycle pelican SVGs were such timeless masterpieces, we all instantly got AI psychosis. Happily, you were immune to this.

by coffeefirst 5 hours ago

Yeah #2 may be incidental. Suppose one lab focused on bigger, and another on reinforcement training geared towards factual accuracy over sycophancy. You could easily wind up with a model from the second lab that is less powerful but more accurate.

I can’t prove it but I suspect there’s a bit of that going on.

by utkuumur 4 hours ago

I think one problem is that models that hallucinate often still get it right a few times out of 8 or 16, so they get good results on benchmarks, most of which measure success out of the top k. From a benchmark perspective, you don't really care whether 15 of your 16 generations failed, as long as one succeeded, but as a user you mostly care that the 1 out of 16 you get is actually the successful one. I think this effect is easier to see on Gemini Flash: it hallucinates like crazy but looks like it's by design to boost benchmarks.
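
A sketch of that gap, assuming the benchmark counts a task as solved if any of k sampled generations is correct (the usual pass@k setup; the thread does not say which benchmarks score this way). The function below is the standard combinatorial form: the chance that k draws without replacement from n generations, c of which are correct, include at least one correct one.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k samples, drawn without replacement
    from n generations of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


n, c = 16, 1  # 16 generations, only one of them correct
print("user with a single try:", pass_at_k(n, c, 1))
print("benchmark taking best of 16:", pass_at_k(n, c, 16))
```

With one correct generation in 16, the best-of-16 score is perfect while a single-try user succeeds about 6% of the time, which is the mismatch described above.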

by msdz 23 minutes ago

> it hallucinates like crazy but looks like it's by design to boost benchmarks.

Wasn’t there a discussion around some new-ish benchmark _punishing_ hallucination answers (over not replying at all) recently? Maybe in the not-so-distant future, this “spam replies until one’s correct” strategy won’t be able to game a benchmark much at all anymore.
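
I don't know which benchmark that was, so the following is only a sketch of the general idea with a made-up scoring rule: +1 for a correct answer, 0 for abstaining, and minus `wrong_penalty` for a wrong answer. Both function names and the penalty values are hypothetical.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score for answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float) -> bool:
    """Answer only if doing so beats abstaining, which scores 0."""
    return expected_score(p_correct, wrong_penalty) > 0.0


for penalty in (0.0, 1.0, 3.0):
    # Break-even confidence: p - (1 - p) * penalty = 0  =>  p = penalty / (1 + penalty)
    threshold = penalty / (1.0 + penalty)
    print(f"penalty={penalty}: answer only when P(correct) > {threshold:.2f}")
```

With no penalty, guessing is always worth it, which is the incentive being complained about. Once wrong answers cost something, a model that is unsure scores better by saying it doesn't know.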

by bilater 2 hours ago

Yeah not only is it totally unsubstantiated, the benchmarks are getting less useful to really show the difference between these models. Big model smell is still a thing and GLM 5.2 while impressive is not Fable class.

Here is something I would like people to chew on. Perhaps the smartest researchers in the world across multiple labs know more about this than we do? Perhaps they are aware of issues like the data wall and diminishing marginal returns. And perhaps they are being honest when they tell you there is no wall?

by nathan_compton 1 hour ago

Are the smartest researchers in the world out there saying there isn't a wall? I don't know of any people doing the actual R&D who frequently make outrageous claims.

by bilater 1 hour ago

https://x.com/polynoamial/status/2064210146558136827

by djvdq 56 minutes ago

I'd say that as an OpenAI employee he's kinda biased on the topic

by eurekin 7 hours ago

> A shift is happening among major AI labs, who are becoming increasingly skeptical of endless parameter count and training data scaling

I'm pretty sure it's mostly due to the training data quality. No idea why this never gets mentioned in those discussions.

It was obvious right from the get-go that the scaling law just enabled some abilities that were described by the underlying data, allowing the ANN to abstract them in the latent space.

by aurareturn 5 hours ago

Aren't hallucinations also heavily influenced by compute and memory capacity? I.e., companies can spend more time verifying results in an agentic format, spend more thinking tokens, and use less quantization. All of these heavily depend on compute and memory but are proven to decrease hallucinations.

Maybe GPT 5.5 is heavily nerfed due to lack of compute, memory, and energy?

I agree that it's farfetched to conclude that bigger models have plateaued.

by dominotw 4 hours ago

article specifically talks about this. DeepSeek spending significant test time with worse results than GLM

by aurareturn 20 minutes ago

So GLM is just a better model than DS then?

by madduci 7 hours ago

Isn't that a case of overfitting? You have more data, but when you ask something that's not in that data, hallucinations happen

by utopiah 3 hours ago

>> it is clear that actual intelligence has plateaued significantly.

> These are wild claims -

Indeed, it is not clear there was any actual intelligence at any point.

A lot of generated content sure, sometimes even useful, but not necessarily anything more.

by ozgung 2 hours ago

What is the definition of "actual intelligence"? How does it differ from regular intelligence and non-intelligence?

If someone can "design a custom asyncio event loop policy in that overrides get_child_watcher()", I would call that person intelligent. Does that mean that person is not actually intelligent but a mere content creation machine?

Traditionally, if you can create content, this shows you're intelligent. Created content is often called "intellectual" property. If a person can understand complex ideas and make connections between them, that is considered intellectual work. You have to be intelligent to do intellectual work. If a person can solve problems, this is also called intelligence. If the person can solve more complex problems, that person is said to have higher intelligence. This is often measured with a scale called IQ (Intelligence Quotient). There are other types of intelligence but they are basically variations of the same ability. Most definitions of intelligence also involve an ability to adapt to the environment.

Since intelligence is such a broad concept what exactly is the difference between the actual intelligence and AI, other than one is natural and the other one is artificial?

I understand being anti-AI because of the very real societal concerns. But ignoring what is in front of you is not a solution.

by coldtea 7 hours ago

>These are wild claims - why are we concluding that bigger models and more data = more hallucination?

Because that's what they measured in this case.

by blurbleblurble 6 hours ago

How do we know GPT 5.5 is a bigger model?

by Phelinofist 6 hours ago

Since it was created by _Open_AI surely it's really open and we can check, right? SCNR

by claytongulick 4 hours ago

My impression is that the fundamental issue is that LLMs attempt to extract reasoning (executive function) from data (relationship between tokens).

There's an open question about whether this is theoretically possible, but it doesn't seem like it to me.

Human generated data is an effect of reasoning. Attempting to extract executive function from it is kind of like taking an anti-derivative of a function.

This has always seemed like the root of hallucinations to me. It sort of follows the parallels to lossy compression that a lot of people draw. You're extracting some characteristics by observing the relationship between tokens, and then trying to argue that those characteristics are equivalent to the thing that generated the original tokens.

Surely there's some sort of overlap there, but viewed that way, it seems obvious that more and more parameters and scaling won't solve the fundamental problem. There's only so much meaning you can extract from token relationships.

It's like trying to derive the shape of a flame from the smoke it produces.

The original intelligence that created those tokens was driven by a whole universe of inputs, from hormones to starlight to gravity, not to mention all of the strange things about consciousness and parapsychology that are so poorly understood.

The machines are definitely useful for a certain class of tasks - those that don't require much executive function, and the useful work mostly involves pattern matching.

The problem is, we seem to be mistaking effect for cause and imagining that these things have greater capabilities than they'll ever possess.

The investors that don't understand this are indeed going to learn a bitter lesson.

by resters 4 hours ago

To train models to be smarter than they are, one needs examples and cases to train on, and once you get close to the top percentiles of human reasoning there is extremely little such material available.

You can create contrived logic problems, but they often turn into language games because English is not formal logic.

And you can train on "monty hall" style problems, but those too are language games that are intriguing to humans but obvious when framed slightly differently.

In other words, model trainers are fighting against the overwhelming mediocrity of the training corpus (all of the recorded human output from history).

As models improve, the next phase will be models co-designed with humans to overcome these limits. The way we use language and the process we use to problem solve (we currently call this "orchestration") will evolve as part of this. Meatspace metaphors map badly when we have massive context and don't need the same limits. How different is hallucination from extrapolation, etc.

Much of the skepticism and confusion about LLMs is no different than a person of average intelligence hearing a highly intelligent person explain something and considering the explanation gibberish, then arrogantly accusing the intelligent person of being unhelpful.

Much like dogs were domesticated from wolves to have traits that make them good around humans, LLMs will evolve around our limits, around our arrogance, around our aesthetic biases and prejudices. Intelligence and rationality is fundamentally not what most humans want from an LLM.

by dominotw 4 hours ago

you mixed two random quotes from the article to create a strawman.

of course you knew what you were doing but disappointing that this was top comment.

by harrall 5 hours ago

In cognitive science, it appears your brain has two modes of thinking:

- A very parallel type of computation that is fast and generally accurate and integrates hundreds of variables. It’s sometimes labeled as intuition or system 1 thinking.

- A much slower, step by step, analytical type, commonly linked with your pre-frontal cortex (one of the newest parts of the brain). Sometimes called system 2 thinking.

Maybe the way the universe works is that all computation more or less is one of those two types. In which case, an LLM alone is only the first part, which is often right but its results also cannot ever be proven.

by stevemk14ebr 5 hours ago

An LLM is not thinking. Assuming it is, and relating it to thought and universal truths, is nonsense.

by dgellow 5 hours ago

We inflicted that on ourselves by picking the most confusing terminology ever. "No, reasoning isn't thinking. No when the model says it thinks it's not actually thinking... No an agent isn't actually a creature with agency... No, when we say it hallucinates it doesn't, like, actually hallucinate"

by triyambakam 3 hours ago

What were the alternatives?

by brookst 5 hours ago

Did you mean sentient?
