That’s not what your quotes said. They said bigger models = plateau in intelligence, nothing about more data or increased hallucinations.
The relevant quote for what you’re talking about would be:
> It’s been proven that when a model is trained on large volumes of highly factual and non-theoretical data, it learns to always have an answer.
So there are two separate claims: 1) bigger models have plateauing results; 2) models trained on larger amounts of factual data have a higher hallucination rate.
I’m pretty sure #1 is well known; I think OpenAI’s own research on scaling laws showed diminishing returns on parameter count and training data volume years ago. I don’t know what the support for #2 is besides the actual post contents.
Yes, pretraining still exists. But for the past few years, pretraining by reading the internet is just the initial bootstrapping of LLM training. The RL training they get from bespoke training data, with very very different characteristics than what these armchair analyses claim, dominates these days.
There are a bounded number of (useful) derivations/combinations of Duff's device.
If frontier labs wish to reduce hallucinations on factual things, they will have to hire people (or the data providers will need to) to do fundamental research above and beyond what is available in extant literature and the web. I.e., if the LLMs want to lower precision error, they need to go out and actually find more expertise. If the Wikipedia page for Pompey lacks data, where are they going to get it from? How would they even _identify_ that the page has holes?
Yes, they can digitize more books, but that is untrustworthy data: if there were enough eyeballs on a particular work, it would be on the internet. If it's not, they'd need to hire the experts themselves. They need expert reviewers in virtually every interesting topic, which fundamentally is an intractable problem, especially since things change all the time. Maybe even uninteresting topics, too?
I dunno, it doesn't seem to me "more data" is the magic bullet here. Yeah, it will "help", but we're already on the flat part of the S-shaped curve.
My take from trying to understand this stuff is some sort of algorithmic improvement is necessary to get another step change in how well LLMs perform in this area. I could be wrong!
There are journalists being hired to write Atlantic-worthy articles that exist only as LLM training data, because they're getting paid more than the Atlantic would pay them for it.
It's insane.
Yes, they are hiring the experts themselves. To create new knowledge above and beyond what's on the internet. To be locked away as LLM training data.
The biggest characteristic of all of this new data is that it is targeted at LLMs' weak points.
It's not just more data, it's custom tutorials built for what LLMs struggle at.
1) Identify the gaps
2) Determine how to fix them
3) Implement a fix (especially if that fix is: identify and find experts)
4) And judge the result
How do they know [person] is an expert in [some field]? How do they find that person? How many experts are necessary to give the right information? How do we evaluate the results, especially if it's novel?
You can find a lot of people who disagree on many topics, and those turtles go all the way down.
I'm not in disagreement that your work will help reduce hallucinations and improve model performance! It will.
I predict (I hope I'm wrong!) that we're going to hit some asymptote that is not at 0% hallucinations (and I would even put a substantial nonzero probability that "overall" hallucination rate bottoms out at some minimum and then slowly grows because we just can't keep up with the new garbage we throw at it).
You just stumbled upon billion dollar businesses: Mercor, micro1, Scale AI, Surge AI, etc
They have a PhD from a top school, they are a licensed attorney, they are a licensed physician, a board certified cardiologist, etc.
They are constantly recruiting from these populations with well-paying side gigs.
> 4) And judge the result
That's what they pay the experts for. And to have experts review the other experts with peer review.
> You can find a lot of people who disagree on many topics, and those turtles go all the way down.
Which is why everything has to be well-calibrated and not just a hot take - a well reasoned opinion any expert would find fair.
No one really cares about hallucinations on point facts these days though; it is much more about complex reasoning tasks. Can they move the bar on the complexity of software LLMs do on their own? Can they get to a point where LLMs can begin to replace physicians? Financial advisors? Actuaries? etc.
I wonder if extracting those static reasoning chains makes sense given Rich Sutton's "The Bitter Lesson" and Geoffrey Hinton's "People should stop training radiologists now." I guess participants won't stop as long as they make money. I'm not sure they do; so far it is more about the expectation of profitability, as I understand it.
Given exposure to enough reasoning chains, with training data that is designed around adversarial reasoning and teaching models to reason, these types of training data might be key to teaching models to reason beyond what they could gather from static data.
The boundary is pretty thin there though. E.g., Gemini recently told me that a certain paper claims that two frameworks are mathematically equivalent, while the paper shows the opposite, and yesterday Google's AI overview told me that no World Cup matches were scheduled for that day despite there being several of them. The model probably used complex reasoning to arrive at both (incorrect) answers, but superficially they look like basic errors of fact.
You write the prompt, and then write rubrics to judge the responses, and you found something the model failed at. Congratulations, you just earned $500, now do it again.
But be careful: they are watching you and they don't want you giving away their secrets!
2. What criteria do such vendors typically require?
"As a side gig, I write novel software that solves problems no existing software does,"
and
"Yes, they are hiring the experts themselves. To create new knowledge above and beyond what's on the internet. To be locked away as LLM training data."
More likely you're joking and/or paranoid!8-))
This is actually really easy to do if you step out of web/gui/crud and into something where you will almost never find public code, because it's trade secret. For example, manufacturing.
Anyone writing software for long enough has a long list of these things in the back of their head that are great fodder for LLM training data.
Mercor, one of the larger vendors for contracting with experts to create bespoke data, says on their webpage they're paying $3M/day to their contractors for data.
At $3M/day that is about $1.1B a year from one vendor alone, so the industry as a whole is likely well into the billions of dollars a year for bespoke training data.
That's also ignoring the RLVR data labs can get from software - they can use the vibe coding sessions as training data as well without paying more.
They are just one of many.
A human asks a question, then writes rubrics to judge the LLM's response, so rather than evaluating a specific response, those rubrics can live on as the LLM evolves and gives different answers. There are more complex variants as well, but that's the basic principle.
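A minimal sketch of that basic principle, under loud assumptions: the `Criterion` type, the `score` function, and the keyword checks are all hypothetical, and real rubrics are presumably graded by people or judge models rather than string matching. The point it illustrates is only that the rubric is written once and can then score any future response to the same prompt.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    """One rubric line: a description, a pass/fail check, and a weight (all hypothetical)."""
    description: str
    check: Callable[[str], bool]
    weight: float = 1.0


def score(response: str, rubric: List[Criterion]) -> float:
    """Return the weighted fraction of rubric criteria the response satisfies."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric needs at least one criterion with positive weight")
    earned = sum(c.weight for c in rubric if c.check(response))
    return earned / total


# Toy rubric for a question about scaling laws; keyword checks stand in for expert judgment.
rubric = [
    Criterion("mentions diminishing returns", lambda r: "diminishing returns" in r.lower(), 2.0),
    Criterion("cites a source", lambda r: "arxiv.org" in r.lower()),
    Criterion("avoids overclaiming", lambda r: "definitely" not in r.lower()),
]

# The same rubric scores responses from different model versions.
old_answer = "It definitely never plateaus."
new_answer = "Power laws imply diminishing returns, see arxiv.org/abs/2001.08361"
print(score(old_answer, rubric), score(new_answer, rubric))
```

Because the rubric is decoupled from any one response, it keeps working as a regression check when the model changes, which is the property the comment describes.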
Well known in a multiverse branch where Fable was a dud?
Here’s the paper from OpenAI where Dario himself was a co-author: https://arxiv.org/pdf/2001.08361
> We have observed consistent scalings of language model log-likelihood loss with non-embedding parameter count N, dataset size D, and optimized training computation Cmin, as encapsulated in Equations (1.5) and (1.6). Conversely, we find very weak dependence on many architectural and optimization hyperparameters. Since scalings with N,D,Cmin are power-laws, there are diminishing returns with increasing scale.
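To make "power laws mean diminishing returns" concrete, here is a small sketch assuming a pure power law in parameter count, L(N) = (N_c / N) ** alpha. The constants `ALPHA_N` and `N_C` are placeholders of roughly the magnitude discussed in the paper; treat them as assumptions, not quoted results. What matters is the shape: every 10x in N multiplies the loss by the same factor, so the absolute gain per 10x keeps shrinking.

```python
# Assumed constants for illustration only; not quoted from the paper.
ALPHA_N = 0.076   # power-law exponent for parameter count (assumption)
N_C = 8.8e13      # scale constant in non-embedding parameters (assumption)


def loss(n_params: float, alpha: float = ALPHA_N, n_c: float = N_C) -> float:
    """Loss under a pure power law in parameter count: (n_c / n_params) ** alpha."""
    return (n_c / n_params) ** alpha


def gains_per_step(sizes):
    """Absolute loss reduction between consecutive model sizes."""
    losses = [loss(n) for n in sizes]
    return [before - after for before, after in zip(losses, losses[1:])]


sizes = [1e8, 1e9, 1e10, 1e11, 1e12]
for n, gain in zip(sizes[1:], gains_per_step(sizes)):
    # Each row is another 10x in parameters; the gain column shrinks every time.
    print(f"N={n:.0e}  loss={loss(n):.3f}  gain over previous 10x={gain:.3f}")
```

Note that this shows diminishing returns, not a hard plateau: under a pure power law the loss keeps falling, just more slowly per unit of scale.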
Right, what happened is everyone went to Fable and asked it to make the very best bicycle pelican SVG, no mistakes. And Fable's bicycle pelican SVGs were such timeless masterpieces, we all instantly got AI psychosis. Happily, you were immune to this.
I can’t prove it but I suspect there’s a bit of that going on.
Wasn’t there a discussion recently around some new-ish benchmark _punishing_ hallucinated answers (relative to not replying at all)? Maybe in the not-so-distant future, this “spam replies until one’s correct” strategy won’t be able to game a benchmark much at all anymore.
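A sketch of why such a penalty would change the incentive. This is not the scoring rule of any particular benchmark; the +1 / 0 / negative scheme and the names `expected_score` and `should_answer` are assumptions for illustration. With no penalty, guessing always beats abstaining; with a penalty, answering only pays off above a confidence threshold of penalty / (1 + penalty).

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of answering: +1 if right, -wrong_penalty if wrong (assumed scheme)."""
    return p_correct - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float, abstain_score: float = 0.0) -> bool:
    """Answer only if the expected score beats the score for abstaining."""
    return expected_score(p_correct, wrong_penalty) > abstain_score


for penalty in (0.0, 1.0, 3.0):
    # Break-even confidence: p = penalty / (1 + penalty)
    threshold = penalty / (1.0 + penalty)
    print(f"penalty={penalty}: guessing pays off only above p={threshold:.2f}")
```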
Here is something I would like people to chew on. Perhaps the smartest researchers in the world across multiple labs know more about this than we do? Perhaps they are aware of issues like the data wall and diminishing marginal returns. And perhaps they are being honest when they tell you there is no wall?
I'm pretty sure it's mostly due to the training data quality. No idea why this never gets mentioned in those discussions.
It was obvious right from the get-go that the scaling law just enabled some abilities that were described by the underlying data, allowing the ANN to abstract them in the latent space.
Maybe GPT 5.5 is heavily nerfed due to lack of compute, memory, and energy?
I agree that it's far-fetched to conclude that bigger models have plateaued.
> These are wild claims -
Indeed, it is not clear there was any actual intelligence at any point.
A lot of generated content sure, sometimes even useful, but not necessarily anything more.
If someone can "design a custom asyncio event loop policy in that overrides get_child_watcher()", I would call that person intelligent. Does that mean that person is not actually intelligent but a mere content creation machine?
Traditionally, if you can create content, this shows you're intelligent. Created content is often called "intellectual" property. If a person can understand complex ideas and make connections between them, that is considered intellectual work. You have to be intelligent to do intellectual work. If a person can solve problems, this is also called intelligence. If the person can solve more complex problems, that person is said to have higher intelligence. This is often measured with a scale called IQ (Intelligence Quotient). There are other types of intelligence, but they are basically variations of the same ability. Most definitions of intelligence also involve an ability to adapt to the environment.
Since intelligence is such a broad concept, what exactly is the difference between actual intelligence and AI, other than that one is natural and the other is artificial?
I understand being anti-AI because of the very real societal concerns. But ignoring what is in front of you is not a solution.
Because that's what they measured in this case.
There's an open question about whether this is theoretically possible, but it doesn't seem like it to me.
Human generated data is an effect of reasoning. Attempting to extract executive function from it is kind of like taking an anti-derivative of a function.
This has always seemed like the root of hallucinations to me. It sort of follows the parallels to lossy compression that a lot of people draw. You're extracting some characteristics by observing the relationship between tokens, and then trying to argue that those characteristics are equivalent to the thing that generated the original tokens.
Surely there's some sort of overlap there, but viewed that way, it seems obvious that more and more parameters and scaling won't solve the fundamental problem. There's only so much meaning you can extract from token relationships.
It's like trying to derive the shape of a flame from the smoke it produces.
The original intelligence that created those tokens was driven by a whole universe of inputs, from hormones to starlight to gravity, not to mention all of the strange things about consciousness and parapsychology that are so poorly understood.
The machines are definitely useful for a certain class of tasks - those that don't require much executive function, and the useful work mostly involves pattern matching.
The problem is, we seem to be mistaking effect for cause and imagining that these things have greater capabilities than they'll ever possess.
The investors that don't understand this are indeed going to learn a bitter lesson.
You can create contrived logic problems, but they often turn into language games because English is not formal logic.
And you can train on "monty hall" style problems, but those too are language games that are intriguing to humans but obvious when framed slightly differently.
In other words, model trainers are fighting against the overwhelming mediocrity of the training corpus (all of the recorded human output from history).
As models improve, the next phase will be models co-designed with humans to overcome these limits. The way we use language and the process we use to problem solve (we currently call this "orchestration") will evolve as part of this. Meatspace metaphors map badly when we have massive context and don't need the same limits. How different is hallucination from extrapolation, etc.
Much of the skepticism and confusion about LLMs is no different than a person of average intelligence hearing a highly intelligent person explain something and considering the explanation gibberish, then arrogantly accusing the intelligent person of being unhelpful.
Much like dogs were domesticated from wolves to have traits that make them good around humans, LLMs will evolve around our limits, around our arrogance, around our aesthetic biases and prejudices. Intelligence and rationality are fundamentally not what most humans want from an LLM.
Of course you knew what you were doing, but it's disappointing that this was the top comment.
- A very parallel type of computation that is fast and generally accurate and integrates hundreds of variables. It’s sometimes labeled as intuition or system 1 thinking.
- A much slower, step by step, analytical type, commonly linked with your pre-frontal cortex (one of the newest parts of the brain). Sometimes called system 2 thinking.
Maybe the way the universe works is that all computation is more or less one of those two types. In which case, an LLM alone is only the first part, which is often right but whose results can never be proven.
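One way to picture the two types working together is a propose-then-verify loop. This is a toy sketch, not a claim about how any LLM system is built: `guess_roots` plays the fast, unverified "system 1" role and `is_exact_root` plays the slow, exact "system 2" role; both names and the square-root task are made up for illustration.

```python
from typing import Callable, Iterable, Optional


def guess_roots(n: int) -> list:
    """Fast 'intuition': a float square root and its neighbours. Plausible, never proven."""
    r = int(n ** 0.5)
    return [r, r + 1, r - 1]


def is_exact_root(n: int) -> Callable[[int], bool]:
    """Slow 'analysis': exact integer check that a candidate really is the root of n."""
    return lambda c: c >= 0 and c * c == n


def solve(candidates: Iterable[int], verify: Callable[[int], bool]) -> Optional[int]:
    """Return the first candidate that passes verification, or None if none does."""
    for candidate in candidates:
        if verify(candidate):
            return candidate
    return None


print(solve(guess_roots(49), is_exact_root(49)))  # a verified answer
print(solve(guess_roots(50), is_exact_root(50)))  # no guess survives checking
```

The proposer alone would happily return 7 for 50; only the verifier turns "often right" into "checked".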