by jmalicki 6 hours ago |

by MattRogish 4 hours ago |

I'd have to imagine there are wildly diminishing marginal returns to additional SFT/post-training passes.

There are a bounded number of (useful) derivations/combinations of Duff's device.

If frontier labs wish to reduce hallucinations on factual things, they will have to hire people (or the data providers will need to) to do fundamental research above and beyond what is available in extant literature and the web. I.e., if the LLMs want to lower precision error, they need to go out and actually find more expertise. If the Wikipedia page for Pompey lacks data, where are they going to get it from? How would they even _identify_ that the page has holes?

Yes, they can digitize more books but that is untrustworthy data - if there were enough eyeballs on a particular work, it would be on the internet. If it's not, they'd need to hire the experts themselves. They need expert reviewers in virtually every interesting topic, which fundamentally is an intractable problem, especially since things change all the time. Maybe even uninteresting topics, too?

I dunno, it doesn't seem to me "more data" is the magic bullet here. Yeah, it will "help" but we're already on the flat part of the S-shaped curve.

My take from trying to understand this stuff is some sort of algorithmic improvement is necessary to get another step change in how well LLMs perform in this area. I could be wrong!

by jmalicki 4 hours ago |

As a side gig, I write novel software that solves problems no existing software does, that existing LLMs have difficulty reproducing, purely for the purpose of existing as LLM training data.

There are journalists being hired to write Atlantic-worthy articles that exist only as LLM training data, because they're getting paid more than the Atlantic would pay them for it.

It's insane.

Yes, they are hiring the experts themselves. To create new knowledge above and beyond what's on the internet. To be locked away as LLM training data.

The defining characteristic of all this new data is that it is targeted at LLMs' weak points.

It's not just more data, it's custom tutorials built for what LLMs struggle with.

by MattRogish 3 hours ago |

I'm not saying they are not trying - I'm saying we're inventing new problems faster than any Lab can:

1) Identify the gaps

2) Determine how to fix them

3) Implement a fix (especially if that fix is: identify and find experts)

4) And judge the result

How do they know [person] is an expert in [some field]? How do they find that person? How many experts are necessary to give the right information? How do we evaluate the results, especially if it's novel?

You can find a lot of people who disagree on many topics, and those turtles go all the way down.

I'm not in disagreement that your work will help reduce hallucinations and improve model performance! It will.

I predict (I hope I'm wrong!) that we're going to hit some asymptote that is not at 0% hallucinations (and I would even put a substantial nonzero probability that "overall" hallucination rate bottoms out at some minimum and then slowly grows because we just can't keep up with the new garbage we throw at it).

by sroussey 2 hours ago |

> How do they know [person] is an expert in [some field]? How do they find that person?

You just stumbled upon billion-dollar businesses: Mercor, micro1, Scale AI, Surge AI, etc.

by jmalicki 3 hours ago |

> How do they know [person] is an expert in [some field]? How do they find that person?

They have a PhD from a top school, they are a licensed attorney, they are a licensed physician, a board certified cardiologist, etc.

They are constantly recruiting from these populations with well-paying side gigs.

> 4) And judge the result

That's what they pay the experts for. And to have experts review the other experts with peer review.

> You can find a lot of people who disagree on many topics, and those turtles go all the way down.

Which is why everything has to be well-calibrated and not just a hot take - a well-reasoned opinion any expert would find fair.

No one is really caring about hallucinations on point facts these days though, it is much more about complex reasoning tasks. Can they move the bar on the complexity of software LLMs do on their own? Can they get to a point where LLMs can begin to replace physicians? Financial advisors? Actuaries? etc.

by maxnevermind 30 minutes ago |

That is informative. I suspected that is how models improve their performance on convoluted "non-googleable" benchmarks like SimpleBench: they got a taste of those questions from publicly available samples, then hired people to generate similar questions and provide answers to them.

I wonder if extracting those static reasoning chains makes sense given Rich Sutton's "The Bitter Lesson" and Geoffrey Hinton's "People should stop training radiologists now." I guess participants won't stop as long as they make money, though I'm not sure they do; so far it is more about the expectation of profitability, as I understand it.

by jmalicki 2 minutes ago |

At one level, these training data give examples of specific static reasoning chains.

Given exposure to enough reasoning chains, with training data that is designed around adversarial reasoning and teaching models to reason, these types of training data might be key to teaching models to reason beyond what they could gather from static data.

by macleginn 2 hours ago |

> No one is really caring about hallucinations on point facts these days though, it is much more about complex reasoning tasks.

The boundary is pretty thin there though. E.g., Gemini recently told me that a certain paper claims that two frameworks are mathematically equivalent, while the paper shows the opposite, and yesterday Google's AI overview told me that no World Cup matches were scheduled for that day despite there being several of them. The model probably used complex reasoning to arrive at both (incorrect) answers, but superficially they look like basic errors of fact.

by jmalicki 2 hours ago |

That is a great example of the kind of thing they're paying people to create as training data.

You write the prompt, and then write rubrics to judge the responses, and you found something the model failed at. Congratulations, you just earned $500, now do it again.

by giardini 2 hours ago |

Ahhhh! the ever-present omniscient "they" of paranoia!

But be careful: they are watching you and they don't want you giving away their secrets!

by ayewo 4 hours ago |

1. How did you land the side gig? Mercor or a lesser-known brand?

2. What criteria do such vendors typically require?

by jmalicki 4 hours ago |

I've done Mercor and other brands - the contracts move around, since the labs want the vendors to know they're just vendors and have to compete with each other. The process was roughly a resume and an interview, similar to getting hired for a senior role at a FAANG or adjacent company.

by victorbjorklund 4 hours ago |

What kind of programs? Can you give an example of the tasks?

by giardini 2 hours ago |

jmalicki says many things, among them being

"As a side gig, I write novel software that solves problems no existing software does,"

and

"Yes, they are hiring the experts themselves. To create new knowledge above and beyond what's on the internet. To be locked away as LLM training data."

More likely you're joking and/or paranoid! 8-))

by nomel 1 hour ago |

> I write novel software that solves problems no existing software does

This is actually really easy to do if you step out of web/GUI/CRUD and into something where you will almost never find public code, because it's a trade secret. For example, manufacturing.

by jmalicki 36 minutes ago |

There is also an endless fountain of things you come across every day and think "oh, wouldn't this complex solution to this low priority problem be cool", but no one ever implements it because it's too complex and the problem is low priority.

Anyone writing software for long enough has a long list of these things in the back of their head that are great fodder for LLM training data.

by jmalicki 2 hours ago |

I wish our actual world wasn't an implausible scifi novel!

by jgalt212 2 hours ago |

Outside of games and coding, generating enough valid examples and counter-examples to harness the power of RL is cost-prohibitive.

by jmalicki 2 hours ago |

Which is why rubrics as rewards are used.

by jgalt212 8 minutes ago |

Still cost-prohibitive.

by mcphage 5 hours ago |

Where do they get the bespoke training data from? And how much? I don’t really know anything about this.

by jmalicki 3 hours ago |

> And how much?

Mercor, one of the larger vendors for contracting with experts to create bespoke data, says on their webpage they're paying $3M/day to their contractors for data.

So well into the billions of dollars a year for bespoke training data.

That's also ignoring the RLVR data labs can get from software - they can use the vibe coding sessions as training data as well without paying more.

They are just one of many.
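As a rough check on the arithmetic, here is a minimal sketch that annualizes the quoted $3M/day figure. It assumes the payout rate holds for all 365 days, which the source does not state; the function name `annualize` is made up for illustration. On that assumption one vendor alone comes to roughly $1.1 billion a year, so a multi-billion total would depend on the other vendors, whose figures are not given here.

```python
def annualize(daily_usd: int, days_per_year: int = 365) -> int:
    """Scale a daily payout to a yearly total, assuming a constant daily rate."""
    return daily_usd * days_per_year


# The $3M/day figure quoted above; treating it as constant all year is an assumption.
DAILY_PAYOUT_USD = 3_000_000

annual_usd = annualize(DAILY_PAYOUT_USD)
print(f"Annualized payout: ${annual_usd:,}")
```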

by blovescoffee 5 hours ago |

Companies like Mercor sell data from human experts

by trothamel 4 hours ago |

Offhand, do you know what format that data is in? Is it a question and then a human answering that question? Mostly just curious as to what the training data consists of.

by jmalicki 4 hours ago |

The most advanced training data is in the form of rubrics as rewards.

A human asks a question, then writes rubrics to judge the LLM's response, so rather than evaluating a specific response, those rubrics can live on as the LLM evolves and gives different answers. There are more complex variants as well, but that's the basic principle.

https://arxiv.org/abs/2507.17746
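To make the basic principle concrete, here is a minimal sketch of a rubric reused as a reward across different responses. It is an illustration, not the method from the linked paper: the `Criterion` class, the `rubric_reward` function, the weights, and the string checks are all hypothetical, and the simple string checks stand in for what would more plausibly be a human or model judge.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    description: str
    weight: float
    # Hypothetical stand-in for a human or LLM judge deciding pass/fail.
    check: Callable[[str], bool]


def rubric_reward(response: str, rubric: List[Criterion]) -> float:
    """Return the weighted fraction of criteria the response satisfies (0.0 to 1.0)."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric needs a positive total weight")
    earned = sum(c.weight for c in rubric if c.check(response))
    return earned / total


# Hypothetical rubric for the prompt "Which matches are scheduled today?"
rubric = [
    Criterion("Does not deny that matches are scheduled", 2.0,
              lambda r: "no matches" not in r.lower()),
    Criterion("Names at least one fixture", 1.0,
              lambda r: " vs " in r.lower()),
]

# The same rubric scores any response, including ones from later model versions.
for response in ("There are no matches today.",
                 "Several matches are scheduled today.",
                 "Today: Brazil vs Spain at 18:00."):
    print(f"{rubric_reward(response, rubric):.2f}  {response}")
```

The point of the design is that the rubric, not a single graded answer, is the reusable artifact: the score can be recomputed whenever the model's answer changes.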

by dominotw 4 hours ago |

Meta has reallocated a significant portion of its staff to generating this.

by sroussey 2 hours ago |

Meta also reportedly took a 49% nonvoting stake in Scale AI in June 2025 for about $14.3–$14.8 billion.

by dominotw 4 hours ago |

let me take down armchair analysis with my armchair analysis
