the justification for not doing that is probably "prohibitively expensive given the amount of data involved". they'd need a bunch of human reviewers combing through massive troves of data. it's probably cheaper to "sort of fix" it after the fact.
> perhaps there's ways to bucket training data such that the model is aware of which data leans factual (quantifiable) and which data leans opinion (fuzzy, qualifiable)
as a lecturer once said to me about my idea for a master's dissertation project that would classify news sites based on right/left tendencies -- "that sounds dangerously political". especially given the current "let's all shout at each other" political climate.
aside: someone built this and it was a fully fledged company, which has always annoyed me.
Yeah, I concede that. It doesn't need to be done overnight, though. You could keep a static repo of data that you work through over time (years), removing some data and adding pre-curated data. After enough years you'd have a pretty good "reference dataset".
It's not, though, because the refutations are in the training data too. This isn't actually the problem being described.
The weights in the LLM are fine. It's that the task the LLM is being asked to do is to search and summarize new content that isn't in its training data. And it does it too much like a naive reader and not enough like a cynical HN commenter.
But that's a problem with prompt writing, not training. It's also of a piece with most of the other complaints about current AI solutions, really: AI still lacks the "context" that an experienced human is going to apply, so it doesn't know when it's supposed to reason and when it's supposed to repeat.
If you were to ask it "Is this site correct or is it just spin?" it would probably get it right. But it doesn't know to ask itself that question if it's not in the prompt somewhere.
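To make that concrete, here's a rough sketch of what "putting the question in the prompt" could look like. This is just an illustration, not how any particular product does it: `build_summary_prompt`, its parameters, and the extra instruction wording are all made up for the example, and it only builds the prompt string rather than calling any model.

```python
# Hypothetical helper: wraps fetched page text in a summarization prompt,
# optionally adding the skeptical question from the comment above.
SKEPTIC_QUESTION = "Is this site correct or is it just spin?"


def build_summary_prompt(page_text: str, source_url: str, skeptical: bool = True) -> str:
    """Return a prompt string; no model is called here."""
    parts = [
        f"Source: {source_url}",
        "Content:",
        page_text.strip(),
        "",
    ]
    if skeptical:
        # Assumption: asking the question explicitly nudges the model to
        # check the page against what it already knows before repeating it.
        parts.append(f"Before summarizing, answer this question: {SKEPTIC_QUESTION}")
        parts.append("Flag any claims that your background knowledge contradicts.")
    parts.append("Then summarize the content in a few sentences.")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_summary_prompt("Our product cures everything.", "https://example.com/press"))
```

With `skeptical=False` you get the "naive reader" prompt; the default adds the "cynical HN commenter" step before the summary.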
If it fails at that then it is a pretty significant problem. If, as you said earlier, "the refutations are in the training data too", then the LLM should in fact be able to use "both sides" and land with somewhat better confidence when presented with new data.
(Hopefully your point regarding prompting issues is resolved then.)
I was just refuting your contention that this is somehow inherent in the idea of "training", and it's not.