Like wikipedia probably provides a significant amount of training for LLMs. And that is volunteer and free. (And I love the idea of it.)
But I can imagine (for example) board game enthusiasts to maybe want to have training data for games they love. Not just rules but strategies.
Or, really, any other kind of hobby.
That stuff (I guess) gets in training data by virtue of being on chat groups, etc. But I feel like an organized system (like wikipedia) would be much better.
And if these sets were available, I would expect the foundation model trainers would love to include it. And the results would be better models for those very enthusiasts.
This matters because OSS truly depends on the reproducibility claim. "Open weights" borrows the legitimacy of open source (the assumption that scrutiny is possible, that no single actor has a moat, that iteration is democratised). Truly democratised iteration would crack open the training stack and let you generate intelligence from scratch.
Huge kudos to Addie and the team for this :)
I agree that open weight models should not be considered open source, but I also think the entire definition breaks down under the economics of LLMs.
Passive transparency: training data, technical report that tells you what the model learned and why it behaves the way it does. Useful for auditing, AI safety, interoperability.
Active transparency: being able to actually reproduce and augment the model. For that you need the training stack, curriculum, loss weighting decisions, hyperparameter search logs, synthetic data pipeline, RLHF/RLAIF methodology, reward model architecture, what behaviours were targeted and how success was measured, unpublished evals, known failure modes. The list goes on!
And then, a ton of training still depends on human labor - even at $2/h in exploitative bodyshops in Kenya [1], that still adds up to a significant financial investment in training datasets. And image training datasets are expensive to train as well - Google's reCAPTCHA used millions of hours of humans classifying which squares contained objects like cars or motorcycles.