This seems overly pessimistic.

I may personally be of modest intelligence, but to acquire the intelligence that I do have, I did not need to train on every book ever written, every Wikipedia article ever written, every blog post ever written, every reference manual ever written, every line of code ever written, and so on. In fact, I didn't train on even 1% of those materials, or even 0.00000000001% of those. The texts themselves were demonstrably not a prerequisite for intelligence.

At minimum, given that it only took me about 20 years of casual observation of my surroundings to approximate intelligence, this is proof positive that the only "dataset" you need is a bunch of sensors and the world around you.

And yes, of course, the human brain does not start from zero; it had a few million years of evolution to produce a fertile plot for intelligence to take root. But that fundamental architecture is fairly generic, and does not at all seem predicated on any sort of specific training set. You could feasibly evolve it artificially.

reply
What does this even have to do with the parent? Your capabilities have nothing to do with LLM capabilities. The two work in completely different ways. The reason LLMs work is that they are huge and have been trained on vast amounts of data, full stop. Sure, there's potential to someday get something useful from less data, but we aren't there yet.
reply
You are right about the limitations of the architecture, but I wouldn't call LLMs huge. Flagship models maybe, but that's just because they don't scale very well.

A universal translator with image and voice recognition and a decent breadth of encyclopedic knowledge, at only a small fraction of the size of an English Wikipedia dump (6GB vs. 20+GB), is not "huge".

It is probably closer to the theoretical limit than anyone could have expected.

reply
You're also embodied and experiencing the world around you with more senses than only the ability to read text.
reply
> the only "dataset" you need is a bunch of sensors and the world around you.
reply
Not the whole thing, at least with current technology, but LoRAs are really good at fine-tuning and can be generated in a few hours on high-end gaming computers. So as long as the base model is in your language, you likely have enough spare computing power in whatever electronics you own to train a few LoRAs a month.

In the future, when regular home computers have the capabilities of modern servers, we'll be able to train the entire LLM at home.

reply
There is so much technology that we are unable to reproduce locally; I don't think LLMs are in any way different. There will be large LLM manufacturers, small LLM manufacturers, LLM artisans, LLM enthusiasts and of course LLM consumers, just like with everything else.
reply
And this is important because even though you are running a model locally, it's still a proprietary model. You have no say in what it was trained on, how that training data is labeled, what the guardrails are, what biases it might have, none of that.
reply
Depends on the domain. There are plenty of different use cases where the data needed for training is available for personal, or non-commercial, use. At that point, it does come down to compute and time for the training, and if you are willing to wait, consumer-grade hardware is perfectly capable of developing useful models.
reply
Can you make your own CPU, locally?
reply
That's a fair concern, but I'd separate training from inference here.
reply
That sounds like government. So your problem is mostly that you expect to have a collective social effort, but not enough to pay for it as a public good.
reply