Small models were originally built by distilling from much larger models, training on synthetic data those models generated, and using them to filter the training material. There is a bit of a bootstrapping problem: to build a good LLM you need a working LLM, and if you don't have one, the costs are absolutely eye-watering.
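
As a rough illustration of the distillation part (a sketch of the general technique, not how any particular lab did it): the small "student" model is trained to match the larger "teacher" model's output distribution rather than only hard next-token labels. The function name, temperature, and toy tensors below are all illustrative:

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soften both distributions with a temperature, then minimize the KL
        # divergence from the teacher's distribution to the student's.
        student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
        teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
        # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
        return F.kl_div(student_log_probs, teacher_probs,
                        reduction="batchmean") * temperature ** 2

    # Toy usage: random logits for 4 token positions over a 10-token vocabulary.
    student = torch.randn(4, 10)
    teacher = torch.randn(4, 10)
    loss = distillation_loss(student, teacher)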

One observation is that an LLM is a next-token predictor, but if you train it on the internet, textbooks, etc., you get a predictor of that text, which isn't the behavior we actually want. None of those sources tend to contain "Solve this problem for me. OK, here is the solution:".
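
For concreteness, here is a hypothetical sketch of the difference. Raw pretraining text just continues on its own terms, while an instruction-tuning record explicitly pairs a request with a response; the field names and helper below are made up for illustration, not any specific dataset's schema:

    # Typical pretraining text: prose that a next-token predictor learns to continue.
    raw_pretraining_text = (
        "Quicksort partitions an array around a pivot element and recursively "
        "sorts the two halves..."
    )

    # Instruction-tuning record: an explicit request paired with an explicit answer.
    instruction_example = {
        "prompt": "Solve this problem for me: sort [3, 1, 2] in ascending order.",
        "response": "OK, here is the solution: [1, 2, 3].",
    }

    def to_training_text(example):
        # Concatenate prompt and response; during fine-tuning the loss is
        # typically applied only to the response tokens.
        return example["prompt"] + "\n" + example["response"]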

It wasn't physically impossible to start GNU the other way around, by bashing machine code into a system until you had a working operating system. But doing so would have been far less reasonable: much more expensive, much slower progress, and so on.
