upvote
I asked this question a while back (the "only train w/ wikipedia LLM") and got pointed to the general-purpose "compression benchmarks" page: `https://www.mattmahoney.net/dc/text.html`

While I understand some of the fundamental thoughts behind that comparison, it's slightly wonky... I'm not asking "compress wikipedia really well", but instead "can a 'model' reason its way through wikipedia" (and what does that reasoning look like?).

Theoretically with wikipedia-multi-lang you should be able to reasonably nail machine-translation, but if everyone is starting with "only wikipedia" then how well can they keep up with the wild-web-trained models on similar bar chart per task performance?

If your particular training technique (using only wikipedia) can go from 60% of SOTA to 80% of SOTA on "Explain why 6-degrees of Kevin Bacon is relevant for tensor operations" (which is interesting to plug into Google's AI => Dive Deeper...), then that's a clue that it's not just throwing piles of data at the problem, but instead getting closer to extracting the deeper meaning (and/or reasoning!) that the data enables.

reply
Correct! I know RAG is a thing, but I wish we could have "DLCs" for LLMs like image generation has LoRa's which are cheaper to train for than retraining the entire model, and provide more output like what you want. I would love to pop in the CS "LoRa or DLC" and ask it about functional programming in Elixir, or whatever.

Maybe not crawl the web, but hit a service with pre-hosted, precurated content it can digest (and cache) that doesn't necessarily change often enough. You aren't using it for the latest news necessarily, but programming is mostly static knowledge a a good example.

reply