Doesn’t the performance gap between quantized and full-precision models indicate that, even if you aren’t using it directly, the model knowing the colors of the Russian flag does have something to do with the intelligence you demand?
What that compression costs you is almost certainly, in part, specific knowledge getting fuzzed.
Likewise, LLMs do not violate the laws of information theory. The only way to encode X bits of information in Y bits of storage, where X > Y, is to perform what is effectively lossy compression, and as X grows larger relative to Y, the compression must discard ever more information.
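To put a number on that counting argument, here is a minimal sketch. It assumes the worst case, where every X-bit message is distinct and equally worth keeping; real training text is highly redundant, so this is only an upper bound on what fits and not a claim about any actual model. The function name `max_recoverable_fraction` is made up for illustration.

```python
from fractions import Fraction


def max_recoverable_fraction(x_bits: int, y_bits: int) -> Fraction:
    """Upper bound on the share of distinct x-bit messages that a store
    of y bits can tell apart (pigeonhole argument, worst case only)."""
    messages = 2 ** x_bits   # number of distinct inputs
    states = 2 ** y_bits     # number of distinct states the store can be in
    # Each state can stand for at most one message exactly, so at most
    # `states` messages survive; the rest must collide and lose detail.
    return min(Fraction(1), Fraction(states, messages))


if __name__ == "__main__":
    for x, y in [(8, 16), (10, 9), (16, 8)]:
        print(f"X={x} bits, Y={y} bits -> at most {max_recoverable_fraction(x, y)}")
```

Each extra bit of X over Y halves the bound, which is the sense in which the loss has to grow as X outpaces Y.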
Yes, for the sake of making chatbots that are "conversational", in that they can interpret natural language as input and produce code as output, you can easily benefit in incidental and unintuitive ways by training them on more natural-language text. But for a fixed parameter count, it's possible to produce a better model for a specific task by selectively not muddying its training set in the first place with things that are likely irrelevant to the task.
It's hardly self-evident, and your counter-example is hardly applicable.
The first 10^50 digits of pi are not the same as having BREADTH of information in the training data, which is the whole point. It's not about just any random "information that is irrelevant to your use case".
Not to mention that the first 10^50 digits of pi compress to quite a small formula, so there's not much information there to begin with from a Shannon/Kolmogorov perspective.
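A rough sketch of what "compress to a small formula" means here, in the Kolmogorov sense: a short program whose length does not grow with the number of digits it emits. This one uses Machin's formula with integer arithmetic; the function names and the 10 guard digits are my own choices, and actually running it out to 10^50 digits is of course not feasible, which doesn't matter for description length.

```python
def arctan_inv(x: int, unity: int) -> int:
    """Return arctan(1/x) scaled by `unity`, via the alternating Taylor series."""
    term = unity // x          # first term: 1/x
    total = term
    x2 = x * x
    n = 3
    sign = -1
    while term:
        term //= x2            # next odd power of 1/x
        total += sign * (term // n)
        sign = -sign
        n += 2
    return total


def pi_digits(n: int) -> str:
    """Return '3' followed by the first n decimal digits of pi, as a string."""
    guard = 10                 # extra digits to absorb truncation error
    unity = 10 ** (n + guard)
    # Machin: pi/4 = 4*arctan(1/5) - arctan(1/239)
    pi_scaled = 4 * (4 * arctan_inv(5, unity) - arctan_inv(239, unity))
    return str(pi_scaled // 10 ** guard)


if __name__ == "__main__":
    print(pi_digits(50))
```

The source stays the same couple of dozen lines whether you ask for 50 digits or 50,000; only the argument `n` changes.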
Memorizing, say, 100,000 world facts through training texts, which enriches the model's associations all around, is absolutely not the same as rote memorization of 10^50 digits of pi. Not for a human, and even more so, not for an LLM.
An LLM trained on digits of pi and one trained on books and posts, even if they both get the exact same number of bytes of training input, would not be comparable in any way in utility or reasoning capabilities.
>There's an infinite amount of information that we could shove into a model, and a finite amount of bits with which to store any of that information such that it can be usefully recalled or form useful logical associations.
Which is irrelevant. Anyway, the amount of information that doesn't form useful logical associations is even larger (e.g. actual human books vs. all possible permutations of characters and spaces). Just as those random permutations of characters make poor LLM input if you want logical associations out of it, so do the digits of pi (logical associations of the kind we care about and expect, not of the kind related to pi's sequences).
Also, it's not only not self-evident, it's apparently wrong.
You're making the assumption that anything produced by a human necessarily contains more useful information than random noise does. This is false. Even when only considering human intelligence, it's entirely possible to absorb information that makes you stupider, not smarter; learning is only valuable if you actually learn the right things.
I'd say this exchange is a fine example of that :)
We don’t understand AI or natural intelligence well enough to make such statements. As for self-evidence, cross-domain competence in humans and the rise of generalist models over domain-specific ones (on competence, not cost) seem to pretty directly tank your hypothesis.
If you believe this then you don't understand AI or natural intelligence well enough to refute my statements either.
Perhaps you're referring to something specific by "cross-domain competence". But firstly, humans vastly overestimate the extent to which experts in one domain can be trusted to speak accurately on topics in other domains (this is a form of authority bias). Secondly, real cross-domain expertise is a result of pre-existing metacognitive abilities such as keen reasoning, intense focus, and learning how to learn. In other words, Leonardo da Vinci was not a genius because he was a polymath; he was a polymath because he was a genius.
Likewise, I see no evidence that "generalist models" have proven anything about their ability over domain-specific ones, other than that the big AI firms seem to believe that "generalist models" are their golden ticket to AGI and therefore a quintillion-dollar valuation. It's obvious that, in the long run, tools built for a specific task will outperform generalist tools at that task, in the same way that a multi-axis CNC mill does not outperform your bog-standard lathe for shaping objects with rotational symmetry, or perhaps more pertinently to this conversation, how no LLM will ever outperform Stockfish at chess.