Can you elaborate on this? I've seen estimates of ~1.5 bits per English letter, and tokens encode a lot more than that - sometimes full words, with multimodal even more. If KV cache embeddings are storing not just simple tokens but entire concepts with context and nuance, that'll bump the entropy up quite quickly.
The reference I always go back to is the GPT-3 paper. The cross-entropy loss (an upper bound for entropy) got down to 1.75 nats (2.5 bits). I took 2.1 because 2.5 is an upper bound and I wanted the estimate to end up as a round number.
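For the unit conversion, nats to bits is just a change of logarithm base; a minimal sketch of the arithmetic:

```python
import math

# Convert a cross-entropy loss reported in nats to bits (divide by ln 2).
loss_nats = 1.75
loss_bits = loss_nats / math.log(2)

print(f"{loss_nats} nats ≈ {loss_bits:.2f} bits")  # ≈ 2.52 bits
```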
> If KV cache embeddings are storing not just simple tokens but entire concepts with context and nuance, that'll bump the entropy up quite quickly.
Here's the thing: the concepts the model stores in the KV cache are a deterministic function of the input tokens. As with the data processing inequality, that means no entropy is actually added.
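To spell out that step (my notation, not anything from the thread): write T for the token sequence and K = f(T) for the cache contents; then

```latex
K = f(T) \;\Rightarrow\; H(K \mid T) = 0
\;\Rightarrow\; H(K) \le H(T, K) = H(T) + H(K \mid T) = H(T)
```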
Looking at it mechanically, a sufficiently powerful model only needs to encode the tokens and can recompute concepts later as needed.
Surely with (much less than) 300K pages you could describe every meaningful detail of a video series' plot. You don't need to remember the exact pixel values.
You can also scale the numbers up. I specifically chose a relatively small model and short context length as a reference, so 100x bigger is not out of the question. At that point, with a 10B token capacity, you are looking at all of English Wikipedia in a single state.
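As a rough sanity check on that last claim (the Wikipedia size and tokens-per-word figures below are my own order-of-magnitude assumptions, not numbers from the thread):

```python
# Rough sanity check: does English Wikipedia fit in a 10B-token state?
bits_per_token  = 2.1        # entropy estimate used above
state_capacity  = 10e9       # 10B tokens of capacity

wiki_words      = 4.5e9      # assumed: English Wikipedia word count, order of magnitude
tokens_per_word = 1.3        # assumed: typical BPE tokens per English word
wiki_tokens     = wiki_words * tokens_per_word

print(f"Wikipedia ≈ {wiki_tokens / 1e9:.1f}B tokens "
      f"≈ {wiki_tokens * bits_per_token / 8 / 1e9:.1f} GB of information")
print(f"Fits in a 10B-token state: {wiki_tokens <= state_capacity}")
```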
I'm more on team small tasks because of my love of Unix piping. I keep telling folks that, as an old Linux dude, seeing subagents work together for the first time felt like learning to pipe sed and awk for the first time. I realized how powerful these could be, and we still seem to be heading in that direction.