However, now that RL environments and long-horizon agentic performance have taken such a prominent role in model development, I wonder if that practice still holds. I know that the most recent Gemma and Qwen models are incomparably more reliable at long contexts than their predecessors, even though, e.g., Qwen already had a 256k context. It just didn’t work like it does now.
When you train, you teach the model, among other things, to ‘self-attend’ over the input vector, ultimately projecting that vector into a large embedding space.
Thought experiment: if 99% of the time the last 100,000 positions of your vector were zero, how likely is it that you’d end up with high-quality embeddings from doing gradient descent on those outputs?
That’s what the paper is referring to.
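To make that thought experiment concrete, here’s a toy sketch (my own illustration, not from the paper): a linear layer trained on inputs whose tail entries are zero 99% of the time. The weights tied to those rarely-active entries get almost no gradient and stay near their random init, which is the analogue of positions the model almost never sees during training.

```python
# Toy illustration (not from the paper): weights attached to rarely-active
# input positions barely move under gradient descent.
import numpy as np

rng = np.random.default_rng(0)
dim_in, dim_out, steps, lr = 1000, 32, 2000, 0.05

W = rng.normal(scale=0.01, size=(dim_in, dim_out))
W_init = W.copy()
true_W = rng.normal(size=(dim_in, dim_out))  # target mapping to learn

for _ in range(steps):
    x = rng.normal(size=dim_in)
    if rng.random() < 0.99:
        x[100:] = 0.0                          # 99% of samples: tail is zero
    err = x @ W - x @ true_W
    grad = np.outer(x, err)                    # dL/dW for squared error
    W -= (lr / dim_in) * grad                  # scaled gradient step

head_change = np.abs(W[:100] - W_init[:100]).mean()
tail_change = np.abs(W[100:] - W_init[100:]).mean()
print(f"mean |dW| on frequently-active positions: {head_change:.4f}")
print(f"mean |dW| on rarely-active positions:     {tail_change:.4f}")
```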
If your model only ever sees 8K-token samples during training, it won’t be as good at 128K context length as if you had trained on samples ranging from 8K to 128K.
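As a purely hypothetical sketch of what “training on samples from 8K to 128K” could look like (the actual length mixture any given lab uses isn’t public), a simple length sampler might be:

```python
# Hypothetical length-mixture sampler: mostly 8K samples, with a fraction of
# sequences drawn anywhere up to the full 128K context, so long positions
# actually show up during training.
import random

def sample_seq_len(min_len=8_192, max_len=131_072, long_frac=0.25):
    if random.random() < long_frac:
        return random.randint(min_len, max_len)  # occasional long sample
    return min_len                               # default 8K sample

print([sample_seq_len() for _ in range(10)])
```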
I noticed that the longer a chat gets, the more unpredictable the model's behavior becomes (and I think that's still a common jailbreak technique too).
(I think it might also have something to do with RoPE, but that's beyond me.)
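For anyone curious, here’s a rough sketch of what RoPE does (the standard formulation, not any specific model’s code): each pair of query/key dimensions is rotated by an angle proportional to the token position. Positions far beyond anything seen in training land on angle combinations the model never encountered, which is one plausible reason long contexts misbehave without long-sample training or RoPE scaling tricks.

```python
# Rough sketch of rotary position embeddings (RoPE), half-split variant.
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply RoPE to a single vector x (even dim) at integer position pos."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)  # per-pair rotation frequency
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

q = np.random.default_rng(0).normal(size=64)
print(rope(q, pos=100)[:4])      # position reachable in a short-context run
print(rope(q, pos=100_000)[:4])  # position only reachable at long context
```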
Or let's say it differently: the LLM gets trained on static data, but also on the capability of handling context itself.
Kimi introduced this: https://github.com/MoonshotAI/Attention-Residuals, but I'm pretty sure closed labs like Google have had something like this for a while.