undefined

points

[-]

I feel this is a sort of inverse inspection paradox (the paradox that if you sample waiting time in a process, you’re more likely to sample a larger value).

The LLM providers fine tune the models with some kind of information retrieval tasks, but to do so you must provide some non relevant context to bootstrap the session for the long context tasks.

It would be very easy to do this in ways that train the sequence model to treat early history as noisier than it really is, or to weaken its relationship to late context.

You’re also probably stacking more contexts together with long contexts (start with task A, then detour to solving B and C before you can complete A).

Training sequence lengths probably decay super linearly with length creating far fewer samples at long length during training.