Continuous learning allows past behavior and past inputs to influence future inputs and future behavior. In humans.

Attention over KV cache allows past behavior and past inputs to influence future inputs and future behavior. In LLMs.

Until the cache runs out, that is. But even then, you could totally use any of 9000 methods of cache compression, truncation, dropping or streaming and get away with it.
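To make the "truncation or streaming" option concrete, here is a rough sketch of one such policy: keep the first few entries plus a sliding window of the most recent ones, loosely in the spirit of attention-sink streaming. The class name `BoundedKVCache` and the parameters `n_sink` and `window` are made up for illustration, and the string keys and values stand in for real tensors.

```python
from collections import deque


class BoundedKVCache:
    """Hypothetical cache: keeps the first `n_sink` entries and the last `window` entries."""

    def __init__(self, n_sink, window):
        self.n_sink = n_sink
        self.sink = []                        # earliest entries, never evicted
        self.recent = deque(maxlen=window)    # sliding window over later entries

    def append(self, key, value):
        if len(self.sink) < self.n_sink:
            self.sink.append((key, value))
        else:
            # A full deque silently drops its oldest entry on append.
            self.recent.append((key, value))

    def entries(self):
        return self.sink + list(self.recent)


cache = BoundedKVCache(n_sink=2, window=3)
for t in range(10):
    cache.append(f"k{t}", f"v{t}")

kept = [k for k, _ in cache.entries()]
print(kept)
```

Everything between the sink and the window is forgotten outright, which is the capacity limit in question: the mechanism still works, it just remembers less.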

The difference between continuous learning and in-context learning seems to be in capacity, not in principle. Both are doing a similar thing, but one has more length and depth to it.

reply
Maybe, every night, you send the AI off to "sleep", where it uses those in-cache "memories" to update the long-term weights [1].

[1] https://www.pnas.org/doi/10.1073/pnas.2220275120

reply
This is really semantics, but I wouldn't call attending to the KV cache re-reading the context.

The model takes in the context, encodes it into a "memory" (the KV cache), and accesses that memory later. That fact doesn't change just because the KV cache grows in size with the context.
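A toy version of that encode-retrieve loop, as a sketch only: single-head scaled dot-product attention in plain Python, with keys and values stored as given. Real models produce keys, values and queries through learned projections; the names `KVMemory`, `encode` and `retrieve` are invented here for illustration.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over stored keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)                              # subtract max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]         # softmax over stored entries
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


class KVMemory:
    """Hypothetical memory: encode appends to the cache, retrieve attends over it."""

    def __init__(self):
        self.keys = []
        self.values = []

    def encode(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def retrieve(self, query):
        return attend(query, self.keys, self.values)


memory = KVMemory()
memory.encode([1.0, 0.0], [10.0, 0.0])
memory.encode([0.0, 1.0], [0.0, 10.0])

# A query aligned with the first key pulls back mostly the first value.
out = memory.retrieve([5.0, 0.0])
print(out)
```

Nothing in `retrieve` re-reads the original input; it only touches what `encode` stored, and that stays true however long the lists grow.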

I don't know what memory would look like other than an encode-retrieve loop.

Relevant: Transformers are Multi-State RNNs - https://arxiv.org/abs/2401.06104
