
by jgreid 3 hours ago


by kylemaxwell 3 hours ago

From the abstract, it looks like it's actually doing something deeper, updating weights in part of the model?

by samsartor 15 minutes ago

The abstract and method sections only mention updating the SSM state during "sleep" (i.e. the same vectors that change after each token in stock Mamba), not any of the actual weight matrices. AFAICT this is just another attention compaction paper with a misleading title? It is not very clearly written.
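
To make the state-versus-weights distinction concrete, here is a toy diagonal linear recurrence. It is only a sketch and not Mamba itself (real Mamba makes some of its parameters input-dependent); the names `A`, `B`, `C`, `step`, and `run` and all the numbers are made up for illustration. The point is that the state `h` changes on every token while the parameters never do.

```python
# Toy diagonal linear state-space recurrence (illustrative only, not Mamba).
# Fixed parameters, i.e. the "weights": nothing below ever modifies them.
A = [0.9, 0.5, 0.7]   # per-channel state decay
B = [1.0, -1.0, 0.5]  # input projection
C = [0.2, 0.4, 0.6]   # output projection


def step(h, x):
    """Consume one scalar input x; return the new state and the output."""
    h_new = [a * h_i + b * x for a, h_i, b in zip(A, h, B)]
    y = sum(c * h_i for c, h_i in zip(C, h_new))
    return h_new, y


def run(tokens, h=None):
    """Feed a sequence through the recurrence and return the final state."""
    h = [0.0] * len(A) if h is None else h
    for x in tokens:
        h, _ = step(h, x)
    return h


if __name__ == "__main__":
    params_before = (list(A), list(B), list(C))
    final_state = run([1.0, -0.5, 2.0])
    print("final state:", final_state)
    print("parameters unchanged:", (A, B, C) == params_before)
```

Under this reading, a "sleep" step that only rewrites `h` would be a state update, not training in the usual sense.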

by colechristensen 2 hours ago

No, they're actually training weights based on context before compaction. Context is context; this is splitting the model into persistent weights and malleable ones that are periodically updated.

by delis-thumbs-7e 2 hours ago

Wouldn’t that be extremely computationally expensive, considering how resource-intensive training is?

by colechristensen 2 hours ago

No, training a state-of-the-art model involves training on the order of 10 trillion tokens.

We're talking about a step that updates weights based on, say, between 10k and 1M tokens.
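
As a rough back-of-the-envelope check on those two figures, the sketch below just divides one by the other. It assumes cost scales linearly with the number of tokens trained on, which is a simplification (it ignores how many weights are updated, sequence length effects, and repeated passes); the constant and function names are made up for illustration.

```python
# Back-of-the-envelope comparison using the token counts quoted above.
# Assumption: training cost is proportional to the number of tokens.
PRETRAINING_TOKENS = 10 * 10**12  # "on the order of 10 trillion tokens"
UPDATE_TOKENS_LOW = 10_000        # "10k"
UPDATE_TOKENS_HIGH = 1_000_000    # "1M"


def pretraining_multiple(update_tokens: int) -> float:
    """How many times larger the pretraining run is than one update step."""
    return PRETRAINING_TOKENS / update_tokens


if __name__ == "__main__":
    for n in (UPDATE_TOKENS_LOW, UPDATE_TOKENS_HIGH):
        multiple = pretraining_multiple(n)
        print(f"{n:>9,} update tokens: pretraining is about {multiple:.0e}x larger")
```

So under that assumption the update step is somewhere between ten million and a billion times smaller than the original training run.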

by delis-thumbs-7e 2 hours ago

I learned something. Thank you!