Essentially, many people want to know what the minimum amount of memory is to run a particular model.
Parameter count obscures important details: what are the sizes of the parameters? A parameter isn't rigorously defined. This also gets folks into trouble because a 4B param model with FP16 params is very different from a 4B param model with INT4 params. The former obviously should be a LOT better than the latter.
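Back-of-the-envelope, assuming the weights dominate the footprint and ignoring activations and the KV cache, the number people actually want is just parameter count times bits per parameter:

```python
def weight_memory_gb(param_count: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights, in GB."""
    return param_count * bits_per_param / 8 / 1e9

# Two "4B parameter" models, 4x apart in footprint depending on dtype:
print(weight_memory_gb(4e9, 16))  # FP16: 8.0 GB
print(weight_memory_gb(4e9, 4))   # INT4: 2.0 GB
```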
This would also help with MOE models: if memory is my constraint, it doesn't matter if the (much larger RAM required) MOE version is faster or has better evals.
I'm waiting for someone to ship, in anger, the 1-parameter model where the parameter, according to PyTorch, is a single parameter of size 4GB.
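That degenerate case is easy enough to sketch, if you count nn.Parameter objects rather than elements (the class here is made up purely for illustration):

```python
import torch
import torch.nn as nn

class OneParamModel(nn.Module):
    def __init__(self):
        super().__init__()
        # A single Parameter object with 2**30 float32 elements = 4 GiB of weights.
        # (This really allocates the full 4 GiB.)
        self.w = nn.Parameter(torch.zeros(2**30))

m = OneParamModel()
print(len(list(m.parameters())))               # 1 Parameter object
print(sum(p.numel() for p in m.parameters()))  # 1,073,741,824 elements
```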
This doesn’t solve the capacity problem of memory. You can cram more into one context window, but you still need to associate that content with input queries, and that’s very hard because slight variations in input create hugely different activations. So it doesn’t really improve caching. This paper might do a thing or two approximating the compression limit for context windows, but there’s a fundamental limit on how much information can go into one. What you really need is contextual search: different events and objects with the same abstractions and semantics should lead to the same response, so you can cache effectively. On that front the paper does little to improve “memory” in a meaningful way.
https://jdsemrau.substack.com/p/tokenmaxxing-and-optimizing-...
> Prioritize recall over precision.
Have you tried stemming your regex? That would help you catch messages where a different form of your word appeared. For example instead of “story” you look for “stor” which catches “stories” as well.
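A minimal sketch of that, assuming plain prefix matching rather than a real stemmer ("stor" is just the example stem from above):

```python
import re

# Anchor on the stem so "story", "stories", "storyline", etc. all match.
pattern = re.compile(r"\bstor\w*", re.IGNORECASE)

messages = ["She told two stories.", "Great STORY!", "Nothing relevant here."]
hits = [m for m in messages if pattern.search(m)]
print(hits)  # ['She told two stories.', 'Great STORY!']
```

The tradeoff is precision: the same pattern also catches "storm" and "storage", which fits the "prioritize recall" advice above.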
Then you might think, could we do an even better job by figuring out the general semantic intent of the query and history? Let’s project them into a semantic vector space! That’s an embedding.
Then you want to query that, which means you need a vector database. So now we can take the query, embed it, query the vector DB with that embedding and retrieve the N closest history documents. You can use that to augment the generation of the response to your prompt.
This is RAG.
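A minimal sketch of that loop, using a toy hashed bag-of-words function as a stand-in for a real embedding model and a plain in-memory array instead of a vector DB (all names here are made up for illustration):

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a real embedding model, just so this runs end to end."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

history = [
    "we agreed the villain returns in chapter three",
    "lunch plans for friday",
    "notes on the story arc pacing",
]
index = np.stack([embed(doc) for doc in history])  # (num_docs, dim)

def retrieve(query: str, n: int = 2) -> list[str]:
    q = embed(query)
    sims = index @ q  # cosine similarity, since every vector is unit-normalized
    return [history[i] for i in np.argsort(-sims)[:n]]

# The N closest history documents get prepended to the prompt to augment generation.
context = "\n".join(retrieve("what did we decide about the story arc?"))
```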
Anyway, interesting to see different degrees of sophistication here. Certainly a handful of naive regexes are very snappy.
There’s probably a hybrid approach where you use sophisticated NLP and embedding techniques to robustly define topics, then train a regex to approximate that well.
A Hebbian associative matrix (one of the simplest and weakest memory constructions) can store about 0.7 bits of information per parameter. If you have a state with 300M parameters (the size of a Llama 3 8B KV cache at 10K context length), and a context with 2.1 bits of entropy per token (a reasonable estimate), then the state can encode 100M tokens worth of information.
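The arithmetic, using the estimates above:

```python
bits_per_param = 0.7    # Hebbian associative memory capacity estimate
state_params = 300e6    # ~Llama 3 8B KV cache at 10K context length
bits_per_token = 2.1    # entropy estimate for natural-language tokens

capacity_tokens = state_params * bits_per_param / bits_per_token
print(f"{capacity_tokens:.0e} tokens")  # 1e+08, i.e. ~100M tokens
```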
Real models obviously aren't powerful enough to operate at the limit, but you can see why this is a promising research direction.
Surely with (much less than) 300K pages you could describe every meaningful detail of a video series' plot. You don't need to remember the exact pixel values.
You can also scale the numbers up. I specifically chose a relatively small model and short context length as a reference, so 100x bigger is not out of the question. At that point, with a 10B token capacity, you are looking at all of English Wikipedia in a single state.
I'm more on team small tasks because of my love of unix piping. I keep telling folks that, as an old Linux dude, seeing subagents work together for the first time felt like learning to pipe sed and awk for the first time. I realized how powerful these could be, and we still seem to be going in that direction.
As you hit the limits and try to compact the context, etc., things get more erratic.
You can try to summarize memories tersely and point the agent to longer markdown files, but who knows whether it will read them at the right time, and only then.
- a fixed memory size seems like a good idea to overcome the current limitations
- skimming through the thing, I can't find any mention of the cost?
- I would need more time to read it in depth to see if this is legitimate and not just a fancy form of overfitting or training on test data
Nothing super novel or groundbreaking, but a moderately interesting read.
Is there a lowercase-to-uppercase conversion going on here?
What I want to see is something that was tested and proved in practice to be genuinely useful, especially for coding agents.
Beads kind of does "LLM memory over CLI", or there is https://github.com/wedow/ticket which is a minimal and sane implementation of the same idea.
I haven't measured, but documenting bug fixes and architecture seems to help, along with TDD patterns, including integration tests.
I would probably add it to Claude.md to look for all of the above when tackling a new bug.
While you can document everything and use git history, I think that short entries in a kind of memory, recording past decisions and how issues were solved, would be much more token-efficient than reading lots of documentation and digging through git history and past code.
(Obviously ignoring the huge energy saver, which is to check whether you even need to bother doing the task at all.)
My theory was that if an agent burns 30 minutes resolving an issue not present in training data, posting the solution would prevent other agents from re-treading the same thinking steps.
This has the benefit of it knowing all of the arcane flags, especially for formatting output.