Agreed, and point two is the tricky one. Writing a list of tasks is easy; evaluating them is not. You need a consistent task set, a clean-slate control (Claude Code without memory is the proper baseline), and evaluation criteria that distinguish "uses fewer tokens" from "produces better results." Otherwise you end up with vendors grading their own work.

I'm currently building a repeatable test harness for PMB: fixed tasks, run with and without memory, repeated N times, reporting tokens, turns, and pass/fail, plus a subjective quality score. I'd be happy to share the task set and evaluation criteria so they can be run against anyone else's memory server or clean-slate control, not just mine.
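As a rough illustration of that design (not PMB's actual harness), here is a minimal Python sketch. The `RunResult` fields, the `Agent` signature, and the `compare` helper are all hypothetical names chosen for this example; a real harness would wrap whatever actually launches the agent and grades its output.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class RunResult:
    tokens: int     # total tokens consumed by the run
    turns: int      # number of agent turns
    passed: bool    # objective pass/fail check for the task
    quality: float  # subjective score (e.g. 1-5), ideally from a blind rater


# Hypothetical agent interface: (task, use_memory) -> RunResult
Agent = Callable[[str, bool], RunResult]


def summarize(runs: List[RunResult]) -> Dict[str, float]:
    """Aggregate repeated runs of one task in one arm."""
    return {
        "mean_tokens": mean(r.tokens for r in runs),
        "mean_turns": mean(r.turns for r in runs),
        "pass_rate": sum(r.passed for r in runs) / len(runs),
        "mean_quality": mean(r.quality for r in runs),
    }


def compare(agent: Agent, tasks: List[str], n: int) -> Dict[str, Dict[str, Dict[str, float]]]:
    """Run every task n times with memory and n times without (the control)."""
    report = {}
    for task in tasks:
        report[task] = {
            arm: summarize([agent(task, use_memory) for _ in range(n)])
            for arm, use_memory in (("memory", True), ("control", False))
        }
    return report
```

Keeping token counts and pass rate as separate columns is the point: an arm that uses fewer tokens but passes less often should not look like a win.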

reply
Every time I see these memory agents, all I can think about is context bloat and poisoning. We know human memory has an analogous problem: to "remember" something significant, the brain reconstructs the entire experience, which is why human memories are so easy to influence.

That seems to be what most of these systems are doing: amplifying errors and hallucinations more than anything else.

reply
It's a legitimate worry, but I'd split it into two questions, since bloat and poisoning have different solutions.

Bloat: PMB does not inject the store into the context. It retrieves a small top-k set of relevant snippets for each task, normally a few hundred tokens, not an ever-growing dump of the store.
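To make the bounded-injection idea concrete, here is a small sketch of top-k retrieval under a token budget. It is an illustration only: the word-overlap `score`, the whitespace token estimate, and the `retrieve` signature are stand-ins assumed for this example, not how PMB ranks or counts anything.

```python
from typing import List


def score(query: str, snippet: str) -> float:
    """Toy relevance: fraction of query words that appear in the snippet."""
    q = set(query.lower().split())
    s = set(snippet.lower().split())
    return len(q & s) / len(q) if q else 0.0


def retrieve(query: str, store: List[str], k: int = 3, token_budget: int = 300) -> List[str]:
    """Return at most k relevant snippets whose combined size fits the budget."""
    ranked = sorted(store, key=lambda s: score(query, s), reverse=True)
    picked: List[str] = []
    used = 0
    for snippet in ranked[:k]:
        cost = len(snippet.split())  # crude token estimate: whitespace-separated words
        if score(query, snippet) == 0.0 or used + cost > token_budget:
            continue  # skip irrelevant or over-budget snippets
        picked.append(snippet)
        used += cost
    return picked
```

However large `store` grows, the injected context is capped by `k` and `token_budget`, which is the property that separates retrieval from dumping the store.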

Poisoning is the more interesting one, and your reconstruction example is exactly why PMB has no LLM on its read side. Human memory, and indeed any mechanism that has a model paraphrase information while recalling it, reconstructs the information anew each time, and that is what makes it susceptible to manipulation and hallucination.

What it cannot do is fix garbage in: a lesson learned wrongly is recalled faithfully. The mitigations are that everything is stored verbatim, source-stamped (who, when, which session), de-duplicated, recency-decayed, and correctable, and all of it is shown on a dashboard, so a recall error is detectable and auditable rather than a silent reconstructive drift. Conflict/supersession detection is the next build.
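For readers who want to picture those mitigations, here is a minimal sketch of a verbatim store with provenance, de-duplication, recency decay, and manual correction. Everything in it is an assumption made for illustration (the `Memory` fields, hashing normalized text for de-duplication, an exponential half-life for decay); it is not PMB's schema, and it deliberately omits automatic conflict detection.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Memory:
    text: str                            # stored and returned verbatim, never paraphrased
    author: str                          # provenance: who
    session: str                         # provenance: which session
    created_at: float                    # provenance: when (unix seconds)
    superseded_by: Optional[str] = None  # key of the correcting entry, if any


class Store:
    def __init__(self, half_life_days: float = 30.0):
        self.items: Dict[str, Memory] = {}
        self.half_life = half_life_days * 86400.0

    @staticmethod
    def key(text: str) -> str:
        """De-dup key: hash of whitespace-trimmed, lowercased text."""
        return hashlib.sha256(text.strip().lower().encode()).hexdigest()

    def add(self, text: str, author: str, session: str, now: float) -> str:
        k = self.key(text)
        # A duplicate keeps the first entry and its original provenance.
        self.items.setdefault(k, Memory(text, author, session, now))
        return k

    def correct(self, old_key: str, new_text: str, author: str, session: str, now: float) -> str:
        """Manually supersede a wrong entry; the old one stays for auditing."""
        new_key = self.add(new_text, author, session, now)
        if new_key != old_key:
            self.items[old_key].superseded_by = new_key
        return new_key

    def weight(self, k: str, now: float) -> float:
        """Recency decay: weight halves every half-life."""
        age = max(0.0, now - self.items[k].created_at)
        return 0.5 ** (age / self.half_life)

    def recall(self, now: float) -> List[Memory]:
        """Live entries only, freshest first, text untouched."""
        live = [k for k, m in self.items.items() if m.superseded_by is None]
        live.sort(key=lambda k: self.weight(k, now), reverse=True)
        return [self.items[k] for k in live]
```

Because superseded entries are hidden from `recall` but never deleted, a bad lesson and its correction both remain inspectable, which is what makes the error auditable.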
