I just wonder how unique these hashes will be if only 2 characters. It seems like the collision rate would be really high.
one mechanism we establish is that each model has a fidelity window, i.e., r tokens of content for s tag tokens; each tag token adds extra GUID-like marker capacity via its embedding vector; since 1,2,3 digit numbers only one token in top models, a single hash token lacks enough capacity & separation in latent space
we also show hash should be properly prefix-free, or unique symbols perp digit, e.g., if using A-K & L-Z to hash then A,R is legal hash whereas M,C is not permitted hash
we can do all this & more rather precisely as we show in our arXiv paper on same; next update goes deeper into group theory, info theory, etc. on boosting model recall, reasoning, tool calls, etc. by way of robust hashing
There many be many lines that are duplicates, eg “{“