It forces you to read after every write. E.g., you edit line 15 so that it becomes two lines. Now you either need arithmetic to adjust every later line number (earlier ones stay the same), or you need to reread the full file to reindex by line number.
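To make the renumbering cost concrete, here is a minimal sketch of the arithmetic a line-number scheme needs after each edit. The function name `remap_line` and its parameters are hypothetical, chosen only for illustration.

```python
from typing import Optional


def remap_line(line: int, edit_start: int, old_count: int, new_count: int) -> Optional[int]:
    """Map a pre-edit, 1-indexed line number to its post-edit number.

    The edit replaces `old_count` lines starting at `edit_start` with
    `new_count` lines. Returns None if `line` was inside the replaced span.
    """
    edit_end = edit_start + old_count - 1
    if line < edit_start:
        return line          # lines before the edit keep their number
    if line <= edit_end:
        return None          # the line itself was rewritten
    return line + (new_count - old_count)  # later lines shift by the delta


# The example from the comment: line 15 becomes two lines.
print(remap_line(10, 15, 1, 2))
print(remap_line(20, 15, 1, 2))
```

Every pending reference to a later line has to go through a step like this, or the file has to be reread.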

by azinman2 7 hours ago

Good point!

I just wonder how unique these hashes will be if they are only 2 characters long. It seems like the collision rate would be really high.
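A rough way to estimate that collision rate is the birthday-problem calculation. The sketch below assumes tags are drawn uniformly at random and assumes a 36-symbol alphabet (so 36 * 36 = 1296 two-character tags); neither assumption comes from the article, and `collision_probability` is a hypothetical helper.

```python
def collision_probability(num_lines: int, num_tags: int) -> float:
    """Probability that at least two of `num_lines` lines share a tag,
    assuming each line gets a uniformly random tag out of `num_tags`."""
    if num_lines > num_tags:
        return 1.0  # pigeonhole: more lines than tags guarantees a collision
    p_all_unique = 1.0
    for i in range(num_lines):
        p_all_unique *= (num_tags - i) / num_tags
    return 1.0 - p_all_unique


# Assumed alphabet of 36 symbols, 2 characters per tag.
two_char_tags = 36 ** 2
for n in (10, 50, 200):
    print(n, round(collision_probability(n, two_char_tags), 3))
```

Under these assumptions a 50-line file already has a better than even chance of at least one repeated tag.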

by aghilmort 5 hours ago

we dug into those sorts of questions with hypertokens, a robust hash for lines, code, tables/rows or any in-context token tagging to give models photographic memory

one mechanism we establish is that each model has a fidelity window, i.e., r tokens of content for s tag tokens; each tag token adds extra GUID-like marker capacity via its embedding vector; since 1-, 2-, and 3-digit numbers are each only one token in top models, a single hash token lacks enough capacity & separation in latent space

we also show the hash should be properly prefix-free, or use unique symbols per digit, e.g., if using A-K for the first symbol & L-Z for the second, then A,R is a legal hash whereas M,C is not a permitted hash

we can do all this & more rather precisely as we show in our arXiv paper on same; next update goes deeper into group theory, info theory, etc. on boosting model recall, reasoning, tool calls, etc. by way of robust hashing
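The "unique symbols per digit" rule described above can be sketched as a small validator. This is only an illustration of the A-K / L-Z example from the comment, not the scheme from the paper; `is_legal_tag` and the two constants are hypothetical names.

```python
# Disjoint alphabets per position, following the A-K / L-Z example.
FIRST_SYMBOLS = set("ABCDEFGHIJK")        # allowed in position 1
SECOND_SYMBOLS = set("LMNOPQRSTUVWXYZ")   # allowed in position 2


def is_legal_tag(tag: str) -> bool:
    """Return True if `tag` is two symbols, each from its position's alphabet."""
    return (
        len(tag) == 2
        and tag[0] in FIRST_SYMBOLS
        and tag[1] in SECOND_SYMBOLS
    )


print(is_legal_tag("AR"))  # A is in A-K, R is in L-Z
print(is_legal_tag("MC"))  # M is not in A-K, C is not in L-Z
```

Because the two alphabets do not overlap, a symbol seen in isolation already reveals which position it occupies. The cost is capacity: 11 * 15 = 165 tags instead of 26 * 26.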

by pbowyer 2 hours ago

For others, here's the paper: https://arxiv.org/abs/2507.00002

by MrGreenTea 4 hours ago

The author writes that these hashes are 2 or 3 characters long, I assume depending on the line count. That's good for almost 48k lines. If a file is longer than that, you have other issues.
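The "almost 48k" figure is consistent with a 36-symbol alphabet (for example lowercase letters plus digits) when 2- and 3-character tags are counted together. The alphabet size is an assumption here, not something stated in the thread, and `tag_capacity` is a hypothetical helper.

```python
def tag_capacity(alphabet_size: int, lengths: list[int]) -> int:
    """Number of distinct tags available across the given tag lengths."""
    return sum(alphabet_size ** length for length in lengths)


# Assumed: 36 symbols, tags of length 2 or 3.
print(tag_capacity(36, [2]))     # 1296
print(tag_capacity(36, [2, 3]))  # 1296 + 46656 = 47952
```

Note that this is the number of distinct tags, which only equals the number of addressable lines if no two lines share a tag.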

by azinman2 4 hours ago

But if it’s a hash vs a line number, then we can collide much more easily.

There may be many lines that are duplicates, e.g. “{”
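A quick sketch of the duplicate-line concern: if the tag depends only on the line's content, identical lines always receive identical tags, regardless of tag length. SHA-256 is used here purely as a stand-in hash; the article's actual hashing scheme is not described in this thread, and `content_tag` and `ambiguous_tags` are hypothetical names.

```python
import hashlib
from collections import Counter


def content_tag(line: str, length: int = 2) -> str:
    """Short tag derived only from the line's text (stand-in hash: SHA-256)."""
    return hashlib.sha256(line.encode("utf-8")).hexdigest()[:length]


def ambiguous_tags(lines: list[str]) -> set[str]:
    """Tags that are shared by more than one line in the file."""
    counts = Counter(content_tag(line) for line in lines)
    return {tag for tag, count in counts.items() if count > 1}


source = ["{", "x = 1", "{", "}"]
tags = [content_tag(line) for line in source]
print(tags[0] == tags[2])       # both "{" lines get the same tag
print(ambiguous_tags(source))
```

Longer tags reduce accidental collisions between different lines, but they do nothing for exact duplicates like these.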

by giancarlostoro 7 hours ago

I was wondering the same thing.