(www.mixedbread.com)
Saying "Near lossless" to mean 90% accurate retrieval of saved vectors is simply a lie. Lossy-ness is binary, not something you can paper over with getting close enough. And 90% is not close. Sure, LLMs are all about gradient descent on noisy data sets so I guess this is acceptable in this field but that terminology usage still bothered me
I could imagine a scenario where differences tend to be more substantive than you'd expect because of how less frequent words with fine distinctions in meaning - the very words that make the document special - may be embedded in the vector space.
I have also been working in compression and performance engineering, and managed to get a 99+% compression unlock versus conventional approaches (100+KB down to 1KB) in the scenario of 30 minute massive multiplayer game replays for a “game+engine” I’m developing
I think there’s a synergy between these 2 concepts I’d love to chat some more
In principle, binary x binary should be pretty fast since it just requires bitwise XNOR and popcount/reduction, but in practice it's slow unless you've really optimized it. And, as stated in the article, you'd still be losing a lot of accuracy that way.
sure
> if you treat it in a binary way where everything short of 100 falls into one "lossy" bucket you lose all the practical differences that make one encoding much better than another.
no; lossless is an inherently binary term. and I don't lose all the practical differences of better lossy encoders by understanding that; I'm not just going to start using mp3 96k because I have an understanding of lossless vs lossy encoders...
Lossless is an objectively binary term.
That typo up there is kind of endearing in the AI slop era.