Efficient GPU execution appears to have been one of the authors' specific aims. Table 2 of their paper shows real-world performance numbers that, at a glance, look compatible with inference workloads.
This is not an LLM inference result. Table 2 is the part I find most questionable. Claiming orders-of-magnitude improvements in vector search over standard methods is an extraordinary claim. If it actually held up in practice, I would have expected to see independent reproductions or real-world adoption by now. It’s been about a year since the paper came out, and I haven’t seen much of either. That doesn’t prove the claim is false, but it certainly doesn’t inspire confidence.
They confirmed the accuracy on NIAH but didn't reproduce the claimed 8x efficiency gain.
Classic academic move. If the authors show accuracy-vs-space charts but hide end-to-end latency, it usually means their code is slower in practice than vanilla fp16 without any compression. Polar coordinates are absolute poison for parallel GPU compute.
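For what it's worth, here's a toy NumPy sketch (my own illustration, not from the paper) of the kind of overhead polar storage implies: the dot product itself is one fused multiply-add per element, but if coordinate pairs are stored as (r, theta) you pay a sin/cos per pair just to decode before doing the same arithmetic.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(1024)
b = rng.standard_normal(1024)

# Cartesian: one multiply-add per element, maps cleanly onto GPU lanes.
dot_cartesian = a @ b

# Hypothetical polar storage of consecutive coordinate pairs.
def to_polar(v):
    x, y = v[0::2], v[1::2]
    return np.hypot(x, y), np.arctan2(y, x)

def from_polar(r, theta):
    out = np.empty(r.size * 2)
    out[0::2] = r * np.cos(theta)
    out[1::2] = r * np.sin(theta)
    return out

ra, ta = to_polar(a)
rb, tb = to_polar(b)

# Mathematically the same dot product, but every pair now costs a
# sin and a cos at decode (plus an atan2 at encode) -- exactly the
# transcendental overhead that hurts throughput on GPUs.
dot_polar = from_polar(ra, ta) @ from_polar(rb, tb)
```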
I don't think they're using polar coordinates? They're quantizing to grid centroids.
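A minimal sketch of what quantizing to grid centroids could look like (my own illustration with made-up grid bounds and level count, not the authors' code): each coordinate is snapped to the center of its cell in a uniform grid, so the reconstruction error per coordinate is bounded by half the cell width, and the decode is pure arithmetic with no trig.

```python
import numpy as np

def quantize_to_grid(x, lo=-1.0, hi=1.0, levels=16):
    """Snap each coordinate to the center of its uniform grid cell."""
    step = (hi - lo) / levels
    idx = np.clip(np.floor((x - lo) / step), 0, levels - 1)
    return lo + (idx + 0.5) * step

x = np.array([0.03, -0.97, 0.51])
q = quantize_to_grid(x)
# Per-coordinate error is at most step/2 = 0.0625 for these settings.
```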