TurboQuant: A first-principles walkthrough

(arkaung.github.io)

199 points

by kweezar10 hours ago |

by amitport8 hours ago|

TurboQuant is a restricted version of EDEN quantization (NeurIPS 21, ICML 22). It lacks the optimal scale derivations, which makes the TurboQuant variant considerably less accurate than those works. We show this thoroughly in a new note at https://arxiv.org/abs/2604.18555.

We were the first to introduce post-rotation distribution-aware quantization in 2021. This was later implemented in many fields, including federated learning, vector retrieval, databases, inference engines, and KV-cache.

It would be appropriate to receive credit for this. Furthermore, it is baffling to see the name "TurboQuant" repeated in this context, considering the many works published from 2021 onwards.

The blog post mentioned above essentially guides you through EDEN quantization but ultimately settles on a sub-optimal MSE-minimizing version and an unbiasing trick. This trick often costs a full bit more than DRIVE/EDEN requires to achieve the same results using the unbiasing scale shown in the original 2021 paper.

by 0xbadcafebee7 hours ago|

For those who want to :popcorn-meme: the drama, there are some great comments on the peer review of the TurboQuant paper: https://openreview.net/forum?id=tO3ASKZlok

by gajjanag1 hours ago|

There are also more papers on similar themes.

For example, TurboQuant makes use of QJL (quantized Johnson-Lindenstrauss transformations). One of the first papers to characterize QJL, and in fact the rate-distortion tradeoff for quantized matrix multiplication in general, is "Optimal Quantization for Matrix Multiplication" (https://arxiv.org/abs/2410.13780) by Ordentlich and Polyanskiy.
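For readers who have not seen a 1-bit QJL-style estimator, here is a minimal NumPy sketch. It is an illustration under my own assumptions (a dense Gaussian projection matrix `S`, sizes `d` and `m` picked arbitrarily, helper names `qjl_encode` and `qjl_inner_product` made up for this example), not code from the TurboQuant or QJL papers. It keeps only the sign of each projection of `x` plus the norm of `x`, and still recovers an unbiased estimate of the inner product with an unquantized `y`.

```python
import numpy as np

def qjl_encode(x, S):
    """Store one sign bit per random projection, plus the norm of x."""
    return np.sign(S @ x), float(np.linalg.norm(x))

def qjl_inner_product(signs, x_norm, y, S):
    """Estimate <x, y> from the 1-bit sketch of x and an unquantized y."""
    m = S.shape[0]
    # For a Gaussian row s: E[sign(s.x) * (s.y)] = sqrt(2/pi) * <x, y> / ||x||,
    # so rescaling the empirical mean by sqrt(pi/2) * ||x|| removes the bias.
    return float(np.sqrt(np.pi / 2) * x_norm / m * (signs @ (S @ y)))

rng = np.random.default_rng(0)
d, m = 64, 20000                      # illustrative sizes (assumption)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
y = x + 0.5 * rng.standard_normal(d)
y /= np.linalg.norm(y)
S = rng.standard_normal((m, d))       # shared Gaussian projection matrix

signs, x_norm = qjl_encode(x, S)
estimate = qjl_inner_product(signs, x_norm, y, S)
exact = float(x @ y)
print(f"exact={exact:.4f}  estimate={estimate:.4f}")
```

The estimate is unbiased, but its variance shrinks only as 1/m, which is why how the bits are spent matters so much in this debate.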

There is also a more accessible survey paper around quantized matrix multiplication called "High-Rate Quantized Matrix Multiplication: Theory and Practice" (https://arxiv.org/abs/2601.17187), by the same authors.

TurboQuant cites none of them.

by kumarhn45 minutes ago|

TurboQuant is starting to look like a case study in how to turn a fragile paper into a breakthrough story.

The attribution is thin, the “6x compression” headline is not clearly separated from prior KV-cache quantization baselines like KIVI, and the RaBitQ comparison is hard to take seriously: single-core CPU for the baseline, A100 GPU for TurboQuant. It is comparing apples-to-datacenter. Worse, there are also public OpenReview comments saying that even the reported accuracy results are not reproducible.

Hard to believe this is the standard for something being promoted as a breakthrough. If this came from a random startup blog, people would be much harsher about it.

by amitport50 minutes ago|

I believe our claim at this point is more fundamental than just lack of citation.

The quantizer in TurboQuant is EDEN quantization (2021) applied to the KV-cache. It is neither a novel quantizer nor an improvement in quantization techniques.

In DRIVE/EDEN, we already introduced the version used in the "TurboQuant" paper and suggested optimal scale configurations that are better in both the MSE-minimizing and unbiased scenarios.

by om81 hours ago|

https://docs.vllm.ai/en/v0.20.0/api/vllm/model_executor/laye...

`vllm.model_executor.layers.quantization.turboquant`

> The technique implemented here consists of the scalar case of the HIGGS quantization method (Malinovskii et al., "Pushing the Limits of Large Language Model Quantization via the Linearity Theorem", NAACL 2025; preprint arXiv:2411.17525): rotation + optimized grid + optional re-normalization, applied to KV cache compression. A first application of this approach to KV-cache compression is in "Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models" (Shutova et al., ICML 2025; preprint arXiv:2501.19392). Both these references pre-date the TurboQuant paper (Zandieh et al., ICLR 2026).
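To make the three steps named in that docstring concrete (rotation, optimized grid, optional re-normalization), here is a rough NumPy sketch. It is a generic illustration, not vLLM's or HIGGS's actual implementation: the function names, the dense QR-based rotation, and fitting the grid on Gaussian samples are all my own assumptions.

```python
import numpy as np

def random_rotation(d, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))    # sign fix so the rotation is uniformly distributed

def lloyd_max_grid(samples, levels, iters=50):
    """1-D Lloyd-Max: alternate nearest-level assignment and centroid update."""
    grid = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - grid[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                grid[k] = samples[idx == k].mean()
    return grid

def quantize(x, R, grid, renormalize=True):
    """Rotate, snap each coordinate to the grid, optionally restore the norm."""
    d, norm = len(x), np.linalg.norm(x)
    z = (R @ x) * np.sqrt(d) / norm   # rotated coordinates, roughly N(0, 1)
    idx = np.abs(z[:, None] - grid[None, :]).argmin(axis=1)
    z_hat = grid[idx]
    if renormalize:
        z_hat = z_hat * np.sqrt(d) / np.linalg.norm(z_hat)
    return (R.T @ z_hat) * norm / np.sqrt(d), idx

rng = np.random.default_rng(0)
d, bits = 128, 2                      # illustrative sizes (assumption)
R = random_rotation(d, rng)
grid = lloyd_max_grid(rng.standard_normal(20000), 2 ** bits)
x = 3.0 * rng.standard_normal(d)
x_hat, codes = quantize(x, R, grid)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"grid={np.round(grid, 3)}  relative error={rel_err:.3f}")
```

A real kernel would use a fast structured rotation such as a randomized Hadamard transform instead of a dense matrix, and would store `codes` plus one norm per vector.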

by 0xA2kag3 hours ago|

Thanks a lot for pointing this out. I will update this explainer to properly add the prior literature so that there is a proper attribution.

by amitport3 hours ago|

Thanks for the quick response and for being willing to update the explainer. I really appreciate the clarification.

by adrian173 hours ago|

I wonder how often this happens in practice - by "this", I mean industry/LLM world not noticing* some research until a bigger player repeats it with louder PR.

(*hopefully I didn't misunderstand the situation)

by sva_2 hours ago|

Ask Jürgen Schmidhuber

by moffkalast2 hours ago|

If we go only by the cases that are publicly known, it already happens all the time. Lots of patents are a race to register between multiple parties too, and it's rarely done fairly.

by KnuthIsGod7 hours ago|

https://arxiv.org/abs/2604.18555

"This note clarifies the relationship between the recent TurboQuant work and the earlier DRIVE (NeurIPS 2021) and EDEN (ICML 2022) schemes. DRIVE is a 1-bit quantizer that EDEN extended to any bits per coordinate; we refer to them collectively as EDEN. First, TurboQuant is a special case of EDEN obtained by fixing EDEN's scalar scale parameter to . EDEN supports both biased and unbiased quantization, each optimized by a different (chosen via methods described in the EDEN works). The fixed choice used by TurboQuant is generally suboptimal, although the optimal for biased EDEN converges to as the dimension grows; accordingly TurboQuant approaches EDEN's behavior for large . Second, TurboQuant combines a biased -bit EDEN step with an unbiased 1-bit QJL quantization of the residual. It is suboptimal in three ways: (1) its -bit step uses the suboptimal ; (2) its 1-bit unbiased residual quantization has worse MSE than (unbiased) 1-bit EDEN; (3) chaining a biased -bit step with a 1-bit unbiased residual step is inferior to unbiasedly quantizing the input directly with -bit EDEN. Third, some of the analysis in the TurboQuant work mirrors that of the EDEN works: both exploit the connection between random rotations and the shifted Beta distribution, use the Lloyd-Max algorithm, and note that Randomized Hadamard Transforms can replace uniform random rotations. Experiments support these claims: biased EDEN (with optimized ) is more accurate than TurboQuant, and unbiased EDEN is markedly more accurate than TurboQuant, often by more than a bit (e.g., 2-bit EDEN beats 3-bit TurboQuant). We also repeat all accuracy experiments from the TurboQuant paper, showing that EDEN outperforms it in every setup we have tried."

by gcr50 minutes ago|

FYI your comment is missing several constants/words and is hard to read

by theredsix7 hours ago|

Are you guys going to follow up with a paper showing EDEN results match or beat turboquant for needle in a haystack benchmarks?

by amitport6 hours ago|

The note includes extensive experiments and reproduces many of the figures from the TurboQuant paper in our Section 5. Honestly, I think our case is pretty clear-cut as is. I am not sure what the overhead for those specific benchmarks would be, but we will look into it.

(In any case, I want to emphasize that the TurboQuant quantizer is a special case of EDEN.)

by meehai4 hours ago|

With the amount of traction this has gotten, presenting a clear set of experiments, even just in an arXiv paper, would be a great help in showcasing your improvements. And if they're easily reproducible, they could get integrated into the mainstream inference engines as well, since the main point here is compression with little degradation.

by amitport4 hours ago|

When you use TurboQuant, you are essentially using the EDEN quantizer under a different name applied to KV-cache.

Both EDEN and its 1-bit variant have been implemented in PyTorch, JAX, and TensorFlow across numerous open-source libraries and are used in various applications. I am currently writing a blog post that will document these in detail.

EDEN defines a scale parameter, S, for which we suggest specific optimal values for both biased and unbiased versions. As shown in the note I shared, these values lead to clear empirical improvements. Consequently, users who rely on the less optimal S value and the unbiasing method popularized by TurboQuant will generally see inferior results compared to those using EDEN with the optimal scale values suggested in our original papers.
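To illustrate why the choice of scale matters, here is a small NumPy sketch of the 1-bit case: rotate, keep only signs, then multiply by a scalar `S`. The two formulas below follow from elementary algebra (a least-squares fit for the biased case, and forcing the inner product with `x` to equal the squared norm for the unbiased case). Treat the function names and the dense QR rotation as assumptions of this example, and check the papers for the exact multi-bit scales.

```python
import numpy as np

def random_rotation(d, rng):
    """Random orthogonal matrix (QR of a Gaussian matrix, with sign fix)."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def one_bit_quantize(x, R, unbiased):
    """Return S * R^T sign(Rx) for one of two choices of the scale S."""
    z = R @ x
    signs = np.sign(z)
    l1 = np.abs(z).sum()
    if unbiased:
        S = (x @ x) / l1          # forces <x, x_hat> == ||x||^2 for every rotation
    else:
        S = l1 / len(x)           # least-squares S: minimizes ||z - S * signs||^2
    return S * (R.T @ signs)

rng = np.random.default_rng(0)
d = 16                            # illustrative dimension (assumption)
x = rng.standard_normal(d)
R = random_rotation(d, rng)

x_biased = one_bit_quantize(x, R, unbiased=False)
x_unbiased = one_bit_quantize(x, R, unbiased=True)
mse_biased = float(np.sum((x - x_biased) ** 2))
mse_unbiased = float(np.sum((x - x_unbiased) ** 2))

# Average over many independent rotations to see the bias (or lack of it).
def mean_estimate(unbiased, trials=2000):
    return np.mean(
        [one_bit_quantize(x, random_rotation(d, rng), unbiased) for _ in range(trials)],
        axis=0,
    )

mean_biased = mean_estimate(False)
mean_unbiased = mean_estimate(True)
shrink = float(mean_biased @ x) / float(x @ x)
print(f"single-shot MSE: biased={mse_biased:.3f} unbiased={mse_unbiased:.3f}")
print(f"biased mean shrinks x by a factor of about {shrink:.2f}")
```

The biased scale always wins on single-shot MSE, while the unbiased one averages out to `x` over many rotations, which is the property that matters when many quantized vectors are summed or averaged.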

by mskkm4 hours ago|

The public comments on OpenReview now include explicit allegations that the TurboQuant paper knowingly misrepresented RaBitQ and understated RaBitQ’s results. The RaBitQ authors also report in a technical note that several of TurboQuant’s runtime and recall numbers do not reproduce from the released code under the paper’s stated setup. In the note, TurboQuant generally loses to RaBitQ: https://arxiv.org/abs/2604.19528. If these public allegations hold up, then this is not just overhype or sloppy citation practice, but points to a distorted comparison and benchmark claims that do not survive reproduction.

by linuxhansl9 hours ago|

I am fascinated by this and similar research (RotorQuant, etc). It seems that by next year we will be able to run this year's largest models on last year's hardware. :)

Maybe we won't need as many data centers and as much power as we thought. Maybe we can run more powerful models locally.

by everythingctl8 hours ago|

> Maybe we can run more powerful models locally.

I thought the principal consequence of these KV cache optimisations was letting you run more simultaneous inferences on the same model with the same memory. It doesn’t let you store more model. In some sense that puts local LLM usage at a further disadvantage to inference done in a hyperscaler’s data center.

by linuxhansl7 hours ago|

The size of the KV cache (context stored) is proportional to the number of layers of the model and number of "hidden dimensions". For a 400B model it could be 30-60GB for just an 8K context window (depends on the model, etc, just a ballpark).

So shrinking that by 6x (from fp16) would be a big win for larger models. True, while TurboQuant can also be applied to model weights, it won't save size over q4 compression there, but it will have better accuracy.

Edits: Better context
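For anyone who wants to redo the ballpark above, the arithmetic is short. The config below is hypothetical (126 layers, head dimension 128, an 8K context) and only meant to show the shape of the calculation. Note that the per-token size depends on the number of KV heads, which equals the full hidden dimension only without grouped-query attention.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits_per_value):
    """Size of the KV cache for one sequence.

    The factor 2 covers keys and values; each token stores one vector of
    kv_heads * head_dim values per layer for each of them.
    """
    return 2 * layers * kv_heads * head_dim * tokens * bits_per_value / 8

GIB = 2 ** 30

# Hypothetical large model, 8K context (all numbers are assumptions).
full_mha_fp16 = kv_cache_bytes(layers=126, kv_heads=128, head_dim=128,
                               tokens=8192, bits_per_value=16)
gqa_fp16 = kv_cache_bytes(layers=126, kv_heads=8, head_dim=128,
                          tokens=8192, bits_per_value=16)
# A 6x reduction from fp16 corresponds to 16 / 6 bits per value on average.
full_mha_6x = kv_cache_bytes(layers=126, kv_heads=128, head_dim=128,
                             tokens=8192, bits_per_value=16 / 6)

print(f"full MHA, fp16:   {full_mha_fp16 / GIB:.2f} GiB")
print(f"8 KV heads, fp16: {gqa_fp16 / GIB:.2f} GiB")
print(f"full MHA, 6x:     {full_mha_6x / GIB:.2f} GiB")
```

With these made-up numbers the fp16 cache is 63 GiB with full multi-head attention and about 4 GiB with 8 KV heads, so the ballpark depends heavily on the attention layout. The cache also grows linearly with context length and with the number of concurrent sequences.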

by SilentM686 hours ago|

That's my hope as well as I tend to use low end GPUs (e.g. NVIDIA GeForce RTX 2060 @ 6GB). Been looking for an image generation model that can fit that vid card, for use with Ollama + GUI in Linux. No luck yet, since money's tight and jobs are tighter :(

by MadnessASAP4 hours ago|

An Arc B580 will just about fit Flux.2 Klein (At FP8). However, you can also easily get much larger GPUs on RunPod or Vast at $0.25/hr.

I would strongly recommend exploring that option: renting an RTX 5090 for an evening of image generation for a dollar or two is way more fun than trying to jam big models onto little cards. Just take some time to create a reasonable, scripted deployment workflow for when you create a fresh instance.

by fragmede4 hours ago|

hey what's your Venmo?

by acters2 hours ago|

Just look at DeepSeek V4: this preview model uses only 8 GB for a 1M-token KV cache (the context). It's insanely efficient already. It's just that most models coming out are barely catching up with technical breakthroughs. DeepSeek are pioneers.

Unfortunately V4 is not trained for most real world usage, it is mainly for world general knowledge.

by qingcharles6 hours ago|

We're only a few years into this new tech getting serious research manhours thrown at it. Already some incredible optimizations have been found in a short amount of time. Not only has the efficiency of inference been increasing dramatically, the quality of tiny models has been significantly improving.

The future is bright for local AI.

by treexs5 hours ago|

I feel like I've gotten really good at noticing which model generates what type of site and this oozes codex

by 0xA2kag3 hours ago|

Hey, thanks for the pointer. Had I known this, I would have used codex (as a matter of fact, I have never used it before, and this prompts me to try it if I can get something like this much quicker with codex). I think making codex copy this for new content will be much easier now. The issue was making things exactly the way I want: the exact intuition, the exact primers, and the exact visuals to drive the point home.

by treexs3 hours ago|

Woah very cool, yeah I think the cards and heading/subheading structure is very similar to what codex outputs, but I can tell the different visualizations definitely require your own personal touch

by npodbielski3 hours ago|

What did you use?

by gcr52 minutes ago|

On TheTom’s llama-cpp fork, TurboQuant makes inference about five to ten times slower than vanilla (M1 Max, qwen3.6-35b-a3b). Seems like the productionization is still a ways away.

by jarbus8 hours ago|

This is incredible. Interactive demos like this make mathematics 10x more accessible

by 0xA2kag4 hours ago|

Thanks a lot <3

by scaillib3 hours ago|

what are the tools you used (if any)?

by 0xA2kag2 hours ago|

It is just a bunch of ideas on how to make things intuitive and a lot of hand-holding Claude Code to get exactly what I want. Underneath, it is plain HTML, CSS, and JS.

by nafistiham4 hours ago|

Thanks a lot. It helped me get a much more detailed view of turboquant than a few youtube videos that I watched. Also, the choice of color is excellent as it serves both light and dark mode. I'll try to use it in my sites. Kudos!

by 0xA2kag3 hours ago|

Thanks a lot <3

by vb-84483 hours ago|

What did the author use to create the site?

by 0xA2kag3 hours ago|

I did a bunch of things :D I am not a frontend engineer (I am an MLE), so I don't have the prowess to create things like this. I am heavily inspired by 3blue1brown and I love creating interactive explainers for ML concepts like this. I previously created this as well: arkaung.github.io/interactive-eigenvector/. I heavily used Claude to get to the exact design, typography, and style I want (there was a lot of hand-holding to get to this state). I heavily influenced Claude on how I want the explainer to flow, how I want to make things intuitive, and the kinds of mathematical concepts I want to visualize (and how). So all in all, a lot of hand-holding for the coding agents to get to where I want and exactly how I want.

But at the end of the day it is just vanilla HTML, CSS and JS without anything fancy :D MathJax 3 was used to render math stuff.

by morbicer3 hours ago|

The fonts, the cards, the copy are all hallmarks of Claude Code.

While the aesthetic doesn't spark joy for me, the overall execution is great, the presentation flow and interactive boxes are very nice.

by sirluky3 hours ago|

xcz

by semiinfinitely6 hours ago|

"AI vectors"
