For example, TurboQuant makes use of QJL (quantized Johnson-Lindenstrauss transformations). One of the first papers to characterize QJL, and in fact the rate-distortion tradeoff for quantized matrix multiplication in general, is "Optimal Quantization for Matrix Multiplication" (https://arxiv.org/abs/2410.13780) by Ordentlich and Polyanskiy.
There is also a more accessible survey paper on quantized matrix multiplication called "High-Rate Quantized Matrix Multiplication: Theory and Practice" (https://arxiv.org/abs/2601.17187), by the same authors.
TurboQuant cites neither of them.
The attribution is thin, the “6x compression” headline is not clearly separated from prior KV-cache quantization baselines like KIVI, and the RaBitQ comparison is hard to take seriously: single-core CPU for the baseline, A100 GPU for TurboQuant. It is comparing apples-to-datacenter. Worse, there are also public OpenReview comments saying that even the reported accuracy results are not reproducible.
Hard to believe this is the standard for something being promoted as a breakthrough. If this came from a random startup blog, people would be much harsher about it.
The quantizer in TurboQuant is EDEN quantization (2021) applied to the KV-cache. It is neither a novel quantizer nor an improvement in quantization techniques.
In DRIVE/EDEN, we already introduced the version used in the TurboQuant paper and suggested optimal scale configurations that are better in both the MSE-minimizing and unbiased scenarios.
(*hopefully I didn't misunderstand the situation)
`vllm.model_executor.layers.quantization.turboquant`
> The technique implemented here consists of the scalar case of the HIGGS quantization method (Malinovskii et al., "Pushing the Limits of Large Language Model Quantization via the Linearity Theorem", NAACL 2025; preprint arXiv:2411.17525): rotation + optimized grid + optional re-normalization, applied to KV cache compression. A first application of this approach to KV-cache compression is in "Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models" (Shutova et al., ICML 2025; preprint arXiv:2501.19392). Both these references pre-date the TurboQuant paper (Zandieh et al., ICLR 2026).
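For readers who have not seen the "rotation + optimized grid + optional re-normalization" recipe from the quoted docstring, here is a minimal sketch in plain Python. It is not the vLLM code and not any paper's reference implementation. The function names are hypothetical, the rotation is a randomized Hadamard transform (so the length must be a power of two), and the 2-bit grid uses rounded Lloyd-Max levels for a standard normal.

```python
import math
import random

# Approximate 4-level (2-bit) Lloyd-Max grid for N(0, 1); values are rounded.
GRID_2BIT = [-1.510, -0.4528, 0.4528, 1.510]


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform (it is its own inverse)."""
    v = list(v)
    n = len(v)  # assumed to be a power of two
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [x * s for x in v]


def quantize(x, signs, grid=GRID_2BIT):
    """Rotate x (random signs, then Hadamard) and snap to the nearest grid level."""
    d = len(x)
    norm = math.sqrt(sum(t * t for t in x))
    scale = math.sqrt(d) / norm if norm > 0 else 0.0
    rotated = fwht([s * t for s, t in zip(signs, x)])
    # After rotation, sqrt(d) * (coordinate / norm) is roughly N(0, 1).
    codes = [
        min(range(len(grid)), key=lambda k: abs(grid[k] - scale * r))
        for r in rotated
    ]
    return norm, codes


def dequantize(norm, codes, signs, grid=GRID_2BIT, renormalize=False):
    """Map codes back to grid levels, undo the rotation, restore the norm."""
    d = len(codes)
    q = [grid[c] / math.sqrt(d) for c in codes]
    if renormalize:
        # Optional step: force the rotated reconstruction to unit norm.
        qn = math.sqrt(sum(t * t for t in q))
        q = [t / qn for t in q]
    back = fwht(q)  # inverse Hadamard, then undo the sign flips
    return [norm * s * t for s, t in zip(signs, back)]


rng = random.Random(0)
d = 64
signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
norm, codes = quantize(x, signs)
x_hat = dequantize(norm, codes, signs)
rel_err = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / sum(a * a for a in x)
print(f"relative squared error at 2 bits: {rel_err:.3f}")
```

Note that this sketch reconstructs with the grid levels as-is, which is the fixed-scale behavior the thread is arguing about; a per-vector scale would multiply `q` before the inverse rotation.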
HIGGS is an extension of EDEN (using the well-known method for blockwise Lloyd-Max).
The proper framing of this "TurboQuant" layer in vLLM (which does not include QJL) is precisely EDEN (2022) without the scale correction.
"This note clarifies the relationship between the recent TurboQuant work and the earlier DRIVE (NeurIPS 2021) and EDEN (ICML 2022) schemes. DRIVE is a 1-bit quantizer that EDEN extended to any bits per coordinate; we refer to them collectively as EDEN. First, TurboQuant is a special case of EDEN obtained by fixing EDEN's scalar scale parameter to S = 1. EDEN supports both biased and unbiased quantization, each optimized by a different S (chosen via methods described in the EDEN works). The fixed choice S = 1 used by TurboQuant is generally suboptimal, although the optimal S for biased EDEN converges to 1 as the dimension grows; accordingly TurboQuant approaches EDEN's behavior for large d. Second, TurboQuant combines a biased (b-1)-bit EDEN step with an unbiased 1-bit QJL quantization of the residual. It is suboptimal in three ways: (1) its (b-1)-bit step uses the suboptimal S = 1; (2) its 1-bit unbiased residual quantization has worse MSE than (unbiased) 1-bit EDEN; (3) chaining a biased (b-1)-bit step with a 1-bit unbiased residual step is inferior to unbiasedly quantizing the input directly with b-bit EDEN. Third, some of the analysis in the TurboQuant work mirrors that of the EDEN works: both exploit the connection between random rotations and the shifted Beta distribution, use the Lloyd-Max algorithm, and note that Randomized Hadamard Transforms can replace uniform random rotations. Experiments support these claims: biased EDEN (with optimized S) is more accurate than TurboQuant, and unbiased EDEN is markedly more accurate than TurboQuant, often by more than a bit (e.g., 2-bit EDEN beats 3-bit TurboQuant). We also repeat all accuracy experiments from the TurboQuant paper, showing that EDEN outperforms it in every setup we have tried."
(In any case, I want to emphasize that the TurboQuant quantizer is a special case of EDEN.)
Both EDEN and its 1-bit variant have been implemented in PyTorch, JAX, and TensorFlow across numerous open-source libraries and are used in various applications. I am currently writing a blog post that will document these in detail.
EDEN defines a scale parameter, S, for which we suggest specific optimal values for both the biased and unbiased versions. As shown in the note I shared, these values lead to clear empirical improvements. Consequently, users who rely on the suboptimal S value and the unbiasing method popularized by TurboQuant will generally see inferior results compared to those using EDEN with the optimal scale values suggested in our original papers.
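To make the role of S concrete, here is a small sketch in plain Python, not taken from any EDEN implementation. It assumes `r` is the already-rotated, normalized input and `q` is its coordinate-wise quantization (1-bit signs in this demo), with the reconstruction being S times q. The helper names are illustrative. The first formula is the ordinary least-squares scale, which minimizes the squared error for a fixed `q`. The second forces the reconstruction's inner product with `r` to equal the squared norm of `r`, which is the form I understand the unbiased variant to take; unbiasedness itself is a statement in expectation over the random rotation and is not demonstrated here.

```python
import math
import random


def dot(a, b):
    return sum(s * t for s, t in zip(a, b))


def mse_scale(r, q):
    """Least-squares scale: the S that minimizes ||r - S*q||^2 for fixed q."""
    return dot(r, q) / dot(q, q)


def unbiased_scale(r, q):
    """Scale chosen so that dot(S*q, r) equals ||r||^2."""
    return dot(r, r) / dot(r, q)


def sq_err(r, q, s):
    """Squared reconstruction error ||r - s*q||^2."""
    return sum((a - s * b) ** 2 for a, b in zip(r, q))


rng = random.Random(0)
d = 128
r = [rng.gauss(0.0, 1.0) for _ in range(d)]
# Normalize so that ||r||^2 = d, i.e. coordinates have unit average power.
n = math.sqrt(dot(r, r))
r = [t * math.sqrt(d) / n for t in r]

# 1-bit grid for N(0, 1): +/- sqrt(2/pi). Demo only.
level = math.sqrt(2.0 / math.pi)
q = [level if t >= 0 else -level for t in r]

for name, s in [
    ("fixed S=1", 1.0),
    ("least-squares S", mse_scale(r, q)),
    ("unbiased S", unbiased_scale(r, q)),
]:
    print(f"{name:>16}: S={s:.4f}  squared error={sq_err(r, q, s):.3f}")
```

For any fixed `q`, the least-squares scale can never give a larger squared error than S = 1, and by Cauchy-Schwarz the second scale is at least as large as the first, so it trades some squared error for the inner-product constraint.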