TurboQuant is a novel algorithm for compressing the KV cache that language models hold in GPU memory during inference, targeting the memory bottleneck of long conversations. It reduces cache memory usage by at least six times while maintaining zero accuracy loss. This makes context windows of millions of tokens practical, where uncompressed caches would otherwise grow to hundreds of gigabytes per session. The work will be presented at the International Conference on Learning Representations (ICLR) in 2026.
TurboQuant combines two sub-algorithms to achieve efficient KV cache compression: PolarQuant and QJL (Quantized Johnson-Lindenstrauss). PolarQuant decouples each vector's magnitude from its direction, allowing each component to be compressed more efficiently. Unlike traditional quantization methods, it stores no per-block quantization constants, so the compressed representation preserves the data's integrity without extra bits per value.
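The magnitude/direction decoupling idea can be illustrated with a minimal NumPy sketch. The function names and the int8 direction encoding below are assumptions for exposition, not the paper's actual scheme; the point is that unit-direction components are bounded in [-1, 1], so one fixed scale works for every vector and no per-vector constant needs to be stored:

```python
import numpy as np

def polar_quantize(v):
    """Hypothetical sketch: store the norm separately, quantize the
    unit direction to int8 with a fixed scale of 127 (components of a
    unit vector lie in [-1, 1], so no stored constant is needed)."""
    norm = np.linalg.norm(v)
    direction = v / norm if norm > 0 else v
    q = np.round(direction * 127).astype(np.int8)  # fixed scale, no side constant
    return norm, q

def polar_dequantize(norm, q):
    """Rebuild an approximation: scale the quantized direction by the norm."""
    return norm * (q.astype(np.float32) / 127.0)

rng = np.random.default_rng(0)
v = rng.standard_normal(128).astype(np.float32)
norm, q = polar_quantize(v)
v_hat = polar_dequantize(norm, q)
rel_err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
```

With 128 dimensions the worst-case relative reconstruction error of this toy scheme is about sqrt(128) * 0.5/127 ≈ 4.5%, which is why direction and magnitude are cheaper to store separately than the raw float vector.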
The QJL method pushes compression further by reducing residual errors to a single sign bit per projection, again storing no constants, which yields significant compression ratios without compromising data accuracy. Importantly, TurboQuant provides a mathematically unbiased estimator of the attention inner products in transformer models, enabling zero accuracy loss during inference.
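The sign-bit idea and the unbiasedness claim can be sketched as follows. This sketch relies on the standard Gaussian identity E[(s·q) * sign(s·k)] = sqrt(2/pi) * <q,k> / ||k||; all names, sizes, and the shared projection matrix are illustrative assumptions, not the published algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 4096  # head dim and projection count (assumed sizes; larger m -> lower variance)

# Shared random Gaussian projection, fixed for all keys and queries.
S = rng.standard_normal((m, d))

def qjl_encode(k):
    """Compress a key to its norm plus one sign bit per projection."""
    return np.linalg.norm(k), np.sign(S @ k)

def qjl_inner_product(q, k_norm, k_signs):
    """Unbiased estimate of <q, k> from sign bits:
    E[(s.q) * sign(s.k)] = sqrt(2/pi) * <q,k> / ||k||, so rescaling
    by ||k|| * sqrt(pi/2) recovers <q, k> in expectation."""
    return k_norm * np.sqrt(np.pi / 2) * np.mean((S @ q) * k_signs)

q = rng.standard_normal(d)
k = rng.standard_normal(d)
k_norm, k_signs = qjl_encode(k)
est = qjl_inner_product(q, k_norm, k_signs)
```

Because the estimator is unbiased, averaging over many projections concentrates it around the true inner product, which is the property that lets attention scores be computed from sign bits without systematic error.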
This approach enables TurboQuant to compress the key and value tensors that models cache for attention computation, directly addressing the inference-memory bottleneck. As a result, it supports contexts of millions of tokens and caches amounting to hundreds of gigabytes per session, without model retraining or fine-tuning, and so integrates seamlessly into existing inference pipelines.
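To see why such caches reach hundreds of gigabytes, a back-of-the-envelope sizing helps. The model configuration below is a hypothetical example for illustration, not taken from the TurboQuant paper:

```python
# Hypothetical model configuration (illustrative values, not from the paper).
layers, kv_heads, head_dim = 32, 8, 128
tokens = 1_000_000          # a million-token context
bytes_per_value = 2         # fp16 storage

# Keys and values are both cached, hence the leading factor of 2.
cache_bytes = 2 * layers * kv_heads * head_dim * tokens * bytes_per_value
cache_gb = cache_bytes / 1e9

# At the reported >= 6x compression ratio:
compressed_gb = cache_gb / 6
print(f"uncompressed: {cache_gb:.0f} GB, compressed: {compressed_gb:.0f} GB")
# → uncompressed: 131 GB, compressed: 22 GB
```

Even this modest configuration lands in the hundred-gigabyte range at a million tokens, which is why a 6x, accuracy-preserving reduction matters for serving long sessions.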
TurboQuant performed strongly in benchmarks on the open-source models Gemma, Mistral, and Llama. At 4x compression it matched full-precision performance, achieving perfect scores on needle-in-a-haystack tasks of up to 104,000 tokens. Note that the claim of "zero accuracy loss" applies specifically to KV cache compression during inference and does not extend to model weights: TurboQuant compresses only the temporary memory used for mid-session attention computation.
The evaluations did not, however, include Google's Gemini model at scale, a limitation in the scope of the testing. Because TurboQuant requires no retraining or fine-tuning and introduces negligible runtime overhead, it can be dropped into current inference pipelines, enabling efficient memory management during language model operation without compromising computational accuracy.
The publication of TurboQuant drew immediate industry attention. Cloudflare CEO Matthew Prince called it Google's "DeepSeek" moment, and on the day of the announcement, shares of memory suppliers Micron, Western Digital, and Seagate fell. Both the executive remark and the same-day share-price declines were reported in media coverage of the announcement.


