"Nothing is more practical than a good theory," as Vladimir Vapnik liked to say when developing his theory of SVMs. This adage has come true again in a recent series of beautiful papers from Google.
The Johnson-Lindenstrauss (JL) Lemma is a staple of classical algorithms and high-dimensional statistics. This remarkable result says that there is a map from points in a high-dimensional space to a much lower-dimensional space that preserves all pairwise distances within a (1±ε) factor, with the target dimension scaling only logarithmically in the number of points, independent of the original dimension. Moreover, the mapping is strikingly simple: just project by multiplying by a random matrix (with each entry drawn independently and identically, say from a Normal distribution). The JL Lemma is thus a very simple technique for dimensionality reduction and, in this form, a part of the standard data science toolkit.
The Google team has now demonstrated its use for compressing neural networks by quantization: representing weights and activations in just a few bits rather than in full FP16 precision. The central attention mechanism of the Transformer architecture involves computing inner products between query and key vectors. A natural idea is to use the JL Lemma to first project these vectors into a lower dimension before computing the inner products, since the JL Lemma can be shown to preserve inner products as well as distances.
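To make this concrete, here is a minimal NumPy sketch of the idea (our illustration, not code from the papers): project queries and keys with a random Gaussian matrix and check that distances and inner products survive the projection. The dimensions and data below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 1024, 128, 1000            # original dim, projected dim, number of vectors

# JL projection: i.i.d. Gaussian entries with variance 1/m, so norms are preserved in expectation
S = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, d))

queries = rng.normal(size=(n, d))
keys = rng.normal(size=(n, d))
q_proj, k_proj = queries @ S.T, keys @ S.T

# Pairwise distances survive within a small relative error...
print(np.linalg.norm(queries[0] - keys[0]), np.linalg.norm(q_proj[0] - k_proj[0]))

# ...and so do the query-key inner products that attention needs
ip_orig = np.sum(queries * keys, axis=1)
ip_proj = np.sum(q_proj * k_proj, axis=1)
print(np.mean(np.abs(ip_proj - ip_orig)))
```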
Going further, the Google paper shows that one can even round the projected vector and retain only the signs, using just one bit per coordinate. Interestingly, the method is asymmetric: the keys are quantized to 1 bit while the queries remain in full precision. This asymmetry is a deliberate practical choice. At inference time, you have one query but potentially millions of cached keys, so storage cost is dominated by the keys. Quantizing the side you store many copies of is the move that pays off.
The resulting inner product, after a simple rescaling, is an unbiased estimate of the inner product of the vectors in the original high-dimensional space. This Quantized JL Lemma, or QJL, is shown in the paper to be a very effective tool for compressing the KV cache in Transformers without losing much accuracy. It seems likely to be a very useful tool for many other applications as well.
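Here is a toy version of that estimator (our sketch, in the same spirit as the paper rather than its exact implementation): each key is stored as one sign bit per projected coordinate plus its norm, while the query stays in full precision, and the scaling constant below makes the toy estimate unbiased for Gaussian projections.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 128, 512                      # key/query dimension, projection dimension
S = rng.normal(size=(m, d))          # shared random Gaussian projection

def quantize_key(k):
    """Keep only the signs of the projected key (1 bit per coordinate) plus its norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def estimate_inner_product(q, k_signs, k_norm):
    """Estimate <q, k> from a full-precision query and a 1-bit key."""
    return np.sqrt(np.pi / 2) / m * k_norm * np.dot(S @ q, k_signs)

q, k = rng.normal(size=d), rng.normal(size=d)
print(np.dot(q, k), estimate_inner_product(q, *quantize_key(k)))
```

Averaged over the randomness of the projection, the estimate equals the true inner product, and its variance shrinks as the projection dimension m grows.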
In the second paper, the Google team used randomness again, this time in a very elegant way, for quantization. Now they rotate a vector by applying a random orthogonal matrix. Even when applied to an arbitrary vector, the result is a vector that looks random: in high dimensions, each coordinate of the rotated unit-norm vector is approximately distributed as N(0, 1/d). This is a consequence of measure concentration on the high-dimensional sphere, and it lets you precompute a single universal codebook (Lloyd-Max optimal for the Gaussian) that works on any input, without ever seeing the data.
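A toy version of that pipeline (again our sketch, not the papers' code): a dense random rotation stands in for the fast structured one, and a small codebook is fitted to the standard Gaussian by Lloyd iterations once, then reused for every input.

```python
import numpy as np

rng = np.random.default_rng(2)
d, bits = 256, 3

# A random rotation: orthogonal matrix from the QR decomposition of a Gaussian matrix
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Universal codebook: Lloyd iterations on N(0, 1) samples; never sees the actual data
samples = rng.normal(size=100_000)
codebook = np.quantile(samples, (np.arange(2 ** bits) + 0.5) / 2 ** bits)   # initial guess
for _ in range(30):                                                          # Lloyd-Max updates
    nearest = np.abs(samples[:, None] - codebook).argmin(axis=1)
    codebook = np.array([samples[nearest == j].mean() for j in range(2 ** bits)])

def quantize(x):
    """Rotate, rescale coordinates to roughly N(0, 1), snap each one to the nearest codeword."""
    z = (Q @ (x / np.linalg.norm(x))) * np.sqrt(d)
    return np.abs(z[:, None] - codebook).argmin(axis=1)    # plus the stored norm of x

codes = quantize(rng.normal(size=d) * 5.0)                  # works for any input vector
```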
But there's a subtlety: this scalar quantization step, while it minimizes mean squared error, does not give an unbiased estimate of inner products. The paper offers a surprisingly elegant solution. Use random rotations + scalar quantization at (b-1) bits to get a low-variance but biased estimate. Then use QJL on the residual error with just 1 bit to correct the bias. The two stages combine to give an unbiased estimate of the inner product at b bits total, capturing the strengths of both techniques. PolarQuant brings low variance, QJL eliminates bias, and you get both for the price of b bits.
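Putting the two stages together in toy form (our sketch, with details such as bit allocation and the fast rotation simplified away): stage one scalar-quantizes the rotated key with (b-1) bits per coordinate, stage two applies 1-bit QJL to the quantization residual, and the estimator adds the two contributions.

```python
import numpy as np

rng = np.random.default_rng(3)
d, b = 256, 4                                   # dimension, total bits per coordinate

Q, _ = np.linalg.qr(rng.normal(size=(d, d)))    # shared random rotation
codebook = np.linspace(-3, 3, 2 ** (b - 1))     # stand-in for the Lloyd-Max codebook, (b-1) bits
S = rng.normal(size=(d, d))                     # Gaussian projection for the 1-bit QJL stage

def quantize_key(k):
    z = Q @ k
    idx = np.abs(z[:, None] - codebook).argmin(axis=1)     # stage 1: low variance, but biased
    residual = z - codebook[idx]
    return idx, np.sign(S @ residual), np.linalg.norm(residual)   # stage 2: 1-bit QJL

def estimate_inner_product(q, idx, signs, res_norm):
    zq = Q @ q
    stage1 = np.dot(zq, codebook[idx])                                            # biased part
    stage2 = np.sqrt(np.pi / 2) / len(signs) * res_norm * np.dot(S @ zq, signs)   # corrects the bias
    return stage1 + stage2

q, k = rng.normal(size=d), rng.normal(size=d)
print(np.dot(q, k), estimate_inner_product(q, *quantize_key(k)))
```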
It turns out that this "TurboQuant" method is close to the best possible: it approaches the information-theoretic limit given by a classical result of Shannon. Further compression of the KV cache via quantization alone will therefore yield only narrow gains from here. The remaining headroom lies in what you compress (architectural changes like MLA and GQA) or in compressing along orthogonal dimensions (entropy coding, calibration-based methods).
The paper demonstrates that TurboQuant works extremely well in practice. One application discussed is nearest-neighbor search in information retrieval problems. The quantized method achieves very good precision and recall, and the Google researchers write in an accompanying blog post that it should have many applications in semantic search.
What makes TurboQuant particularly interesting from an engineering standpoint is that the entire mathematical apparatus (random rotation, codebook lookup, and residual correction) maps remarkably cleanly onto GPU hardware. The random rotation can be implemented as a Walsh-Hadamard Transform in O(d log d) time using only additions and subtractions, with no multiplications in the hot path. The codebook is small enough to live in constant memory. The per-coordinate quantization is parallel.
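For reference, here is what that rotation looks like in plain NumPy (a sketch; a real implementation would be a fused GPU kernel and would pad d to a power of two): flip signs with one fixed random sign vector, then apply the fast Walsh-Hadamard butterfly.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform: O(d log d) additions and subtractions, d a power of two."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

def random_rotation(x, signs):
    """Randomized Hadamard transform: random sign flips, FWHT, then rescale to preserve norms."""
    return fwht(x * signs) / np.sqrt(len(x))

rng = np.random.default_rng(4)
d = 1024
signs = rng.choice([-1.0, 1.0], size=d)         # one random sign vector, reused for every input
x = rng.normal(size=d)
print(np.linalg.norm(x), np.linalg.norm(random_rotation(x, signs)))   # norms agree
```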
From Embedl’s point of view, this observation means that TurboQuant and QJL also have great potential for use with our recent FlashHead techniques, which cast the token generation problem as an information retrieval problem, as we discussed in previous blog posts. So stay tuned for exciting updates on the results!