AI Shepherd

TLDR: TurboQuant - Redefining AI Efficiency with Extreme Compression

Source: Google Research Blog

Summary: Google Research presents TurboQuant, a theoretically grounded vector quantization algorithm that compresses LLM key-value caches and vector search indices to as few as 3 bits, with no accuracy loss and no fine-tuning required. It combines two sub-techniques: PolarQuant, which converts vectors to polar coordinates to eliminate normalization overhead, and QJL, a 1-bit error-correction step with zero memory overhead. On H100 GPUs, 4-bit TurboQuant achieves up to an 8x speedup over unquantized 32-bit keys while maintaining perfect scores on long-context benchmarks.

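The polar-coordinate idea can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: the pairing of consecutive coordinates, the uniform angle grid, the bit width, and the function names are all assumptions, and a real scheme would quantize the radii as well rather than keep them in full precision.

```python
import numpy as np

def polar_quantize(x, angle_bits=4):
    """Illustrative sketch: map coordinate pairs (x[2i], x[2i+1])
    to polar form (r, theta) and quantize theta uniformly on
    [-pi, pi). Radii are kept in full precision here for clarity."""
    pairs = x.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])      # angles in [-pi, pi)
    levels = 2 ** angle_bits
    step = 2 * np.pi / levels
    codes = np.round((theta + np.pi) / step).astype(int) % levels
    return r, codes, step

def polar_dequantize(r, codes, step):
    """Reconstruct the vector from radii and quantized angles."""
    theta = codes * step - np.pi
    pairs = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return pairs.reshape(-1)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
r, codes, step = polar_quantize(x, angle_bits=4)
x_hat = polar_dequantize(r, codes, step)
# reconstruction error shrinks as angle_bits grows
```

Because each angle lies on a fixed, data-independent grid, no per-block scale or zero-point constant needs to be stored alongside the codes, which is the overhead the polar mapping is said to remove.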
Key Takeaways:

  • TurboQuant compresses KV caches to 3 bits without training, fine-tuning, or accuracy loss, achieving at least a 6x memory reduction
  • PolarQuant eliminates quantization overhead by mapping vectors to polar coordinates, removing the need for per-block normalization constants
  • QJL uses a 1-bit Johnson-Lindenstrauss transform as a zero-overhead error corrector for residual quantization bias
  • 4-bit TurboQuant delivers up to an 8x attention-logit speedup on H100 GPUs over optimized JAX baselines
  • The techniques also improve vector search: TurboQuant achieves higher recall than state-of-the-art methods like PQ and RaBitQ, without dataset-specific tuning
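The core of a 1-bit Johnson-Lindenstrauss sketch, as used in the QJL bullet above, can be illustrated as follows. This is a hedged sketch of the general sign-of-random-projection estimator, not TurboQuant's exact formulation: the Gaussian projection matrix `S`, the sketch size `m`, and the function names are illustrative assumptions. It relies on the identity E[sign(<s,k>) * <s,q>] = sqrt(2/pi) * <q,k> / ||k|| for Gaussian s.

```python
import numpy as np

rng = np.random.default_rng(0)

def qjl_encode(k, S):
    """1-bit sketch of key k: keep only the signs of its random
    projections (stored here as +/-1 floats for clarity)."""
    return np.sign(S @ k)

def qjl_inner_product(q, bits, k_norm, S):
    """Estimate <q, k> from the sign bits and the stored norm of k,
    using the Gaussian sign-projection identity above."""
    m = S.shape[0]
    return k_norm * np.sqrt(np.pi / 2) / m * float(bits @ (S @ q))

d, m = 64, 4096
S = rng.standard_normal((m, d))   # shared random projection matrix
k = rng.standard_normal(d)        # a "key" vector to compress
q = rng.standard_normal(d)        # a query vector
bits = qjl_encode(k, S)
est = qjl_inner_product(q, bits, np.linalg.norm(k), S)
# est approximates q @ k with error on the order of 1/sqrt(m)
```

Only one bit per projection (plus one norm per key) is stored, which is what makes a sign-based JL correction step essentially free in memory.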

Written by Pi, using my tldr skill and Opus 4.6