Related papers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

URL: http://arxiv.org/abs/2504.19874v1
Date: Mon, 28 Apr 2025 15:05:35 GMT
Title: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
Authors: Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni
Abstract summary: Vector quantization aims to quantize high-dimensional Euclidean vectors while minimizing distortion in their geometric structure. We propose TurboQuant to address both mean-squared error (MSE) and inner product distortion. Our data-oblivious algorithms, suitable for online applications, achieve near-optimal distortion rates.
Score: 13.14434628836727
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vector quantization, a problem rooted in Shannon's source coding theory, aims to quantize high-dimensional Euclidean vectors while minimizing distortion in their geometric structure. We propose TurboQuant to address both mean-squared error (MSE) and inner product distortion, overcoming limitations of existing methods that fail to achieve optimal distortion rates. Our data-oblivious algorithms, suitable for online applications, achieve near-optimal distortion rates (within a small constant factor) across all bit-widths and dimensions. TurboQuant achieves this by randomly rotating input vectors, inducing a concentrated Beta distribution on coordinates, and leveraging the near-independence property of distinct coordinates in high dimensions to simply apply optimal scalar quantizers per each coordinate. Recognizing that MSE-optimal quantizers introduce bias in inner product estimation, we propose a two-stage approach: applying an MSE quantizer followed by a 1-bit Quantized JL (QJL) transform on the residual, resulting in an unbiased inner product quantizer. We also provide a formal proof of the information-theoretic lower bounds on best achievable distortion rate by any vector quantizer, demonstrating that TurboQuant closely matches these bounds, differing only by a small constant ($\approx 2.7$) factor. Experimental results validate our theoretical findings, showing that for KV cache quantization, we achieve absolute quality neutrality with 3.5 bits per channel and marginal quality degradation with 2.5 bits per channel. Furthermore, in nearest neighbor search tasks, our method outperforms existing product quantization techniques in recall while reducing indexing time to virtually zero.
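
The two-stage idea described in the abstract (random rotation, per-coordinate scalar quantization, then a 1-bit sign sketch of the residual) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the uniform grid below is a simple stand-in for the optimal scalar quantizer, and the function names, the clipping range `clip`, the dimension, and the bit-width are all assumptions chosen for illustration.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def matvec(m, x):
    return [dot(row, x) for row in m]

def random_rotation(d, rng):
    """Random orthogonal matrix (list of rows) via Gram-Schmidt on Gaussian rows."""
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:  # remove components along the rows built so far
            c = dot(u, v)
            v = [b - c * a for a, b in zip(u, v)]
        rows.append(unit(v))
    return rows

def quantize_mse(x, rot, bits, clip=3.0):
    """Rotate x, then snap each coordinate to a uniform grid.
    Assumption: the uniform grid replaces the optimal scalar quantizer."""
    lim = clip / math.sqrt(len(x))  # rotated coordinates of a unit vector have std about 1/sqrt(d)
    levels = 2 ** bits
    step = 2 * lim / levels
    codes = []
    for y in matvec(rot, x):
        k = int(math.floor((y + lim) / step))
        codes.append(min(max(k, 0), levels - 1))  # clip out-of-range values
    return codes

def dequantize_mse(codes, rot, bits, clip=3.0):
    """Map codes to grid centers, then undo the rotation with the transpose."""
    d = len(codes)
    lim = clip / math.sqrt(d)
    step = 2 * lim / 2 ** bits
    y_hat = [-lim + (k + 0.5) * step for k in codes]
    return [sum(rot[i][j] * y_hat[i] for i in range(d)) for j in range(d)]

def qjl_encode(residual, proj):
    """1-bit sketch of the residual: signs of Gaussian projections plus its norm."""
    signs = [1.0 if p >= 0 else -1.0 for p in matvec(proj, residual)]
    return signs, math.sqrt(dot(residual, residual))

def qjl_inner(query, signs, norm, proj):
    """Unbiased estimate of <query, residual> from the sign sketch."""
    m = len(proj)
    return math.sqrt(math.pi / 2) / m * norm * dot(matvec(proj, query), signs)

# Usage: quantize a unit vector x, then estimate <q, x> from the stored codes.
rng = random.Random(0)
d, bits = 64, 4
x = unit([rng.gauss(0.0, 1.0) for _ in range(d)])
q = unit([rng.gauss(0.0, 1.0) for _ in range(d)])
rot = random_rotation(d, rng)
proj = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]

codes = quantize_mse(x, rot, bits)
x_hat = dequantize_mse(codes, rot, bits)
residual = [a - b for a, b in zip(x, x_hat)]
signs, r_norm = qjl_encode(residual, proj)
estimate = dot(q, x_hat) + qjl_inner(q, signs, r_norm, proj)
print("true inner product:", dot(q, x))
print("estimated inner product:", estimate)
```

Neither the rotation nor the projection matrix depends on the data, which is what makes this kind of scheme usable online: each vector can be encoded as it arrives, with no codebook training pass. The sign-sketch term corrects, in expectation, the inner product error left by the first stage.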

Related papers

MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z)
FlatQuant: Flatness Matters for LLM Quantization [58.28221892035609]
We propose FlatQuant, a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective. It achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%.
arXiv Detail & Related papers (2024-10-12T08:10:28Z)
Adaptive variational quantum dynamics simulations with compressed circuits and fewer measurements [4.2643127089535104]
We show an improved version of the adaptive variational quantum dynamics simulation (AVQDS) method, which we call AVQDS(T). The algorithm adaptively adds layers of disjoint unitary gates to the ansatz circuit so as to keep the McLachlan distance, a measure of the accuracy of the variational dynamics, below a fixed threshold. We also show a method based on eigenvalue truncation to solve the linear equations of motion for the variational parameters with enhanced noise resilience.
arXiv Detail & Related papers (2024-08-13T02:56:43Z)
Fast Flux-Activated Leakage Reduction for Superconducting Quantum Circuits [84.60542868688235]
Leakage out of the computational subspace arises from the multi-level structure of qubit implementations. We present a resource-efficient universal leakage reduction unit for superconducting qubits using parametric flux modulation. We demonstrate that using the leakage reduction unit in repeated weight-two stabilizer measurements reduces the total number of detected errors in a scalable fashion.
arXiv Detail & Related papers (2023-09-13T16:21:32Z)
Randomized semi-quantum matrix processing [0.0]
We present a hybrid quantum-classical framework for simulating generic matrix functions. The method is based on randomization over the Chebyshev approximation of the target function. We prove advantages on average depths, including quadratic speed-ups on costly parameters.
arXiv Detail & Related papers (2023-07-21T18:00:28Z)
Quantum Worst-Case to Average-Case Reductions for All Linear Problems [66.65497337069792]
We study the problem of designing worst-case to average-case reductions for quantum algorithms. We provide an explicit and efficient transformation of quantum algorithms that are only correct on a small fraction of their inputs into ones that are correct on all inputs.
arXiv Detail & Related papers (2022-12-06T22:01:49Z)
Quantum Sparse Coding [5.130440339897477]
We develop a quantum-inspired algorithm for sparse coding. The emergence of quantum computers and Ising machines can potentially lead to more accurate estimations. We conduct numerical experiments with simulated data on LightSolver's quantum-inspired digital platform.
arXiv Detail & Related papers (2022-09-08T13:00:30Z)
On One-Bit Quantization [27.057313611640918]
We characterize the optimal one-bit quantizer for a continuous-time random process that exhibits low-dimensional structure. We numerically show that this optimal quantizer is found by a neural-network-based compressor trained via gradient descent.
arXiv Detail & Related papers (2022-02-10T19:07:06Z)
Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs) represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications. Current quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of LMs to quantization errors. Novel mixed precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
Distance-aware Quantization [30.06895253269116]
Quantization methods use a rounding function to map full-precision values to the nearest quantized ones. We introduce a novel quantizer, dubbed a distance-aware quantizer (DAQ), that mainly consists of a distance-aware soft rounding (DASR) and a temperature controller.
arXiv Detail & Related papers (2021-08-16T09:25:22Z)
Realization of arbitrary doubly-controlled quantum phase gates [62.997667081978825]
We introduce a high-fidelity gate set inspired by a proposal for near-term quantum advantage in optimization problems. By orchestrating coherent, multi-level control over three transmon qutrits, we synthesize a family of deterministic, continuous-angle quantum phase gates acting in the natural three-qubit computational basis.
arXiv Detail & Related papers (2021-08-03T17:49:09Z)
Variational Quantum Optimization with Multi-Basis Encodings [62.72309460291971]
We introduce a new variational quantum algorithm that benefits from two innovations: multi-basis graph encodings and nonlinear activation functions. Our technique results in increased optimization performance, a factor-of-two increase in effective quantum resources, and a reduction in measurement complexity.
arXiv Detail & Related papers (2021-06-24T20:16:02Z)
Efficient and robust certification of genuine multipartite entanglement in noisy quantum error correction circuits [58.720142291102135]
We introduce a conditional witnessing technique to certify genuine multipartite entanglement (GME). We prove that detecting entanglement in a linear number of bipartitions, using a number of measurements that also scales linearly, suffices to certify GME. We apply our method to the noisy readout of stabilizer operators of the distance-three topological color code and its flag-based fault-tolerant version.
arXiv Detail & Related papers (2020-10-06T18:00:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.