SAQ: Pushing the Limits of Vector Quantization through Code Adjustment and Dimension Segmentation
- URL: http://arxiv.org/abs/2509.12086v1
- Date: Mon, 15 Sep 2025 16:14:05 GMT
- Title: SAQ: Pushing the Limits of Vector Quantization through Code Adjustment and Dimension Segmentation
- Authors: Hui Li, Shiyuan Deng, Xiao Yan, Xiangyu Zhi, James Cheng
- Abstract summary: Approximate Nearest Neighbor Search (ANNS) plays a critical role in applications such as search engines, recommender systems, and RAG for LLMs. Vector quantization (VQ) is commonly used to reduce space overhead and accelerate distance computations. We propose a novel VQ method called SAQ to balance encoding efficiency and quantization accuracy. We show that SAQ achieves up to an 80% reduction in quantization error and accelerates encoding by over 80x compared to Extended RabitQ.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Approximate Nearest Neighbor Search (ANNS) plays a critical role in applications such as search engines, recommender systems, and RAG for LLMs. Vector quantization (VQ), a crucial technique for ANNS, is commonly used to reduce space overhead and accelerate distance computations. However, despite significant research advances, state-of-the-art VQ methods still face challenges in balancing encoding efficiency and quantization accuracy. To address these limitations, we propose a novel VQ method called SAQ. To improve accuracy, SAQ employs a new dimension segmentation technique to strategically partition PCA-projected vectors into segments along their dimensions. By prioritizing leading dimension segments with larger magnitudes, SAQ allocates more bits to high-impact segments, optimizing the use of the available space quota. An efficient dynamic programming algorithm is developed to optimize dimension segmentation and bit allocation, ensuring minimal quantization error. To speed up vector encoding, SAQ devises a code adjustment technique to first quantize each dimension independently and then progressively refine quantized vectors using a coordinate-descent-like approach to avoid exhaustive enumeration. Extensive experiments demonstrate SAQ's superiority over classical methods (e.g., PQ, PCA) and recent state-of-the-art approaches (e.g., LVQ, Extended RabitQ). SAQ achieves up to an 80% reduction in quantization error and accelerates encoding by over 80x compared to Extended RabitQ.
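The code adjustment idea in the abstract, quantize each dimension independently, then progressively refine codes with coordinate-descent-style sweeps, can be illustrated with a toy sketch. Everything below (the [-1, 1] value range, the cosine-similarity objective, the function names) is a hypothetical simplification, not the paper's actual algorithm:

```python
import math

def dequant(code, bits):
    """Map an integer code to a reconstruction value in [-1, 1]."""
    levels = (1 << bits) - 1
    return 2.0 * code / levels - 1.0

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def encode(vec, bits=3, sweeps=2):
    levels = (1 << bits) - 1
    # Step 1: quantize each dimension independently (fast initialization).
    codes = [min(levels, max(0, round((v + 1.0) / 2.0 * levels))) for v in vec]
    # Step 2: coordinate-descent refinement -- revisit one code at a time
    # and keep the value that best aligns the full reconstruction with vec.
    for _ in range(sweeps):
        for i in range(len(vec)):
            recon = [dequant(c, bits) for c in codes]
            best_c, best_sim = codes[i], cosine(vec, recon)
            for c in range(levels + 1):
                recon[i] = dequant(c, bits)
                sim = cosine(vec, recon)
                if sim > best_sim:
                    best_c, best_sim = c, sim
            codes[i] = best_c
    return codes
```

Because the initial codes are always among the candidates, each sweep can only keep or improve the alignment, which is the appeal of this refinement style over exhaustive enumeration of joint codes.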
Related papers
- Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression [57.54335545892155]
We introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook. Our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines.
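As a rough illustration of the per-group-codebook idea: each group of weights is quantized against its own small codebook. This toy nearest-codeword version is hypothetical and omits what makes GLVQ distinctive (learned lattice codebooks):

```python
def grouped_quantize(weights, codebooks, group_size):
    """Quantize each group of weights to its nearest codeword from that
    group's own codebook (toy sketch; GLVQ learns lattice codebooks)."""
    out = []
    for g, start in enumerate(range(0, len(weights), group_size)):
        group = weights[start:start + group_size]
        # Pick the codeword minimizing squared error for this group.
        best = min(codebooks[g],
                   key=lambda cw: sum((w - c) ** 2 for w, c in zip(group, cw)))
        out.extend(best)
    return out
```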
arXiv Detail & Related papers (2025-10-23T20:19:48Z) - ZeroQAT: Your Quantization-aware Training but Efficient [53.25965863436039]
Quantization is an effective technique to reduce the deployment cost of large language models (LLMs). Existing low-bit PTQ methods suffer from accuracy degradation because their layer-wise optimization introduces cumulative error propagation and misalignment between local reconstruction objectives and downstream performance. We propose ZeroQAT, a zeroth-order optimization-based QAT framework.
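A minimal sketch of the kind of zeroth-order gradient estimation such a framework builds on; this is the generic SPSA-style two-point estimator, not ZeroQAT's exact method:

```python
import random

def spsa_grad(f, x, eps=1e-3):
    """Zeroth-order (SPSA-style) gradient estimate from only two function
    evaluations, with a random simultaneous perturbation direction."""
    delta = [random.choice([-1.0, 1.0]) for _ in x]
    xp = [xi + eps * di for xi, di in zip(x, delta)]
    xm = [xi - eps * di for xi, di in zip(x, delta)]
    diff = (f(xp) - f(xm)) / (2 * eps)
    return [diff / di for di in delta]
```

The point of such estimators is that they need no backpropagation through the quantized model, only forward evaluations.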
arXiv Detail & Related papers (2025-08-21T01:18:27Z) - The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm [52.89358421626026]
GPTQ emerged as one of the standard methods for one-shot post-training quantization at LLM scale. We show that GPTQ is mathematically identical to Babai's nearest plane algorithm for the classical closest vector problem.
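Babai's nearest plane algorithm itself is classical and easy to sketch; the toy pure-Python version below (helper names are my own) shows the greedy round-and-project structure that the paper identifies inside GPTQ:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(basis):
    """Orthogonalize the basis rows (no normalization), as Babai requires."""
    gs = []
    for b in basis:
        v = list(b)
        for g in gs:
            coef = dot(b, g) / dot(g, g)
            v = [vi - coef * gi for vi, gi in zip(v, g)]
        gs.append(v)
    return gs

def babai_nearest_plane(basis, target):
    """Greedily round the target onto successive sublattice planes,
    yielding a nearby (not necessarily closest) lattice point."""
    gs = gram_schmidt(basis)
    t = list(target)
    coeffs = [0] * len(basis)
    for i in reversed(range(len(basis))):
        c = round(dot(t, gs[i]) / dot(gs[i], gs[i]))
        coeffs[i] = c
        t = [tj - c * bj for tj, bj in zip(t, basis[i])]
    # Lattice point = sum_i coeffs[i] * basis[i] = target minus residual t.
    return [x - r for x, r in zip(target, t)], coeffs
```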
arXiv Detail & Related papers (2025-07-24T16:22:18Z) - PCDVQ: Enhancing Vector Quantization for Large Language Models via Polar Coordinate Decoupling [53.91873442457923]
Vector Quantization (VQ) is a prevalent solution to this issue, offering extremely low bit-widths (even 2-bit) with considerable accuracy. This paper proposes Polar Coordinate Decoupled Vector Quantization (PCDVQ), an effective and efficient VQ framework. Experimental results show that PCDVQ outperforms baseline methods at the 2-bit level by at least 1.5% zero-shot accuracy.
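The decoupling idea can be sketched in a few lines: split a vector into magnitude and unit direction and quantize each against its own codebook. The codebooks below are toy placeholders, not PCDVQ's actual construction:

```python
import math

def polar_decouple_quantize(vec, mag_codebook, dir_codebook):
    """Decouple a nonzero vector into magnitude and direction, quantize
    each independently, and return the reconstruction (toy sketch)."""
    mag = math.sqrt(sum(v * v for v in vec))
    direction = [v / mag for v in vec]
    q_mag = min(mag_codebook, key=lambda m: abs(m - mag))
    q_dir = min(dir_codebook,
                key=lambda d: sum((di - ui) ** 2
                                  for di, ui in zip(d, direction)))
    return [q_mag * di for di in q_dir]
```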
arXiv Detail & Related papers (2025-06-05T08:58:58Z) - Fast correlated decoding of transversal logical algorithms [67.01652927671279]
Quantum error correction (QEC) is required for large-scale computation but incurs a significant resource overhead. Recent advances have shown that by jointly decoding logical qubits in algorithms composed of logical gates, the number of syndrome extraction rounds can be reduced. Here, we reformulate the problem of decoding such circuits by directly decoding relevant logical operator products as they propagate through the circuit.
arXiv Detail & Related papers (2025-05-19T18:00:00Z) - FineQ: Software-Hardware Co-Design for Low-Bit Fine-Grained Mixed-Precision Quantization of LLMs [13.951330786310262]
FineQ is a software-hardware co-design for low-bit fine-grained mixed-precision quantization of large language models. It partitions the weights into finer-grained clusters and considers the distribution of outliers within these clusters. It achieves higher model accuracy than the SOTA mixed-precision quantization algorithm at a comparable average bit-width.
arXiv Detail & Related papers (2025-04-28T12:47:23Z) - MVQ:Towards Efficient DNN Compression and Acceleration with Masked Vector Quantization [8.057807176915896]
A novel approach called MVQ is proposed, which aims to better approximate important weights with a limited number of codewords. The algorithm is validated on various models for image classification, object detection, and segmentation tasks. Under ASIC evaluation, the MVQ accelerator boosts energy efficiency by 2.3x and reduces the size of the systolic array by 55% compared with the base EWS accelerator.
arXiv Detail & Related papers (2024-12-13T16:30:35Z) - Practical and Asymptotically Optimal Quantization of High-Dimensional Vectors in Euclidean Space for Approximate Nearest Neighbor Search [30.003470912691096]
We introduce a new quantization method that extends RaBitQ, inheriting its theoretical guarantees and achieving asymptotic optimality in the trade-off between space and error bounds.
Our method consistently outperforms the state-of-the-art baselines in both accuracy and efficiency when using the same amount of memory.
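A heavily simplified flavor of this line of work: one sign bit per dimension plus a single per-vector scale. Real RaBitQ additionally applies a random rotation and carries error guarantees; the names below are illustrative only:

```python
def one_bit_encode(vec):
    """Simplified 1-bit-per-dimension quantization: store one sign bit per
    dimension plus one scalar per vector. The scale that minimizes squared
    reconstruction error is the mean absolute value."""
    d = len(vec)
    bits = [1 if v >= 0 else 0 for v in vec]
    scale = sum(abs(v) for v in vec) / d
    return bits, scale

def one_bit_decode(bits, scale):
    return [scale if b else -scale for b in bits]
```

Storing d bits plus one float is what makes such codes so cheap to scan during ANNS distance estimation.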
arXiv Detail & Related papers (2024-09-16T01:06:23Z) - HyperVQ: MLR-based Vector Quantization in Hyperbolic Space [56.4245885674567]
A common solution is to employ Vector Quantization (VQ) within VQ Variational Autoencoders (VQVAEs). We introduce HyperVQ, a novel approach that formulates VQ as a hyperbolic Multinomial Logistic Regression (MLR) problem. Our experiments demonstrate that HyperVQ matches traditional VQ in generative and reconstruction tasks while surpassing it in discriminative performance.
arXiv Detail & Related papers (2024-03-18T03:17:08Z) - Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
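Additive quantization represents a vector as a sum of one codeword per codebook. The sketch below uses a greedy residual heuristic for encoding; AQLM's actual encoder is more sophisticated (it optimizes the codes jointly):

```python
def greedy_additive_encode(vec, codebooks):
    """Encode a vector as a sum of one codeword per codebook, choosing each
    codeword greedily against the current residual (simplified stand-in
    for additive quantization)."""
    residual = list(vec)
    codes = []
    for cb in codebooks:
        idx = min(range(len(cb)),
                  key=lambda j: sum((r - c) ** 2
                                    for r, c in zip(residual, cb[j])))
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes, residual
```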
arXiv Detail & Related papers (2024-01-11T18:54:44Z) - Soft Convex Quantization: Revisiting Vector Quantization with Convex Optimization [40.1651740183975]
We propose Soft Convex Quantization (SCQ) as a direct substitute for Vector Quantization (VQ).
SCQ works like a differentiable convex optimization (DCO) layer.
We demonstrate its efficacy on the CIFAR-10, GTSRB and LSUN datasets.
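The convex flavor can be mimicked with a softmax over negative distances, producing a convex combination of codewords. This is a rough differentiable stand-in, not SCQ's actual convex-optimization solver:

```python
import math

def soft_convex_assign(x, codebook, temperature=1.0):
    """Softly assign a vector to a convex combination of codewords via a
    softmax over negative squared distances (toy sketch)."""
    dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in codebook]
    m = min(dists)  # subtract the min for numerical stability
    w = [math.exp(-(d - m) / temperature) for d in dists]
    z = sum(w)
    weights = [wi / z for wi in w]
    recon = [sum(weights[k] * codebook[k][i] for k in range(len(codebook)))
             for i in range(len(x))]
    return weights, recon
```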
arXiv Detail & Related papers (2023-10-04T17:45:14Z) - CSMPQ: Class Separability Based Mixed-Precision Quantization [9.005098065862411]
A novel mixed-precision quantization method, termed CSMPQ, is proposed.
Specifically, the TF-IDF metric that is widely used in natural language processing (NLP) is introduced to measure the class separability of layer-wise feature maps.
Without any iterative process, the proposed CSMPQ achieves better compression trade-offs than the state-of-the-art quantization methods.
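For reference, the TF-IDF statistic mentioned above, computed here over plain token lists. CSMPQ applies it to layer-wise feature maps rather than text; this is just the standard NLP definition:

```python
import math

def tf_idf(docs):
    """Compute per-document TF-IDF scores: term frequency times the log
    inverse document frequency. `docs` is a list of token lists."""
    n = len(docs)
    df = {}
    for doc in docs:
        for tok in set(doc):
            df[tok] = df.get(tok, 0) + 1
    scores = []
    for doc in docs:
        tf = {}
        for tok in doc:
            tf[tok] = tf.get(tok, 0) + 1
        scores.append({t: (c / len(doc)) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores
```

Terms appearing in every document get a score of zero, which is exactly the "low separability" signal CSMPQ exploits.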
arXiv Detail & Related papers (2022-12-20T12:52:19Z) - Accelerating variational quantum algorithms with multiple quantum processors [78.36566711543476]
Variational quantum algorithms (VQAs) have the potential of utilizing near-term quantum machines to gain certain computational advantages.
Modern VQAs suffer from cumbersome computational overhead, hampered by the convention of employing a single quantum processor to handle large amounts of data.
Here we devise an efficient distributed optimization scheme, called QUDIO, to address this issue.
arXiv Detail & Related papers (2021-06-24T08:18:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.