SBVR: Summation of BitVector Representation for Efficient LLM Quantization
- URL: http://arxiv.org/abs/2509.18172v1
- Date: Wed, 17 Sep 2025 13:51:27 GMT
- Title: SBVR: Summation of BitVector Representation for Efficient LLM Quantization
- Authors: Wonjun Bang, Jongseok Park, Hongseung Yu, Kyungmin Bin, Kyunghan Lee,
- Abstract summary: Quantization achieves compression by limiting the number of representable points in the data, so selecting those points well is the key to efficient quantization. Existing Post-Training Quantization (PTQ) solutions adopt two major approaches to this problem: Round-To-Nearest (RTN)-based methods and codebook-based methods. We propose SBVR (Summation of BitVector Representation), which enables Gaussian-like code representation in a hardware-friendly manner for fast inference.
- Score: 3.7018544730078413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the advent of large language models (LLMs), numerous Post-Training Quantization (PTQ) strategies have been proposed to alleviate deployment barriers created by their enormous parameter counts. Quantization achieves compression by limiting the number of representable points in the data. Therefore, the key to achieving efficient quantization is selecting the optimal combination of representation points, or codes, for the given data. Existing PTQ solutions adopt two major approaches to this problem: Round-To-Nearest (RTN)-based methods and codebook-based methods. RTN-based methods map LLM weights onto uniformly distributed integer grids, failing to account for the Gaussian-like distribution of LLM weights. Codebook-based methods mitigate this issue by constructing distribution-aware codebooks; however, they suffer from random and strided memory access patterns, resulting in degraded inference speed that is exacerbated by the limited size of the GPU L1 cache. To overcome these limitations, we propose a novel LLM quantization method, SBVR (Summation of BitVector Representation), which enables Gaussian-like code representation in a hardware-friendly manner for fast inference. SBVR maps weight values to non-uniform representation points whose distribution follows the actual distribution of LLM weights, enabling more accurate compression. Additionally, we design a custom CUDA kernel that allows matrix-vector multiplication directly in the SBVR format without decompression, thereby enabling high-performance execution of SBVR-compressed models. Our evaluations of SBVR on various models demonstrate state-of-the-art perplexity and accuracy benchmark performance while delivering a 2.21x-3.04x end-to-end token-generation speedup over naive FP16 models in the 4-bit quantization regime.
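The abstract describes the SBVR format only at a high level. One plausible reading of a "summation of bitvectors" is a bit-plane decomposition W ≈ c_1*B_1 + ... + c_K*B_K with binary planes B_k, which yields data-dependent, non-uniform code values and lets a matrix-vector product be computed plane by plane without ever rebuilding FP16 weights. The NumPy sketch below illustrates that reading only; the greedy sign-plane fit, the choice of coefficients, and all names are illustrative assumptions, not the paper's actual SBVR construction or its CUDA kernel.

```python
# Hypothetical "sum of bitvectors" weight code, inferred only from the abstract
# above -- NOT the paper's actual SBVR format or kernel.
import numpy as np

rng = np.random.default_rng(0)

# Toy weight matrix with a Gaussian-like distribution, standing in for an LLM layer.
W = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)

K = 4  # number of bit-planes, i.e. roughly 4 bits of storage per weight

# Greedy residual fit: W ~= sum_k c_k * B_k with B_k in {-1, +1}.
# The realizable code values {sum of +/- c_k} are determined by the data rather
# than by a fixed uniform integer grid, which is the contrast the abstract draws
# with RTN-based methods.
bitplanes, coeffs = [], []
residual = W.copy()
for _ in range(K):
    B = np.sign(residual)
    B[B == 0] = 1.0
    c = float(np.abs(residual).mean())   # optimal scale for a fixed sign plane
    bitplanes.append(B.astype(np.int8))  # +/-1 per weight; 1 bit each once packed
    coeffs.append(c)
    residual -= c * B

def matvec_without_decompression(x):
    # y = W_hat @ x computed plane by plane; the FP16 matrix is never rebuilt.
    # The paper does the analogous computation with a custom CUDA kernel on
    # packed bitvectors.
    y = np.zeros(W.shape[0], dtype=np.float32)
    for c, B in zip(coeffs, bitplanes):
        y += c * (B.astype(np.float32) @ x)
    return y

x = rng.normal(size=W.shape[1]).astype(np.float32)
ref = W @ x
approx = matvec_without_decompression(x)
print("relative matvec error:", np.linalg.norm(approx - ref) / np.linalg.norm(ref))
```

The point of the last function is the property the abstract highlights: the product distributes over the planes, so inference can run directly on the compressed representation instead of decompressing back to FP16.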
Related papers
- Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression [57.54335545892155]
We introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook. Our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines.
arXiv Detail & Related papers (2025-10-23T20:19:48Z)
- Quant-dLLM: Post-Training Extreme Low-Bit Quantization for Diffusion Large Language Models [47.41616630151171]
Diffusion large language models (dLLMs) offer bidirectional context and flexible masked-denoising generation. We propose Quant-dLLM, an ultra-low-bit PTQ framework tailored to dLLMs. Quant-dLLM consistently achieves higher accuracy than state-of-the-art (SOTA) AR-transfer PTQ methods on dLLMs.
arXiv Detail & Related papers (2025-09-27T13:50:42Z)
- FineQ: Software-Hardware Co-Design for Low-Bit Fine-Grained Mixed-Precision Quantization of LLMs [13.951330786310262]
FineQ is a software-hardware co-design for low-bit fine-grained mixed-precision quantization of large language models. It partitions the weights into finer-grained clusters and considers the distribution of outliers within these clusters. It achieves higher model accuracy than the SOTA mixed-precision quantization algorithm at a close average bit-width.
arXiv Detail & Related papers (2025-04-28T12:47:23Z)
- GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models [2.1388885579612804]
GANQ is a layer-wise post-training non-uniform quantization framework optimized for hardware-efficient lookup table-based mpGEMM. Extensive experiments demonstrate GANQ's ability to reduce the perplexity gap from the FP16 baseline compared to state-of-the-art methods for both 3-bit and 4-bit quantization.
arXiv Detail & Related papers (2025-01-22T15:29:09Z)
- The Power of Negative Zero: Datatype Customization for Quantized Large Language Models [5.503925076208333]
Post-training quantization serves as one of the most hardware-efficient methods to mitigate the memory and computational demands of large language models (LLMs). In this paper, we extend the basic FP datatype to perform Redundant Zero Remapping (RaZeR). RaZeR remaps the negative zero FP encoding to a set of pre-defined special values to maximally utilize FP quantization encodings and to better fit numerical distributions.
arXiv Detail & Related papers (2025-01-06T22:40:40Z)
- Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "linearity theorem" establishing a direct relationship between the layer-wise $\ell_2$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
arXiv Detail & Related papers (2024-11-26T15:35:44Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs. Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
arXiv Detail & Related papers (2024-02-17T14:26:57Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for single-batch generative inference with LLMs is memory bandwidth rather than compute.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format (a rough illustrative sketch of this split appears after this list).
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
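The SqueezeLLM entry above is the one that motivates the dense-and-sparse idea: outlier and sensitive weights are kept in a sparse full-precision structure while the remaining dense part is quantized. The sketch below shows only that generic split with a plain round-to-nearest dense part; the outlier threshold, bit-width, and function names are illustrative assumptions, not SqueezeLLM's actual sensitivity-based algorithm.

```python
# Illustrative dense-and-sparse split: pull the largest-magnitude weights into a
# sparse full-precision matrix and round the remaining dense part to a low-bit
# uniform grid. Generic sketch only, not SqueezeLLM's implementation (which uses
# sensitivity-based non-uniform codes rather than plain RTN).
import numpy as np
from scipy.sparse import csr_matrix

def dense_and_sparse_quantize(W, outlier_fraction=0.005, bits=3):
    W = W.astype(np.float32)
    # Treat the top outlier_fraction of weights by magnitude as outliers.
    cutoff = np.quantile(np.abs(W), 1.0 - outlier_fraction)
    outlier_mask = np.abs(W) >= cutoff
    sparse_part = csr_matrix(np.where(outlier_mask, W, 0.0))  # kept in full precision

    # Quantize the remaining dense part with simple symmetric round-to-nearest.
    dense = np.where(outlier_mask, 0.0, W)
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(dense).max() / levels
    q = np.clip(np.round(dense / scale), -levels, levels).astype(np.int8)
    return q, scale, sparse_part

def matvec(q, scale, sparse_part, x):
    # y = (quantized dense part + sparse outlier part) @ x
    return scale * (q.astype(np.float32) @ x) + sparse_part @ x

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(128, 128))
W[rng.integers(0, 128, 20), rng.integers(0, 128, 20)] += 0.5  # inject a few outliers
q, scale, sp = dense_and_sparse_quantize(W)
x = rng.normal(size=128).astype(np.float32)
err = np.linalg.norm(matvec(q, scale, sp, x) - W @ x) / np.linalg.norm(W @ x)
print("relative error with outliers extracted:", err)
```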
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including the generated summaries) and is not responsible for any consequences of its use.