HAS-VQ: Hessian-Adaptive Sparse Vector Quantization for High-Fidelity LLM Compression
- URL: http://arxiv.org/abs/2601.06959v1
- Date: Sun, 11 Jan 2026 15:35:10 GMT
- Title: HAS-VQ: Hessian-Adaptive Sparse Vector Quantization for High-Fidelity LLM Compression
- Authors: Vladimer Khasia
- Abstract summary: We introduce HAS-VQ (Hessian-Adaptive Sparse Vector Quantization), a compression framework that strictly decouples high-sensitivity outliers from the bulk weight distribution. We evaluate HAS-VQ on SmolLM2-1.7B, demonstrating two distinct regimes of superiority.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Post-training quantization is essential for deploying Large Language Models (LLMs) on resource-constrained devices. However, standard integer quantization (e.g., INT4) fundamentally degrades performance by imposing a uniform grid on the heavy-tailed distribution of weight parameters, particularly in smaller-scale models (e.g., <2B parameters). We introduce HAS-VQ (Hessian-Adaptive Sparse Vector Quantization), a compression framework that strictly decouples high-sensitivity outliers from the bulk weight distribution using second-order sensitivity analysis. HAS-VQ employs a Hessian-Masked Decoupling strategy to isolate sensitive parameters, followed by robust Vector Quantization (VQ) of the remaining dense body. Crucially, we introduce a residual sparse feedback mechanism that corrects quantization errors in the most sensitive dimensions, ensuring exact reconstruction of outliers. We evaluate HAS-VQ on SmolLM2-1.7B, demonstrating two distinct regimes of superiority: (1) Pareto Dominance over Integer Baselines: At 4.23 effective bits-per-parameter (BPP), we achieve a perplexity of 14.23, significantly outperforming the standard INT4 baseline (20.03 PPL at 4.71 BPP). (2) High-Fidelity Compression: Relative to the FP16 baseline, HAS-VQ achieves a 2.3x reduction in model size (7.03 BPP) while maintaining statistically indistinguishable perplexity (10.12 vs. 10.04), effectively offering a lossless compression alternative for bandwidth-constrained environments. The code is available at https://github.com/VladimerKhasia/HASVQ
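The pipeline the abstract describes (second-order sensitivity analysis, outlier decoupling, VQ of the dense body) can be sketched roughly as follows. This is a toy illustration, not the released implementation: the diagonal Hessian proxy, the OBS-style saliency score H_ii * w_i^2, the outlier fraction, and the plain k-means codebook are all assumptions made here for clarity.

```python
import numpy as np

def has_vq_sketch(W, H_diag, outlier_frac=0.01, codebook_size=16,
                  dim=4, iters=10, seed=0):
    """Toy sketch: decouple Hessian-sensitive outliers, then vector-quantize
    the dense remainder with k-means. H_diag stands in for the diagonal of a
    layer Hessian (e.g., estimated from calibration data); all names and
    hyperparameters are illustrative, not the paper's."""
    rng = np.random.default_rng(seed)
    w = W.ravel().astype(np.float64)
    # Second-order sensitivity proxy: s_i = H_ii * w_i^2 (OBS-style saliency).
    sens = H_diag.ravel() * w ** 2
    k = max(1, int(outlier_frac * w.size))
    outlier_idx = np.argpartition(sens, -k)[-k:]   # top-k most sensitive weights
    mask = np.zeros(w.size, dtype=bool)
    mask[outlier_idx] = True
    outliers = w[mask]                             # stored exactly (sparse side)
    body = w[~mask].copy()                         # dense body to be VQ'd
    # Pad the dense body into `dim`-dimensional vectors for VQ.
    pad = (-body.size) % dim
    vecs = np.concatenate([body, np.zeros(pad)]).reshape(-1, dim)
    # Plain k-means codebook (the paper's robust VQ details may differ).
    codebook = vecs[rng.choice(len(vecs), codebook_size, replace=False)]
    for _ in range(iters):
        d = ((vecs[:, None, :] - codebook[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for c in range(codebook_size):
            sel = assign == c
            if sel.any():
                codebook[c] = vecs[sel].mean(0)
    # Final assignment, then reconstruct: quantized body + exact outliers.
    assign = ((vecs[:, None, :] - codebook[None]) ** 2).sum(-1).argmin(1)
    w_hat = codebook[assign].ravel()[:body.size]
    out = np.empty_like(w)
    out[~mask] = w_hat
    out[mask] = outliers
    return out.reshape(W.shape)
```

Because the masked outliers bypass the codebook entirely, they reconstruct bit-exactly, which mirrors the abstract's claim of exact outlier reconstruction; the effective BPP is then a blend of the codebook index width and the sparse outlier storage.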
Related papers
- Preserving Continuous Symmetry in Discrete Spaces: Geometric-Aware Quantization for SO(3)-Equivariant GNNs [12.753341915660073]
We propose a Geometric-Aware Quantization (GAQ) framework that compresses and accelerates equivariant models. On consumer hardware, GAQ achieves 2.39x inference speedup and 4x memory reduction, enabling stable, energy-conserving molecular dynamics simulations.
arXiv Detail & Related papers (2026-03-05T16:20:21Z)
- SPQ: An Ensemble Technique for Large Language Model Compression [1.2891210250935148]
SPQ (SVD-Pruning-Quantization) is an ensemble technique for large language model (LLM) compression. It achieves up to 75% memory reduction while maintaining or improving perplexity, and improves inference over GPTQ, achieving up to a 1.9x speedup.
arXiv Detail & Related papers (2026-02-20T18:44:16Z)
- BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models [56.504879072674015]
We propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients. BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit).
arXiv Detail & Related papers (2026-02-04T02:54:37Z)
- D$^2$Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs [33.883527341335856]
Weight-only post-training quantization (PTQ) is appealing as it reduces memory usage and enables practical speedup without low-bit operators or specialized hardware. However, accuracy often degrades significantly in weight-only PTQ at sub-4-bit precision. We propose D$^2$Quant, a novel weight-only PTQ framework that improves quantization from both the weight and activation perspectives.
arXiv Detail & Related papers (2026-01-30T05:49:48Z)
- Intrinsic Structure as a Proxy for Saliency: SVD-Based Weight Preservation for Mixed-Precision Quantization in Large Language Models [0.0]
Post-Training Quantization (PTQ) addresses this by reducing the precision of model weights, typically to 4-bit or lower. Current state-of-the-art methods rely on calibration data to identify salient weights. We propose a data-free, structure-aware hypothesis: that the weights identified as Principal Components via Singular Value Decomposition (SVD) are intrinsically important to the model's downstream performance.
arXiv Detail & Related papers (2025-12-01T06:58:30Z)
- PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
- Quantized Visual Geometry Grounded Transformer [67.15451442018258]
This paper proposes the first quantization framework for VGGTs, namely QuantVGGT. We introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing. We also design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics.
arXiv Detail & Related papers (2025-09-25T15:17:11Z)
- RSAVQ: Riemannian Sensitivity-Aware Vector Quantization for Large Language Models [17.273189597394072]
Large language models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. Their exponentially increasing parameters pose significant challenges for deployment on resource-constrained devices. We propose RSAVQ, a novel framework to enhance extremely low-bit quantization for LLMs.
arXiv Detail & Related papers (2025-09-24T01:40:32Z)
- IMPQ: Interaction-Aware Layerwise Mixed Precision Quantization for LLMs [4.655407920049974]
Mixed-precision quantization offers a compelling solution, but existing methods struggle when the average precision drops below four bits. In this paper, we propose two innovations to address these limitations. First, we frame the mixed-precision quantization problem as a cooperative game among layers and introduce Shapley-based Progressive Quantization Estimation (SPQE). Second, we propose Interaction-aware Mixed-Precision Quantization (IMPQ), which translates these Shapley estimates into a binary quadratic optimization formulation.
arXiv Detail & Related papers (2025-09-18T21:59:40Z)
- PCDVQ: Enhancing Vector Quantization for Large Language Models via Polar Coordinate Decoupling [53.91873442457923]
Vector Quantization (VQ) serves as a prevalent solution to this issue for its extremely low-bit (even 2-bit) representation and considerable accuracy. This paper proposes Polar Coordinate Decoupled Vector Quantization (PCDVQ), an effective and efficient VQ framework. Experimental results show that PCDVQ outperforms baseline methods at the 2-bit level by at least 1.5% zero-shot accuracy.
arXiv Detail & Related papers (2025-06-05T08:58:58Z)
- PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models [64.84734437930362]
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub-2-bit) quantization. We propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61 bits for the first time. Experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
arXiv Detail & Related papers (2025-02-18T08:04:58Z)
- ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals [10.860081994662645]
Post-training quantization of large language models (LLMs) holds promise for reducing the prohibitive computational cost at inference time. We propose ResQ, a PTQ method that pushes further the state of the art. We demonstrate that ResQ outperforms recent uniform and mixed-precision PTQ methods on a variety of benchmarks.
arXiv Detail & Related papers (2024-12-18T22:01:55Z)
- SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models [61.474101404805545]
Diffusion models can generate high-quality images, but as they scale, rising memory demands and higher latency pose deployment challenges. We propose SVDQuant, a new 4-bit quantization paradigm to overcome this limitation. We reduce the memory usage for the 12B FLUX.1 models by 3.5x, achieving a 3.0x speedup over the 4-bit weight-only quantization (W4A16) baseline.
arXiv Detail & Related papers (2024-11-07T18:59:58Z)
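The SVDQuant entry above describes absorbing outliers into a low-rank component before 4-bit quantization. A minimal sketch of that idea, assuming a plain truncated SVD and a symmetric INT4 grid (the rank, scaling scheme, and grid here are illustrative choices, not the paper's implementation):

```python
import numpy as np

def lowrank_plus_int4(W, rank=8):
    """Toy SVDQuant-style decomposition: W ~ L1 @ L2 + dequant(Q).
    The low-rank term absorbs the largest singular directions (where
    outliers concentrate), leaving a residual that is easier to
    quantize on a symmetric 4-bit grid."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]            # (m, r), kept in high precision
    L2 = Vt[:rank]                         # (r, n), kept in high precision
    R = W - L1 @ L2                        # residual to quantize
    scale = np.abs(R).max() / 7.0          # symmetric INT4 levels: -7..7
    scale = scale if scale > 0 else 1.0
    Q = np.clip(np.round(R / scale), -7, 7).astype(np.int8)
    return L1, L2, Q, scale

def reconstruct(L1, L2, Q, scale):
    """Dequantize the residual and add back the low-rank component."""
    return L1 @ L2 + Q.astype(np.float64) * scale
```

On a weight matrix with an injected outlier, this decomposition should reconstruct with lower error than naive INT4 rounding, since the outlier no longer dictates the quantization scale of the residual.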
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.