HAS-VQ: Hessian-Adaptive Sparse Vector Quantization for High-Fidelity LLM Compression
- URL: http://arxiv.org/abs/2601.06959v1
- Date: Sun, 11 Jan 2026 15:35:10 GMT
- Title: HAS-VQ: Hessian-Adaptive Sparse Vector Quantization for High-Fidelity LLM Compression
- Authors: Vladimer Khasia
- Abstract summary: We introduce HAS-VQ (Hessian-Adaptive Sparse Vector Quantization), a compression framework that strictly decouples high-sensitivity outliers from the bulk weight distribution. We evaluate HAS-VQ on SmolLM2-1.7B, demonstrating two distinct regimes of superiority.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Post-training quantization is essential for deploying Large Language Models (LLMs) on resource-constrained devices. However, standard integer quantization (e.g., INT4) fundamentally degrades performance by imposing a uniform grid on the heavy-tailed distribution of weight parameters, particularly in smaller-scale models (e.g., <2B parameters). We introduce HAS-VQ (Hessian-Adaptive Sparse Vector Quantization), a compression framework that strictly decouples high-sensitivity outliers from the bulk weight distribution using second-order sensitivity analysis. HAS-VQ employs a Hessian-Masked Decoupling strategy to isolate sensitive parameters, followed by robust Vector Quantization (VQ) of the remaining dense body. Crucially, we introduce a residual sparse feedback mechanism that corrects quantization errors in the most sensitive dimensions, ensuring exact reconstruction of outliers. We evaluate HAS-VQ on SmolLM2-1.7B, demonstrating two distinct regimes of superiority: (1) Pareto Dominance over Integer Baselines: At 4.23 effective bits-per-parameter (BPP), we achieve a perplexity of 14.23, significantly outperforming the standard INT4 baseline (20.03 PPL at 4.71 BPP). (2) High-Fidelity Compression: Relative to the FP16 baseline, HAS-VQ achieves a 2.3x reduction in model size (7.03 BPP) while maintaining statistically indistinguishable perplexity (10.12 vs. 10.04), effectively offering a lossless compression alternative for bandwidth-constrained environments. The code is available at https://github.com/VladimerKhasia/HASVQ
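The pipeline the abstract describes (second-order sensitivity analysis, outlier decoupling, VQ of the dense body) can be sketched roughly as follows. This is a toy illustration, not the released implementation: the diagonal Hessian proxy, the OBS-style saliency score H_ii * w_i^2, the outlier fraction, and the plain k-means codebook are all assumptions made here for clarity.

```python
import numpy as np

def has_vq_sketch(W, H_diag, outlier_frac=0.01, codebook_size=16,
                  dim=4, iters=10, seed=0):
    """Toy sketch: decouple Hessian-sensitive outliers, then vector-quantize
    the dense remainder with k-means. H_diag stands in for the diagonal of a
    layer Hessian (e.g., estimated from calibration data); all names and
    hyperparameters are illustrative, not the paper's."""
    rng = np.random.default_rng(seed)
    w = W.ravel().astype(np.float64)
    # Second-order sensitivity proxy: s_i = H_ii * w_i^2 (OBS-style saliency).
    sens = H_diag.ravel() * w ** 2
    k = max(1, int(outlier_frac * w.size))
    outlier_idx = np.argpartition(sens, -k)[-k:]   # top-k most sensitive weights
    mask = np.zeros(w.size, dtype=bool)
    mask[outlier_idx] = True
    outliers = w[mask]                             # stored exactly (sparse side)
    body = w[~mask].copy()                         # dense body to be VQ'd
    # Pad the dense body into `dim`-dimensional vectors for VQ.
    pad = (-body.size) % dim
    vecs = np.concatenate([body, np.zeros(pad)]).reshape(-1, dim)
    # Plain k-means codebook (the paper's robust VQ details may differ).
    codebook = vecs[rng.choice(len(vecs), codebook_size, replace=False)]
    for _ in range(iters):
        d = ((vecs[:, None, :] - codebook[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for c in range(codebook_size):
            sel = assign == c
            if sel.any():
                codebook[c] = vecs[sel].mean(0)
    # Final assignment, then reconstruct: quantized body + exact outliers.
    assign = ((vecs[:, None, :] - codebook[None]) ** 2).sum(-1).argmin(1)
    w_hat = codebook[assign].ravel()[:body.size]
    out = np.empty_like(w)
    out[~mask] = w_hat
    out[mask] = outliers
    return out.reshape(W.shape)
```

Because the masked outliers bypass the codebook entirely, they reconstruct bit-exactly, which mirrors the abstract's claim of exact outlier reconstruction; the effective BPP is then a blend of the codebook index width and the sparse outlier storage.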
Related papers
- Preserving Continuous Symmetry in Discrete Spaces: Geometric-Aware Quantization for SO(3)-Equivariant GNNs [12.753341915660073]
We propose a Geometric-Aware Quantization (GAQ) framework that compresses and accelerates equivariant models. On consumer hardware, GAQ achieves 2.39x inference speedup and 4x memory reduction, enabling stable, energy-conserving molecular dynamics simulations.
arXiv Detail & Related papers (2026-03-05T16:20:21Z)
- SPQ: An Ensemble Technique for Large Language Model Compression [1.2891210250935148]
SPQ (SVD-Pruning-Quantization) is an ensemble technique for large language model (LLM) compression. It achieves up to 75% memory reduction while maintaining or improving perplexity, and improves inference over GPTQ, achieving up to a 1.9x speedup.
arXiv Detail & Related papers (2026-02-20T18:44:16Z)
- BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models [56.504879072674015]
We propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients. BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit).
arXiv Detail & Related papers (2026-02-04T02:54:37Z)
- D$^2$Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs [33.883527341335856]
Weight-only post-training quantization (PTQ) is appealing as it reduces memory usage and enables practical speedup without low-bit operators or specialized hardware. However, accuracy often degrades significantly in weight-only PTQ at sub-4-bit precision. We propose D$^2$Quant, a novel weight-only PTQ framework that improves quantization from both the weight and activation perspectives.
arXiv Detail & Related papers (2026-01-30T05:49:48Z)
- Intrinsic Structure as a Proxy for Saliency: SVD-Based Weight Preservation for Mixed-Precision Quantization in Large Language Models [0.0]
Post-Training Quantization (PTQ) addresses this by reducing the precision of model weights, typically to 4-bit or lower. Current state-of-the-art methods rely on calibration data to identify salient weights. We propose a data-free, structure-aware hypothesis: that the weights identified as Principal Components via Singular Value Decomposition (SVD) are intrinsically important to the model's downstream performance.
arXiv Detail & Related papers (2025-12-01T06:58:30Z)
- PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
- Quantized Visual Geometry Grounded Transformer [67.15451442018258]
This paper proposes the first quantization framework for VGGTs, namely QuantVGGT. We introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing. We also design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics.
arXiv Detail & Related papers (2025-09-25T15:17:11Z)
- RSAVQ: Riemannian Sensitivity-Aware Vector Quantization for Large Language Models [17.273189597394072]
Large language models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. Their exponentially increasing parameters pose significant challenges for deployment on resource-constrained devices. We propose RSAVQ, a novel framework to enhance extremely low-bit quantization for LLMs.
arXiv Detail & Related papers (2025-09-24T01:40:32Z)
- IMPQ: Interaction-Aware Layerwise Mixed Precision Quantization for LLMs [4.655407920049974]
Mixed-precision quantization offers a compelling solution, but existing methods struggle when the average precision drops below four bits. In this paper, we propose two innovations to address these limitations. First, we frame the mixed-precision quantization problem as a cooperative game among layers and introduce Shapley-based Progressive Quantization Estimation (SPQE). Second, we propose Interaction-aware Mixed-Precision Quantization (IMPQ), which translates these Shapley estimates into a binary quadratic optimization formulation.
arXiv Detail & Related papers (2025-09-18T21:59:40Z)
- PCDVQ: Enhancing Vector Quantization for Large Language Models via Polar Coordinate Decoupling [53.91873442457923]
Vector Quantization (VQ) serves as a prevalent solution to this issue for its extremely low-bit (even 2-bit) representation and considerable accuracy. This paper proposes Polar Coordinate Decoupled Vector Quantization (PCDVQ), an effective and efficient VQ framework. Experimental results show that PCDVQ outperforms baseline methods at the 2-bit level by at least 1.5% zero-shot accuracy.
arXiv Detail & Related papers (2025-06-05T08:58:58Z)
- PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models [64.84734437930362]
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub-2-bit) quantization. We propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61 bits for the first time. Experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
arXiv Detail & Related papers (2025-02-18T08:04:58Z)
- ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals [10.860081994662645]
Post-training quantization of large language models (LLMs) holds promise for reducing the prohibitive computational cost at inference time. We propose ResQ, a PTQ method that pushes further the state of the art. We demonstrate that ResQ outperforms recent uniform and mixed-precision PTQ methods on a variety of benchmarks.
arXiv Detail & Related papers (2024-12-18T22:01:55Z)
- SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models [61.474101404805545]
Diffusion models can generate high-quality images, but as they scale, rising memory demands and higher latency pose deployment challenges. We propose SVDQuant, a new 4-bit quantization paradigm to overcome this limitation. We reduce the memory usage for the 12B FLUX.1 models by 3.5x, achieving a 3.0x speedup over the 4-bit weight-only quantization (W4A16) baseline.
arXiv Detail & Related papers (2024-11-07T18:59:58Z)
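The SVDQuant entry above describes absorbing outliers into a low-rank component before 4-bit quantization. A minimal sketch of that idea, assuming a plain truncated SVD and a symmetric INT4 grid (the rank, scaling scheme, and grid here are illustrative choices, not the paper's implementation):

```python
import numpy as np

def lowrank_plus_int4(W, rank=8):
    """Toy SVDQuant-style decomposition: W ~ L1 @ L2 + dequant(Q).
    The low-rank term absorbs the largest singular directions (where
    outliers concentrate), leaving a residual that is easier to
    quantize on a symmetric 4-bit grid."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]            # (m, r), kept in high precision
    L2 = Vt[:rank]                         # (r, n), kept in high precision
    R = W - L1 @ L2                        # residual to quantize
    scale = np.abs(R).max() / 7.0          # symmetric INT4 levels: -7..7
    scale = scale if scale > 0 else 1.0
    Q = np.clip(np.round(R / scale), -7, 7).astype(np.int8)
    return L1, L2, Q, scale

def reconstruct(L1, L2, Q, scale):
    """Dequantize the residual and add back the low-rank component."""
    return L1 @ L2 + Q.astype(np.float64) * scale
```

On a weight matrix with an injected outlier, this decomposition should reconstruct with lower error than naive INT4 rounding, since the outlier no longer dictates the quantization scale of the residual.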
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.