Sliced-Wasserstein Distribution Alignment Loss Improves the Ultra-Low-Bit Quantization of Large Language Models
- URL: http://arxiv.org/abs/2601.07878v1
- Date: Sun, 11 Jan 2026 15:14:05 GMT
- Title: Sliced-Wasserstein Distribution Alignment Loss Improves the Ultra-Low-Bit Quantization of Large Language Models
- Authors: Deyu Cao, Yixin Yin, Samin Aref,
- Abstract summary: We introduce a sliced Wasserstein loss function for distribution-aware calibration in ultra-low-bit post-training quantization.<n>The proposed loss aligns the output distributions of full-precision and quantized models under random linear projections.<n>We demonstrate the performance gains of our proposed model by incorporating it with two frontier methods known as OmniQuant and TesseraQ.
- Score: 0.5964436882344729
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The benefits of most large language models come with steep and often hidden economic and environmental costs due to their resource usage inefficiency during deployment. Model quantization improves energy and memory efficiency through representing model parameters by lower-precision values. However, compression below 4-bits often distorts activation distributions and degrades performance. We address this challenge by introducing a sliced Wasserstein loss function for distribution-aware calibration in ultra-low-bit post-training quantization. The proposed loss aligns the output distributions of full-precision and quantized models under random linear projections, complementing standard mean-squared error loss without adding any computational overhead during inference. Our proposed loss function can be incorporated with any post-training quantization framework that has a retraining component. We demonstrate the performance gains of our proposed model by incorporating it with two frontier methods known as OmniQuant and TesseraQ. Compared to these two baselines, the proposed loss consistently improves both perplexity and downstream task accuracy across multiple ultra-low-bit settings. Our proposed loss function recovers 4.12-20.37% of the OmniQuant's lost accuracy on the language model LLaMA-2-7B, 0.93-7.65% on OPT-6.7B, and 2.26-6.20% on LLaMA-2-13B. TesseraQ's accuracy degradation is recovered by 3.63-7.63% in relative terms when augmented by our proposed loss function. Taken together, these results demonstrate that distributional alignment provides a simple yet effective performance boost that can push the limits of frontier quantization methods. Our method is available on GitHub to facilitate future progress in ultra-low-bit quantization.
Related papers
- BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models [56.504879072674015]
We propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients.<n>BPDQ enables serving Qwen2.5-72B on a single GTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit)
arXiv Detail & Related papers (2026-02-04T02:54:37Z) - ECO: Quantized Training without Full-Precision Master Weights [58.97082407934466]
Error-Compensating (ECO) eliminates master weights by applying updates directly to quantized parameters.<n>We show that ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate.
arXiv Detail & Related papers (2026-01-29T18:35:01Z) - CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training [73.46600457802693]
We introduce a new method that counteracts the loss induced by quantization.<n>CAGE significantly improves upon the state-of-theart methods in terms of accuracy, for similar computational cost.<n>For QAT pre-training of Llama models, CAGE matches the accuracy achieved at 4-bits (W4A4) with the prior best method.
arXiv Detail & Related papers (2025-10-21T16:33:57Z) - First-Order Error Matters: Accurate Compensation for Quantized Large Language Models [32.69069234109942]
Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs)<n>Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error.<n>We propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation.
arXiv Detail & Related papers (2025-07-15T06:18:46Z) - MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved textbfMixed textbfPrecision textbfQuantization framework for extremely low-bit textbfDiffusion textbfModels.
arXiv Detail & Related papers (2025-07-06T08:16:50Z) - Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [64.62231094774211]
Statefuls (e.g., Adam) maintain auxiliary information even 2x the model size in order to achieve optimal convergence.<n>SOLO enables Adam-styles to maintain quantized states with precision as low as 3 bits, or even 2 bits.<n>SOLO can thus be seamlessly applied to Adam-styles, leading to substantial memory savings with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z) - QUAD: Quantization and Parameter-Efficient Tuning of LLM with Activation Decomposition [21.13478769431063]
QUAD (Quantization with Activation Decomposition) is a framework leveraging Singular Value Decomposition (SVD) to suppress activation outliers for effective 4-bit quantization.<n>We show QUAD achieves 94% 96% accuracy under W4A4 quantization and 98% accuracy with W4A4/A8 and parameter-efficient fine-tuning for Llama-3 and Qwen-2.5 models.
arXiv Detail & Related papers (2025-03-25T05:03:56Z) - Minimize Quantization Output Error with Bias Compensation [35.43358597502087]
Quantization is a promising method that reduces memory usage and computational intensity of Deep Neural Networks (DNNs)
In this paper, we propose a method that improves accuracy without quantizing the output error.
We conduct experiments on Vision models and Large Language Models.
arXiv Detail & Related papers (2024-04-02T12:29:31Z) - QuantTune: Optimizing Model Quantization with Adaptive Outlier-Driven
Fine Tuning [16.50084447690437]
The study focuses on uncovering the underlying causes of these accuracy drops and proposing a quantization-friendly fine-tuning method, textbfQuantTune.
Our approach showcases significant improvements in post-training quantization across a range of Transformer-based models, including ViT, Bert-base, and OPT.
arXiv Detail & Related papers (2024-03-11T08:09:30Z) - QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning [52.157939524815866]
In this paper, we identify imbalanced activation distributions as a primary source of quantization difficulty.<n>We propose to adjust these distributions through weight finetuning to be more quantization-friendly.<n>Our method demonstrates its efficacy across three high-resolution image generation tasks.
arXiv Detail & Related papers (2024-02-06T03:39:44Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
Main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - Quantized Neural Networks for Low-Precision Accumulation with Guaranteed
Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
arXiv Detail & Related papers (2023-01-31T02:46:57Z) - SQuAT: Sharpness- and Quantization-Aware Training for BERT [43.049102196902844]
We propose sharpness- and quantization-aware training (SQuAT)
Our method can consistently outperform state-of-the-art quantized BERT models under 2, 3, and 4-bit settings by 1%.
Our experiments on empirical measurement of sharpness also suggest that our method would lead to flatter minima compared to other quantization methods.
arXiv Detail & Related papers (2022-10-13T16:52:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.