Astro: Activation-guided Structured Regularization for Outlier-Robust LLM Post-Training Quantization
- URL: http://arxiv.org/abs/2602.07596v1
- Date: Sat, 07 Feb 2026 15:50:18 GMT
- Title: Astro: Activation-guided Structured Regularization for Outlier-Robust LLM Post-Training Quantization
- Authors: Xi Chen, Ming Li, Junxi Li, Changsheng Li, Peisong Wang, Lizhong Ding, Ye Yuan, Guoren Wang
- Abstract summary: We propose an Activation-guided Structured Regularization framework to suppress the negative effects of outliers. Astro actively reconstructs intrinsically robust weights, aggressively suppressing weight outliers corresponding to high-magnitude activations. Astro achieves highly competitive performance; notably, on LLaMA-2-7B, it achieves better performance than complex learning-based rotation methods with almost 1/3 of the quantization time.
- Score: 56.5199302532159
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weight-only post-training quantization (PTQ) is crucial for efficient Large Language Model (LLM) deployment but suffers from accuracy degradation caused by weight and activation outliers. Existing mitigation strategies often face critical limitations: they either yield insufficient outlier suppression or incur significant deployment inefficiencies, such as inference latency, heavy preprocessing, or reliance on complex operator fusion. To resolve these limitations, we leverage a key insight: over-parameterized LLMs often converge to Flat Minima, implying a vast equivalent solution space where weights can be adjusted without compromising accuracy. Building on this, we propose Astro, an Activation-guided Structured Regularization framework designed to suppress the negative effects of outliers in a hardware-friendly and efficient manner. Leveraging the activation-guided regularization objective, Astro actively reconstructs intrinsically robust weights, aggressively suppressing weight outliers corresponding to high-magnitude activations without sacrificing model accuracy. Crucially, Astro introduces zero inference latency and is orthogonal to mainstream quantization methods like GPTQ. Extensive experiments show that Astro achieves highly competitive performance; notably, on LLaMA-2-7B, it achieves better performance than complex learning-based rotation methods with almost 1/3 of the quantization time.
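The abstract attributes accuracy loss in weight-only PTQ to outliers. A minimal sketch of why a single weight outlier hurts, assuming plain symmetric round-to-nearest quantization with one scale per weight row: this is a generic baseline, not Astro's regularization objective, and the names `quantize_rtn`, `mean_sq_error`, and the synthetic weights are illustrative assumptions rather than anything taken from the paper.

```python
import random


def quantize_rtn(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight row (one scale per row)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / qmax     # the largest weight sets the step size
    return [round(w / scale) * scale for w in weights]


def mean_sq_error(a, b):
    """Mean squared difference between two equal-length rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
# Synthetic row of small weights (assumed distribution, for illustration only).
row = [random.gauss(0.0, 0.02) for _ in range(256)]
# Same row with the last entry replaced by one large-magnitude outlier.
row_with_outlier = row[:-1] + [1.0]

err_clean = mean_sq_error(row, quantize_rtn(row))
err_outlier = mean_sq_error(row_with_outlier, quantize_rtn(row_with_outlier))

print(f"MSE without outlier: {err_clean:.3e}")
print(f"MSE with outlier:    {err_outlier:.3e}")
```

Because the outlier stretches the quantization step for the whole row, most of the small weights collapse onto very few levels, so the error with the outlier is larger than without it. Suppressing such outliers before quantization, which is the goal the abstract describes, keeps the step size small.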
Related papers
- A Replicate-and-Quantize Strategy for Plug-and-Play Load Balancing of Sparse Mixture-of-Experts LLMs [64.8510381475827]
Sparse Mixture-of-Experts (SMoE) architectures are increasingly used to scale large language models efficiently. SMoE models often suffer from severe load imbalance across experts, where a small subset of experts receives most tokens while others are underutilized. We present a systematic analysis of expert routing during inference and identify three findings: (i) load imbalance persists and worsens with larger batch sizes, (ii) selection frequency does not reliably reflect expert importance, and (iii) overall expert workload and importance can be estimated using a small calibration set.
arXiv Detail & Related papers (2026-02-23T15:11:16Z) - Revisiting Weight Regularization for Low-Rank Continual Learning [42.550292504567935]
Continual Learning (CL) with large-scale pre-trained models (PTMs) has recently gained wide attention. Task interference is typically mitigated by assigning a task-specific module during training, such as low-rank adapters. Weight regularization techniques, such as Elastic Weight Consolidation (EWC), a key strategy in CL, remain underexplored in this new paradigm.
arXiv Detail & Related papers (2026-02-19T17:13:00Z) - PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z) - RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z) - Taming Sensitive Weights: Noise Perturbation Fine-tuning for Robust LLM Quantization [5.718172547021947]
We propose Noise Perturbation Fine-tuning (NPFT) to tame the sensitive weights' impact on the quantization error. NPFT identifies outlier weights and adds random weight perturbations to the outliers as the model goes through a PEFT optimization. When applied to OPT and LLaMA models, our NPFT method achieves stable performance improvements for both uniform and non-uniform quantizers.
arXiv Detail & Related papers (2024-12-08T21:46:22Z) - Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference [54.2589824716527]
Large language models incur substantial computation and memory movement costs due to their large scale.
Existing approaches separate outliers and normal values into two matrices or migrate outliers from activations to weights, suffering from high latency or accuracy degradation.
We propose Rotated Runtime Smooth (RRS), a plug-and-play activation smoother for quantization, consisting of a Smooth operation and a Rotation operation.
The proposed method outperforms the state-of-the-art method in the LLaMA and Qwen families and improves WikiText-2 perplexity from 57.33 to 6.66 for INT4 inference.
arXiv Detail & Related papers (2024-09-30T14:59:22Z) - Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners [51.32182730502002]
We introduce Singular-value Diagonal Expansion to refine weight distributions to achieve better quantization alignment. Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-22T09:45:16Z) - BoA: Attention-aware Post-training Quantization without Backpropagation [11.096116957844014]
Post-training quantization is a promising solution for deploying large language models on resource-constrained devices. We introduce a novel backpropagation-free PTQ algorithm that optimizes quantized weights by considering inter-layer dependencies.
arXiv Detail & Related papers (2024-06-19T11:53:21Z) - Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs [27.38239289662178]
Post-Training Quantization (PTQ) enhances the efficiency of Large Language Models (LLMs).
We explore the role of calibration sets in PTQ, specifically their effect on hidden activations.
Our analysis reveals a marked contrast in quantization effectiveness across accessible models.
arXiv Detail & Related papers (2024-05-31T14:24:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.