Enhancing Post-Training Quantization via Future Activation Awareness
- URL: http://arxiv.org/abs/2602.02538v1
- Date: Wed, 28 Jan 2026 12:03:30 GMT
- Title: Enhancing Post-Training Quantization via Future Activation Awareness
- Authors: Zheqi Lv, Zhenxuan Fan, Qi Tian, Wenqiao Zhang, Yueting Zhuang
- Abstract summary: Post-training quantization (PTQ) is a widely used method to compress large language models (LLMs) without fine-tuning. We propose Future-Aware Quantization (FAQ), which leverages future-layer activations to guide quantization. FAQ consistently outperforms prior methods with negligible extra cost, requiring no backward passes, data reconstruction, or tuning.
- Score: 84.76726857601753
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Post-training quantization (PTQ) is a widely used method to compress large language models (LLMs) without fine-tuning. It typically sets quantization hyperparameters (e.g., scaling factors) based on current-layer activations. Although this method is efficient, it suffers from quantization bias and error accumulation, resulting in suboptimal and unstable quantization, especially when the calibration data is biased. To overcome these issues, we propose Future-Aware Quantization (FAQ), which leverages future-layer activations to guide quantization. This allows better identification and preservation of important weights, while reducing sensitivity to calibration noise. We further introduce a window-wise preview mechanism to softly aggregate multiple future-layer activations, mitigating over-reliance on any single layer. To avoid expensive greedy search, we use a pre-searched configuration to minimize overhead. Experiments show that FAQ consistently outperforms prior methods with negligible extra cost, requiring no backward passes, data reconstruction, or tuning, making it well-suited for edge deployment.
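To make the idea concrete, here is a minimal sketch of future-aware scale selection, assuming mean-absolute-activation channel importance, fixed soft window weights, and AWQ-style reweighting of the weight matrix; the function and parameter names (`future_aware_scales`, `window`, `channel_importance`) are illustrative, not the paper's API.

```python
# Minimal sketch of future-aware scale selection for weight-only PTQ.
# Assumptions (not from the paper): all layers share the model's hidden
# size, channel importance = mean |activation|, and the preview window
# is aggregated with fixed soft weights.
import numpy as np

def channel_importance(acts: np.ndarray) -> np.ndarray:
    """Mean absolute activation per channel: (tokens, hidden) -> (hidden,)."""
    return np.abs(acts).mean(axis=0)

def future_aware_scales(weight, layer_acts, window, n_bits=4):
    """Per-output-row scales for `weight` (out, in), guided by a soft
    aggregate of activations from the current layer and the following
    (future) layers in `layer_acts`."""
    window = np.asarray(window, dtype=np.float64)
    window /= window.sum()                               # soft aggregation weights
    importance = sum(w * channel_importance(a)           # (in,) importance vector
                     for w, a in zip(window, layer_acts))
    # Reweight columns by importance so rounding error on channels the
    # future layers rely on dominates the scale choice (AWQ-style, assumed).
    effective = np.abs(weight) * importance[None, :]
    qmax = 2 ** (n_bits - 1) - 1
    return effective.max(axis=1) / qmax                  # one scale per output row

# Toy usage: current layer plus two future layers, current weighted highest.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
acts = [rng.normal(size=(32, 16)) for _ in range(3)]
scales = future_aware_scales(w, acts, window=[0.5, 0.3, 0.2])
q = np.clip(np.round(w / scales[:, None]), -8, 7)        # 4-bit signed integers
w_hat = q * scales[:, None]                              # dequantized weights
```

The only change from standard current-layer calibration is that `layer_acts` spans a softly weighted window of future layers, which is what makes the scale choice less sensitive to noise in any single layer's calibration statistics.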
Related papers
- End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost [53.25965863436039]
Quantization-aware training (QAT) provides a more principled solution, but its reliance on backpropagation incurs prohibitive memory costs.
We propose ZeroQAT, a zeroth-order optimization-based QAT framework that supports both weight and activation quantization; a sketch of the zeroth-order idea follows this entry.
Experiments show that ZeroQAT consistently outperforms representative PTQ and QAT baselines while requiring significantly less memory.
arXiv Detail & Related papers (2025-08-21T01:18:27Z)
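The memory argument is easy to see in a sketch: a zeroth-order method estimates gradients from forward passes alone, so no activations need to be stored for backpropagation. The SPSA-style update below is a generic illustration under assumed names (`spsa_step`, `loss_fn`), not the actual ZeroQAT algorithm.

```python
# Generic zeroth-order (SPSA-style) update that uses only forward passes,
# illustrating why QAT without backprop saves activation memory. This is
# not the actual ZeroQAT algorithm; `loss_fn` stands in for a forward
# pass of the quantized model on a calibration batch.
import numpy as np

def spsa_step(params, loss_fn, lr=5e-3, mu=1e-2, rng=np.random.default_rng(0)):
    """Estimate the gradient from two forward evaluations along a random
    direction, then take an SGD step. No backward graph is ever built."""
    z = rng.standard_normal(params.shape)                # random probe direction
    g = (loss_fn(params + mu * z) - loss_fn(params - mu * z)) / (2 * mu)
    return params - lr * g * z                           # directional gradient estimate

# Toy usage: a quadratic stands in for the QAT training loss.
target = np.ones(4)
loss = lambda p: float(np.sum((p - target) ** 2))
p = np.zeros(4)
for _ in range(1500):
    p = spsa_step(p, loss)                               # approaches `target`
```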
- MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z)
- GranQ: Efficient Channel-wise Quantization via Vectorized Pre-Scaling for Zero-Shot QAT [2.510925330348642]
GranQ is a novel activation quantization framework that introduces an efficient pre-scaling strategy; a sketch of the channel-wise pattern follows this entry.
It consistently outperforms state-of-the-art ZSQ methods across CIFAR and ImageNet.
Our method achieves up to 5.45% higher accuracy in the 3-bit setting on CIFAR-100 and even surpasses the full-precision baseline on CIFAR-10.
arXiv Detail & Related papers (2025-03-24T04:44:21Z)
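As a hedged guess at what "channel-wise quantization via vectorized pre-scaling" looks like in practice, the sketch below computes all per-channel scales in a single vectorized reduction and applies them as one broadcasted multiply before rounding; names are illustrative and this is not GranQ's implementation.

```python
# Sketch of channel-wise quantization with vectorized pre-scaling: every
# channel's scale comes from one vectorized reduction and is applied as a
# single broadcasted multiply before rounding (no per-channel Python loop).
# Names are illustrative; this is not GranQ's implementation.
import numpy as np

def quantize_per_channel(x, n_bits=4, axis=-1):
    """Uniform symmetric quantization with one scale per channel along `axis`."""
    qmax = 2 ** (n_bits - 1) - 1
    reduce_axes = tuple(i for i in range(x.ndim) if i != axis % x.ndim)
    scale = np.abs(x).max(axis=reduce_axes, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)             # guard all-zero channels
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

# Toy usage on a (batch, channels) activation tensor.
x = np.random.default_rng(1).normal(size=(64, 128))
q, s = quantize_per_channel(x, n_bits=4, axis=1)
x_hat = q * s                                            # dequantized activations
```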
- RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE).
RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers; the two ingredients are sketched after this entry.
Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
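The two ingredients can be sketched generically: a straight-through estimator passes gradients through the rounding operation as if it were the identity, and an orthogonal rotation applied before quantization spreads outliers across coordinates. The sketch uses a fixed random rotation; RoSTE's adaptive rotation selection and full QA-SFT recipe are not reproduced.

```python
# Sketch of a straight-through estimator (STE) plus a fixed orthogonal
# rotation before quantization. RoSTE's adaptive rotation selection and
# QA-SFT training recipe are not reproduced; this shows only the two
# generic building blocks, using PyTorch autograd.
import torch

def ste_quantize(w, n_bits=4):
    """Forward: uniform symmetric quantization. Backward: identity (STE)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (q - w).detach()               # value of q, gradient of w

d = 16
rot, _ = torch.linalg.qr(torch.randn(d, d))   # random orthogonal rotation

w = torch.randn(8, d, requires_grad=True)
w_q = ste_quantize(w @ rot) @ rot.T           # rotate, quantize, rotate back
loss = (w_q ** 2).sum()                       # stand-in for a task loss
loss.backward()                               # gradients flow through rounding
```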
- Towards Accurate Post-Training Quantization of Vision Transformers via Error Reduction [48.740630807085566]
Post-training quantization (PTQ) for vision transformers (ViTs) has received increasing attention from both academic and industrial communities.
Current methods fail to account for the complex interactions between quantized weights and activations, resulting in significant quantization errors and suboptimal performance.
This paper presents ERQ, an innovative two-step PTQ method specifically crafted to reduce quantization errors arising from activation and weight quantization sequentially.
arXiv Detail & Related papers (2024-07-09T12:06:03Z)
- BoA: Attention-aware Post-training Quantization without Backpropagation [11.096116957844014]
Post-training quantization is a promising solution for deploying large language models on resource-constrained devices.
We introduce a novel backpropagation-free PTQ algorithm that optimizes quantized weights by considering inter-layer dependencies.
arXiv Detail & Related papers (2024-06-19T11:53:21Z)
- PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error; a sketch of the outlier-aware idea follows this entry.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z)
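One common reading of "outlier-aware" (and only a guess at PreQuant's actual design) is to keep a small fraction of large-magnitude weights in full precision, where fine-tuning can adjust them, while the bulk is frozen in low-bit form. All names below are invented for illustration.

```python
# Hypothetical sketch of outlier-aware quantization: weights above a
# magnitude percentile stay in full precision (where fine-tuning could
# adjust them), the rest are frozen as low-bit integers. A common pattern,
# not necessarily PreQuant's exact mechanism; all names are invented.
import numpy as np

def split_outliers(w, pct=99.5, n_bits=4):
    """Return (low-bit dense part, full-precision sparse outliers, scale)."""
    mask = np.abs(w) > np.percentile(np.abs(w), pct)
    dense = np.where(mask, 0.0, w)                       # bulk weights to quantize
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(dense).max() / qmax
    q = np.clip(np.round(dense / scale), -qmax - 1, qmax).astype(np.int8)
    return q, np.where(mask, w, 0.0), scale

w = np.random.default_rng(2).normal(size=(256, 256))
q, outliers, scale = split_outliers(w)
w_hat = q * scale + outliers                             # reconstruction
```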
- Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons; a toy version is sketched after this entry.
We experimentally validate our method on various benchmark datasets and network architectures.
arXiv Detail & Related papers (2021-09-05T15:15:07Z)
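A plausible toy reading of DropBits, sketched below under invented names: on each training step, quantize the weights at a randomly reduced bit-width, so precision itself is what gets "dropped" rather than neurons. This is an interpretation of the abstract, not the paper's formulation.

```python
# Toy reading of a DropBits-style regularizer: each training step quantizes
# the weights at a randomly reduced bit-width, so precision (bits) is what
# gets dropped rather than neurons. An interpretation of the abstract only.
import numpy as np

def drop_bits(w, max_bits=8, drop_prob=0.25, rng=np.random.default_rng(3)):
    """Quantize `w` at max_bits minus a binomial number of dropped bits."""
    n_bits = max_bits - rng.binomial(max_bits - 2, drop_prob)  # stays >= 2
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

w = np.random.default_rng(4).normal(size=(4, 4))
samples = [drop_bits(w) for _ in range(3)]   # a different precision each call
```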
- Q-Rater: Non-Convex Optimization for Post-Training Uniform Quantization [9.062897838978955]
Various post-training uniform quantization methods have typically been based on convex optimization.
Our proposed technique achieves higher model accuracy, especially at low bit-widths.
arXiv Detail & Related papers (2021-05-05T05:14:22Z)