GranQ: Efficient Channel-wise Quantization via Vectorized Pre-Scaling for Zero-Shot QAT
- URL: http://arxiv.org/abs/2503.18339v6
- Date: Wed, 15 Oct 2025 04:56:20 GMT
- Title: GranQ: Efficient Channel-wise Quantization via Vectorized Pre-Scaling for Zero-Shot QAT
- Authors: Inpyo Hong, Youngwan Jo, Hyojeong Lee, Sunghyun Ahn, Kijung Lee, Sanghyun Park,
- Abstract summary: GranQ is a novel activation quantization framework that introduces an efficient pre-scaling strategy.<n>It consistently outperforms state-of-the-art ZSQ methods across CIFAR and ImageNet.<n>Our method achieves up to 5.45% higher accuracy in the 3-bit setting on CIFAR-100 and even surpasses the full-precision baseline on CIFAR-10.
- Score: 2.510925330348642
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Zero-shot quantization (ZSQ) enables neural network compression without original training data, making it a promising solution for restricted data access scenarios. To compensate for the lack of data, recent ZSQ methods typically rely on synthetic inputs generated from the full-precision model. However, these synthetic inputs often lead to activation distortion, especially under low-bit settings. To mitigate this, existing methods typically employ per-channel scaling, but they still struggle due to the severe computational overhead during the accumulation process. To overcome this critical bottleneck, we propose GranQ, a novel activation quantization framework that introduces an efficient pre-scaling strategy. Unlike conventional channel-wise methods that repeatedly perform scaling operations during accumulation, GranQ applies scaling factors in a pre-scaling step through fully vectorized computation, eliminating runtime scaling overhead. This design enables GranQ to maintain fine-grained quantization accuracy while significantly reducing computational burden, particularly in low-bit quantization settings. Extensive experiments under quantization-aware training (QAT) settings demonstrate that GranQ consistently outperforms state-of-the-art ZSQ methods across CIFAR and ImageNet. In particular, our method achieves up to 5.45% higher accuracy in the 3-bit setting on CIFAR-100 and even surpasses the full-precision baseline on CIFAR-10.
Related papers
- Quantization-Aware Regularizers for Deep Neural Networks Compression [0.061173711613792085]
We introduce per-layer regularization terms that drive weights to naturally form clusters during training.<n>This reduces the accuracy loss typically associated with quantization methods.<n> Experiments on CIFAR-10 with AlexNet and VGG16 models confirm the effectiveness of the proposed strategy.
arXiv Detail & Related papers (2026-02-03T15:07:43Z) - Enhancing Post-Training Quantization via Future Activation Awareness [84.76726857601753]
Post-training quantization (PTQ) is a widely used method to compress large language models (LLMs) without fine-tuning.<n>We propose Future-Aware Quantization (FAQ), which leverages future-layer activations to guide quantization.<n>FAQ consistently outperforms prior methods with negligible extra cost, requiring no backward passes, data reconstruction, or tuning.
arXiv Detail & Related papers (2026-01-28T12:03:30Z) - CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training [73.46600457802693]
We introduce a new method that counteracts the loss induced by quantization.<n>CAGE significantly improves upon the state-of-theart methods in terms of accuracy, for similar computational cost.<n>For QAT pre-training of Llama models, CAGE matches the accuracy achieved at 4-bits (W4A4) with the prior best method.
arXiv Detail & Related papers (2025-10-21T16:33:57Z) - DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization [29.066284789131494]
Recent post-training quantization methods overlook outliers, leading to degraded performance at low bit-widths.<n>We propose a DMQ which combines Learned Equivalent Scaling and channel-wise Power-of-Two Scaling.<n>Our method significantly outperforms existing works, especially at low bit-widths.
arXiv Detail & Related papers (2025-07-17T09:15:29Z) - FineQ: Software-Hardware Co-Design for Low-Bit Fine-Grained Mixed-Precision Quantization of LLMs [13.951330786310262]
FineQ is a software- hardware co-design for low-bit fine-grained mixed-precision quantization of large language models.<n>It partitions the weights into finer-grained clusters and considers the distribution of outliers within these clusters.<n>It achieves higher model accuracy compared to the SOTA mixed-precision quantization algorithm at a close average bit-width.
arXiv Detail & Related papers (2025-04-28T12:47:23Z) - Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization [0.0]
Post-training quantization has emerged as a widely used technique for compressing large language models (LLMs) without retraining.
The accumulation of quantization errors across layers significantly degrades performance, particularly in low-bit regimes.
We propose Quantization Error propagation (QEP), a lightweight and general framework that enhances layer-wise PTQ by explicitly propagating the quantization error.
arXiv Detail & Related papers (2025-04-13T15:56:00Z) - PCGS: Progressive Compression of 3D Gaussian Splatting [55.149325473447384]
We propose PCGS (Progressive Compression of 3D Gaussian Splatting), which adaptively controls both the quantity and quality of Gaussians.<n>Overall, PCGS achieves progressivity while maintaining compression performance comparable to SoTA non-progressive methods.
arXiv Detail & Related papers (2025-03-11T15:01:11Z) - PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models [64.84734437930362]
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization.<n>We propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time.<n>Experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
arXiv Detail & Related papers (2025-02-18T08:04:58Z) - Advanced Knowledge Transfer: Refined Feature Distillation for Zero-Shot Quantization in Edge Computing [1.8067835669244101]
AKT (Advanced Knowledge Transfer) is a novel method to enhance the training ability of low-bit quantized (Q) models.<n>Our method addresses the fundamental gradient exploding problem in low-bit Q models.
arXiv Detail & Related papers (2024-12-26T08:52:27Z) - GAQAT: gradient-adaptive quantization-aware training for domain generalization [54.31450550793485]
We propose a novel Gradient-Adaptive Quantization-Aware Training (GAQAT) framework for DG.<n>Our approach begins by identifying the scale-gradient conflict problem in low-precision quantization.<n>Extensive experiments validate the effectiveness of the proposed GAQAT framework.
arXiv Detail & Related papers (2024-12-07T06:07:21Z) - Leveraging Pre-Trained Neural Networks to Enhance Machine Learning with Variational Quantum Circuits [48.33631905972908]
We introduce an innovative approach that utilizes pre-trained neural networks to enhance Variational Quantum Circuits (VQC)
This technique effectively separates approximation error from qubit count and removes the need for restrictive conditions.
Our results extend to applications such as human genome analysis, demonstrating the broad applicability of our approach.
arXiv Detail & Related papers (2024-11-13T12:03:39Z) - QT-DoG: Quantization-aware Training for Domain Generalization [58.439816306817306]
We propose Quantization-aware Training for Domain Generalization (QT-DoG)
QT-DoG exploits quantization as an implicit regularizer by inducing noise in model weights.
We demonstrate that QT-DoG generalizes across various datasets, architectures, and quantization algorithms.
arXiv Detail & Related papers (2024-10-08T13:21:48Z) - Constraint Guided Model Quantization of Neural Networks [0.0]
Constraint Guided Model Quantization (CGMQ) is a quantization aware training algorithm that uses an upper bound on the computational resources and reduces the bit-widths of the parameters of the neural network.
It is shown on MNIST that the performance of CGMQ is competitive with state-of-the-art quantization aware training algorithms.
arXiv Detail & Related papers (2024-09-30T09:41:16Z) - Towards Scalable Quantum Key Distribution: A Machine Learning-Based Cascade Protocol Approach [2.363573186878154]
Quantum Key Distribution (QKD) is a pivotal technology in the quest for secure communication.
Traditional key rate determination methods, dependent on complex mathematical models, often fall short in efficiency and scalability.
We propose an approach that involves integrating machine learning (ML) techniques with the Cascade error correction protocol.
arXiv Detail & Related papers (2024-09-12T13:40:08Z) - Quantum-Train: Rethinking Hybrid Quantum-Classical Machine Learning in the Model Compression Perspective [7.7063925534143705]
We introduce the Quantum-Train(QT) framework, a novel approach that integrates quantum computing with machine learning algorithms.
QT achieves remarkable results by employing a quantum neural network alongside a classical mapping model.
arXiv Detail & Related papers (2024-05-18T14:35:57Z) - Gradient-based Automatic Mixed Precision Quantization for Neural Networks On-Chip [0.9187138676564589]
We present High Granularity Quantization (HGQ), an innovative quantization-aware training method.
HGQ fine-tune the per-weight and per-activation precision by making them optimizable through gradient descent.
This approach enables ultra-low latency and low power neural networks on hardware capable of performing arithmetic operations.
arXiv Detail & Related papers (2024-05-01T17:18:46Z) - Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization [62.15918574997175]
It is known that language models contain outlier channels whose values on average are orders of magnitude higher than other channels.
We propose a strategy which regularizes a layer's inputs via quantization-aware training (QAT) and its outputs via activation kurtosis regularization.
We show that regularizing both the inputs and outputs is crucial for preventing a model's "migrating" the difficulty in input quantization to the weights.
arXiv Detail & Related papers (2024-04-04T17:25:30Z) - Zero-Shot Sharpness-Aware Quantization for Pre-trained Language Models [88.80146574509195]
Quantization is a promising approach for reducing memory overhead and accelerating inference.
We propose a novel-aware quantization (ZSAQ) framework for the zero-shot quantization of various PLMs.
arXiv Detail & Related papers (2023-10-20T07:09:56Z) - Calibrating the role of entanglement in variational quantum circuits [0.6435156676256051]
Entanglement is a key property of quantum computing that separates it from its classical counterpart.
We systematically probe the role of entanglement in the working of two variational quantum algorithms.
We find that for the MAX-CUT problem solved using QAOA, the fidelity as a function of entanglement is highly dependent on the number of layers.
In the case of QNNs, trained circuits with high test accuracies are underpinned by higher entanglement, with any enforced limitation in entanglement resulting in a sharp decline in test accuracy.
arXiv Detail & Related papers (2023-10-16T23:36:40Z) - Norm Tweaking: High-performance Low-bit Quantization of Large Language
Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision.
Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations.
Our simple and effective approach makes it more practical for real-world applications.
arXiv Detail & Related papers (2023-09-06T06:51:15Z) - Scaling Limits of Quantum Repeater Networks [62.75241407271626]
Quantum networks (QNs) are a promising platform for secure communications, enhanced sensing, and efficient distributed quantum computing.
Due to the fragile nature of quantum states, these networks face significant challenges in terms of scalability.
In this paper, the scaling limits of quantum repeater networks (QRNs) are analyzed.
arXiv Detail & Related papers (2023-05-15T14:57:01Z) - Accelerating the training of single-layer binary neural networks using
the HHL quantum algorithm [58.720142291102135]
We show that useful information can be extracted from the quantum-mechanical implementation of Harrow-Hassidim-Lloyd (HHL)
This paper shows, however, that useful information can be extracted from the quantum-mechanical implementation of HHL, and used to reduce the complexity of finding the solution on the classical side.
arXiv Detail & Related papers (2022-10-23T11:58:05Z) - Synergy Between Quantum Circuits and Tensor Networks: Short-cutting the
Race to Practical Quantum Advantage [43.3054117987806]
We introduce a scalable procedure for harnessing classical computing resources to provide pre-optimized initializations for quantum circuits.
We show this method significantly improves the trainability and performance of PQCs on a variety of problems.
By demonstrating a means of boosting limited quantum resources using classical computers, our approach illustrates the promise of this synergy between quantum and quantum-inspired models in quantum computing.
arXiv Detail & Related papers (2022-08-29T15:24:03Z) - Optimizing Tensor Network Contraction Using Reinforcement Learning [86.05566365115729]
We propose a Reinforcement Learning (RL) approach combined with Graph Neural Networks (GNN) to address the contraction ordering problem.
The problem is extremely challenging due to the huge search space, the heavy-tailed reward distribution, and the challenging credit assignment.
We show how a carefully implemented RL-agent that uses a GNN as the basic policy construct can address these challenges.
arXiv Detail & Related papers (2022-04-18T21:45:13Z) - BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network
Quantization [32.770842274996774]
Mixed-precision quantization can potentially achieve the optimal tradeoff between performance and compression rate of deep neural networks.
Previous methods either examine only a small manually-designed search space or utilize a cumbersome neural architecture search to explore the vast search space.
This work proposes bit-level sparsity quantization (BSQ) to tackle the mixed-precision quantization from a new angle of inducing bit-level sparsity.
arXiv Detail & Related papers (2021-02-20T22:37:41Z) - DAQ: Distribution-Aware Quantization for Deep Image Super-Resolution
Networks [49.191062785007006]
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational costs.
Existing works either suffer from a severe performance drop in ultra-low precision of 4 or lower bit-widths, or require a heavy fine-tuning process to recover the performance.
We propose a novel distribution-aware quantization scheme (DAQ) which facilitates accurate training-free quantization in ultra-low precision.
arXiv Detail & Related papers (2020-12-21T10:19:42Z) - A Statistical Framework for Low-bitwidth Training of Deep Neural
Networks [70.77754244060384]
Fully quantized training (FQT) uses low-bitwidth hardware by quantizing the activations, weights, and gradients of a neural network model.
One major challenge with FQT is the lack of theoretical understanding, in particular of how gradient quantization impacts convergence properties.
arXiv Detail & Related papers (2020-10-27T13:57:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.