QFT: Post-training quantization via fast joint finetuning of all degrees of freedom
- URL: http://arxiv.org/abs/2212.02634v1
- Date: Mon, 5 Dec 2022 22:38:58 GMT
- Authors: Alex Finkelstein, Ella Fuchs, Idan Tal, Mark Grobman, Niv Vosco, Eldad Meller
- Abstract summary: We rethink quantized network parameterization in a HW-aware fashion, towards a unified analysis of all quantization DoF.
Our single-step, simple, and extendable method, dubbed quantization-aware finetuning (QFT), achieves 4-bit weight quantization results on par with SoTA.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The post-training quantization (PTQ) challenge of bringing quantized neural
net accuracy close to the original has drawn much attention, driven by industry
demand. Many methods emphasize the optimization of a specific
degree of freedom (DoF), such as quantization step size, preconditioning
factors, or bias fixing, often chained to others in multi-step solutions. Here we
rethink quantized network parameterization in a HW-aware fashion, towards a
unified analysis of all quantization DoF, permitting, for the first time, their
joint end-to-end finetuning. Our single-step, simple, and extendable method,
dubbed quantization-aware finetuning (QFT), achieves 4-bit weight quantization
results on par with SoTA within PTQ constraints of speed and resources.
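The idea of finetuning several quantization DoF jointly, rather than in chained steps, can be illustrated on a toy problem. This is a minimal sketch under stated assumptions, not the QFT algorithm: it assumes a single linear neuron with symmetric per-tensor weight quantization, and updates the weights (via a straight-through estimator), the step size (via an LSQ-style gradient, without LSQ's gradient scaling) and the bias together by plain gradient descent on the output MSE against the full-precision neuron. The names `fake_quant`, `layer_mse` and `joint_finetune` are hypothetical; QFT's HW-aware parameterization, preconditioning factors and activation quantization are not modeled.

```python
import math
from typing import List, Sequence, Tuple


def fake_quant(w: float, step: float, bits: int) -> Tuple[float, float, float]:
    """Signed uniform fake quantization of a single weight (toy assumption).

    Returns (q, dq/dw, dq/dstep): the straight-through estimator (STE) for
    the weight and an LSQ-style gradient for the step size.
    """
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    v = w / step
    if v < qmin:  # clipped below: no gradient reaches w
        return qmin * step, 0.0, float(qmin)
    if v > qmax:  # clipped above: no gradient reaches w
        return qmax * step, 0.0, float(qmax)
    n = math.floor(v + 0.5)  # round half up
    return n * step, 1.0, n - v


def layer_mse(w: Sequence[float], step: float, b: float, bits: int,
              xs: Sequence[Sequence[float]], ys: Sequence[float]) -> float:
    """MSE between the fake-quantized neuron and the teacher outputs ys."""
    total = 0.0
    for x, y in zip(xs, ys):
        out = sum(fake_quant(wi, step, bits)[0] * xi for wi, xi in zip(w, x)) + b
        total += (out - y) ** 2
    return total / len(xs)


def joint_finetune(w_fp: Sequence[float], b_fp: float,
                   xs: Sequence[Sequence[float]], bits: int = 4, iters: int = 200,
                   lr_w: float = 1e-2, lr_s: float = 1e-3, lr_b: float = 1e-2
                   ) -> Tuple[List[float], float, float, List[float]]:
    """Update weights, step size and bias together by gradient descent on the
    MSE to the full-precision neuron. Returns (w, step, b, teacher_ys)."""
    ys = [sum(wi * xi for wi, xi in zip(w_fp, x)) + b_fp for x in xs]
    w, b = list(w_fp), b_fp
    qmax = 2 ** (bits - 1) - 1
    step = max(max(abs(wi) for wi in w_fp) / qmax, 1e-8)  # max-abs init
    for _ in range(iters):
        quant = [fake_quant(wi, step, bits) for wi in w]
        gw, gs, gb = [0.0] * len(w), 0.0, 0.0
        for x, y in zip(xs, ys):
            err = sum(q * xi for (q, _, _), xi in zip(quant, x)) + b - y
            for i, ((_, dq_dw, dq_ds), xi) in enumerate(zip(quant, x)):
                gw[i] += 2 * err * xi * dq_dw  # d(err^2)/dw via STE
                gs += 2 * err * xi * dq_ds     # d(err^2)/dstep, LSQ-style
            gb += 2 * err                      # d(err^2)/db
        m = len(xs)
        w = [wi - lr_w * g / m for wi, g in zip(w, gw)]
        step = max(step - lr_s * gs / m, 1e-8)  # keep the step positive
        b -= lr_b * gb / m
    return w, step, b, ys
```

Setting `lr_w = lr_s = 0` reduces the loop to bias correction alone: with one calibration sample and `lr_b = 0.5`, a single step cancels the quantization-induced output offset exactly. Giving all three learning rates nonzero values updates every DoF in the same loop instead of in separate stages.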
Related papers
- PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization.
We propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time.
Experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
(arXiv, 2025-02-18T08:04:58Z)
- QSpec: Speculative Decoding with Complementary Quantization Schemes
Quantization has been substantially adopted to accelerate inference and reduce memory consumption of large language models.
We propose a novel quantization paradigm called QSPEC, which seamlessly integrates two complementary quantization schemes for speculative decoding.
(arXiv, 2024-10-15T05:57:51Z)
- MRQ: Support Multiple Quantization Schemes through Model Re-Quantization
Deep learning models cannot be easily quantized for diverse fixed-point hardware.
A new type of model quantization approach, called model re-quantization, is proposed.
Models obtained from the re-quantization process have been successfully deployed on the NNA in Echo Show devices.
(arXiv, 2023-08-01T08:15:30Z)
- Quantized Feature Distillation for Network Quantization
Methods adopting the quantization-aware training (QAT) paradigm have recently grown rapidly, but are often conceptually complicated.
This paper proposes a novel and highly effective QAT method, quantized feature distillation (QFD).
QFD first trains a quantized (or binarized) representation as the teacher, then quantizes the network using knowledge distillation (KD).
(arXiv, 2023-07-20T07:08:24Z)
- PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
(arXiv, 2023-05-30T08:41:33Z)
- Distribution-Flexible Subset Quantization for Post-Quantizing Super-Resolution Networks
This paper introduces Distribution-Flexible Subset Quantization (DFSQ), a post-training quantization method for super-resolution networks.
DFSQ conducts channel-wise normalization of the activations and applies distribution-flexible subset quantization (SQ).
It achieves comparable performance to full-precision counterparts on 6- and 8-bit quantization, and incurs only a 0.1 dB PSNR drop on 4-bit quantization.
(arXiv, 2023-05-10T04:19:11Z)
- A self-consistent field approach for the variational quantum eigensolver: orbital optimization goes adaptive
We present a self-consistent field (SCF) approach within the Adaptive Derivative-Assembled Problem-Assembled Ansatz Variational Eigensolver (ADAPT-VQE).
This framework is used for efficient quantum simulations of chemical systems on near-term quantum computers.
(arXiv, 2022-12-21T23:15:17Z)
- NIPQ: Noise proxy-based Integrated Pseudo-Quantization
The straight-through estimator (STE) incurs unstable convergence during quantization-aware training (QAT).
We propose a novel noise proxy-based integrated pseudo-quantization (NIPQ) that enables unified support of pseudo-quantization for both activations and weights.
NIPQ outperforms existing quantization algorithms in various vision and language applications by a large margin.
(arXiv, 2022-06-02T01:17:40Z)
- A Statistical Framework for Low-bitwidth Training of Deep Neural Networks
Fully quantized training (FQT) uses low-bitwidth hardware by quantizing the activations, weights, and gradients of a neural network model.
One major challenge with FQT is the lack of theoretical understanding, in particular of how gradient quantization impacts convergence properties.
(arXiv, 2020-10-27T13:57:33Z)
- Gradient $\ell_1$ Regularization for Quantization Robustness
We derive a simple regularization scheme that improves robustness against post-training quantization.
By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on-demand to different bit-widths.
(arXiv, 2020-02-18T12:31:34Z)
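Why penalizing the gradient's $\ell_1$ norm helps robustness to post-training quantization can be seen with a first-order argument. This is an illustrative derivation under stated assumptions, not necessarily the exact formulation of the gradient $\ell_1$ regularization paper listed above: it assumes round-to-nearest weight quantization with step size $\Delta$ and no clipping, so every weight moves by at most $\Delta/2$, and $\lambda$ is a hypothetical regularization weight.

```latex
\begin{aligned}
\mathcal{L}(w+\delta) &\approx \mathcal{L}(w) + \nabla_w \mathcal{L}(w)^\top \delta,
  \qquad \|\delta\|_\infty \le \tfrac{\Delta}{2} \\
\bigl|\nabla_w \mathcal{L}(w)^\top \delta\bigr|
  &\le \|\nabla_w \mathcal{L}(w)\|_1 \,\|\delta\|_\infty
  \le \tfrac{\Delta}{2}\,\|\nabla_w \mathcal{L}(w)\|_1 \\
\text{objective:}\quad & \min_w \; \mathcal{L}(w) + \lambda\,\|\nabla_w \mathcal{L}(w)\|_1
\end{aligned}
```

The second line is Hölder's inequality. Because the bound holds for every $\Delta$, a small $\|\nabla_w \mathcal{L}\|_1$ limits the first-order loss increase at all bit-widths at once, which is consistent with storing a single set of weights that is quantized on demand to different bit-widths.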