Q-Rater: Non-Convex Optimization for Post-Training Uniform Quantization
- URL: http://arxiv.org/abs/2105.01868v1
- Date: Wed, 5 May 2021 05:14:22 GMT
- Title: Q-Rater: Non-Convex Optimization for Post-Training Uniform Quantization
- Authors: Byeongwook Kim, Dongsoo Lee, Yeonju Ro, Yongkweon Jeon, Se Jung Kwon,
Baeseong Park, Daehwan Oh
- Abstract summary: Various post-training uniform quantization methods have usually been based on convex optimization.
Our proposed technique presents higher model accuracy, especially for low-bit quantization.
- Score: 9.062897838978955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Various post-training uniform quantization methods have usually been studied
based on convex optimization. As a result, most previous methods rely on
quantization error minimization and/or quadratic approximations. Such
approaches are computationally efficient and reasonable when a large number of
quantization bits are employed. When the number of quantization bits is
relatively low, however, non-convex optimization is unavoidable to improve
model accuracy. In this paper, we propose a new post-training uniform
quantization technique considering non-convexity. We empirically show that
hyper-parameters for clipping and rounding of weights and activations can be
explored by monitoring task loss. The optimal set of hyper-parameters found
for each layer is then frozen before proceeding to the next layer, enabling
incremental non-convex optimization for post-training quantization. Extensive
experiments on various models show that the proposed technique achieves higher
model accuracy, especially for low-bit quantization.
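As a rough illustration of the layer-wise procedure described above (here for weights only; the paper also covers activations), below is a minimal, hypothetical sketch: for each layer, candidate clipping scales and rounding offsets are tried, the task loss on calibration data is monitored directly, and the best setting is frozen before moving on to the next layer. The grids, the `evaluate_loss` callback, and all helper names are illustrative assumptions, not the authors' implementation.

```python
import torch

def uniform_quantize(w, clip, bits=4, round_offset=0.0):
    """Uniform quantization: clip weights to [-clip, clip], map them onto a
    (2**bits)-level integer grid with a shifted rounding boundary, de-quantize."""
    clip = float(clip)
    qmax = 2 ** (bits - 1) - 1
    scale = clip / qmax
    w_clipped = torch.clamp(w, -clip, clip)
    w_int = torch.floor(w_clipped / scale + 0.5 + round_offset)  # shifted rounding
    w_int = torch.clamp(w_int, -qmax - 1, qmax)
    return w_int * scale

def search_layer(model, layer, evaluate_loss, bits=4,
                 clip_grid=(0.5, 0.7, 0.9, 1.0), offset_grid=(-0.1, 0.0, 0.1)):
    """Grid-search clipping/rounding hyper-parameters for one layer by directly
    monitoring the task loss (a non-convex objective), then freeze the best setting."""
    w_fp = layer.weight.data.clone()
    w_max = float(w_fp.abs().max())
    best_cfg, best_loss = None, float("inf")
    for c in clip_grid:
        for off in offset_grid:
            layer.weight.data = uniform_quantize(w_fp, c * w_max, bits, off)
            loss = float(evaluate_loss(model))  # task loss on calibration data
            if loss < best_loss:
                best_cfg, best_loss = (c, off), loss
    c, off = best_cfg
    layer.weight.data = uniform_quantize(w_fp, c * w_max, bits, off)  # freeze
    return best_cfg, best_loss

def incremental_ptq(model, layers, evaluate_loss, bits=4):
    """Quantize layers one at a time: once a layer's hyper-parameters are frozen,
    they stay fixed while later layers are searched (incremental non-convex optimization)."""
    for layer in layers:
        search_layer(model, layer, evaluate_loss, bits=bits)
    return model
```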
Related papers
- QSpec: Speculative Decoding with Complementary Quantization Schemes [37.007621357142725]
Quantization has been substantially adopted to accelerate inference and reduce memory consumption of large language models.
We propose a novel quantization paradigm called QSPEC, which seamlessly integrates two complementary quantization schemes for speculative decoding.
QSPEC empirically boosts token generation throughput by up to 1.80x without any quality compromise.
arXiv Detail & Related papers (2024-10-15T05:57:51Z) - PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks [4.827161693957252]
Non-quantized elementwise operations dominate the inference cost of low-precision models.
The PikeLPN model addresses these inefficiencies by applying quantization to both elementwise and multiply-accumulate operations.
arXiv Detail & Related papers (2024-03-29T18:23:34Z) - MixQuant: Mixed Precision Quantization with a Bit-width Optimization
Search [7.564770908909927]
Quantization is a technique for creating efficient Deep Neural Networks (DNNs).
We propose MixQuant, a search algorithm that finds the optimal custom quantization bit-width for each layer weight based on roundoff error.
We show that combining MixQuant with BRECQ, a state-of-the-art quantization method, yields better quantized model accuracy than BRECQ alone.
arXiv Detail & Related papers (2023-09-29T15:49:54Z) - On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z) - PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language
Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z) - Towards Accurate Post-training Quantization for Diffusion Models [73.19871905102545]
We propose an accurate data-free post-training quantization framework of diffusion models (ADP-DM) for efficient image generation.
Our method outperforms the state-of-the-art post-training quantization of diffusion models by a sizable margin with similar computational cost.
arXiv Detail & Related papers (2023-05-30T04:00:35Z) - Ternary Quantization: A Survey [12.90416661059601]
Inference time, model size, and accuracy are critical for deploying deep neural network models.
We review the evolution of ternary quantization and investigate the relationships among existing ternary quantization methods.
arXiv Detail & Related papers (2023-03-02T03:38:51Z) - Sharp Calibrated Gaussian Processes [58.94710279601622]
State-of-the-art approaches for designing calibrated models rely on inflating the Gaussian process posterior variance.
We present a calibration approach that generates predictive quantiles using a computation inspired by the vanilla Gaussian process posterior variance.
Our approach is shown to yield a calibrated model under reasonable assumptions.
arXiv Detail & Related papers (2023-02-23T12:17:36Z) - Towards Mixed-Precision Quantization of Neural Networks via Constrained
Optimization [28.76708310896311]
We present a principled framework to solve the mixed-precision quantization problem.
We show that our method is derived in a principled way and is much more computationally efficient.
arXiv Detail & Related papers (2021-10-13T08:09:26Z) - Variational Quantum Optimization with Multi-Basis Encodings [62.72309460291971]
We introduce a new variational quantum algorithm that benefits from two innovations: multi-basis graph encodings and nonlinear activation functions.
Together, these innovations result in increased optimization performance, a two-fold increase in effective optimization landscapes, and a reduction in measurement complexity.
arXiv Detail & Related papers (2021-06-24T20:16:02Z) - Searching for Low-Bit Weights in Quantized Neural Networks [129.8319019563356]
Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators.
We propose to regard the discrete weights in an arbitrary quantized neural network as searchable variables and utilize a differential method to search for them accurately (a minimal sketch is given after this list).
arXiv Detail & Related papers (2020-09-18T09:13:26Z)