Towards a tailored mixed-precision sub-8-bit quantization scheme for Gated Recurrent Units using Genetic Algorithms
- URL: http://arxiv.org/abs/2402.12263v2
- Date: Fri, 8 Mar 2024 21:16:13 GMT
- Title: Towards a tailored mixed-precision sub-8-bit quantization scheme for Gated Recurrent Units using Genetic Algorithms
- Authors: Riccardo Miccini, Alessandro Cerioli, Clément Laroche, Tobias Piechowiak, Jens Sparsø, Luca Pezzarossa
- Abstract summary: Quantization schemes for Gated Recurrent Units (GRU) are difficult to tune due to their dependence on an internal state.
We propose a modular integer quantization scheme for GRUs where the bit width of each operator can be selected independently.
- Score: 39.979007027634196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the recent advances in model compression techniques for deep neural
networks, deploying such models on ultra-low-power embedded devices still
proves challenging. In particular, quantization schemes for Gated Recurrent
Units (GRU) are difficult to tune due to their dependence on an internal state,
preventing them from fully benefiting from sub-8-bit quantization. In this work,
we propose a modular integer quantization scheme for GRUs where the bit width
of each operator can be selected independently. We then employ Genetic
Algorithms (GA) to explore the vast search space of possible bit widths,
simultaneously optimising for model size and accuracy. We evaluate our methods
on four different sequential tasks and demonstrate that mixed-precision
solutions exceed homogeneous-precision ones in terms of Pareto efficiency. In
our results, we achieve a model size reduction between 25% and 55% while
maintaining an accuracy comparable with the 8-bit homogeneous equivalent.
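As a concrete illustration of the search the abstract describes, here is a minimal Python sketch of a genetic algorithm over per-operator bit widths with Pareto selection on (model size, accuracy). The operator list, fitness stand-ins, and GA settings are illustrative assumptions, not the authors' actual configuration.

```python
import random

# Hypothetical set of independently quantizable GRU operators; the paper's
# actual operator granularity may differ.
OPERATORS = ["W_z", "W_r", "W_h", "U_z", "U_r", "U_h", "act_z", "act_r", "act_h"]
BIT_CHOICES = [2, 3, 4, 5, 6, 7, 8]  # sub-8-bit width options per operator

def model_size(bits):
    # Stand-in for the real size model: just the sum of bit widths.
    return sum(bits)

def accuracy(bits):
    # Placeholder: a real implementation would evaluate the quantized GRU
    # on the task's validation set.
    return 1.0 - 0.02 * sum(8 - b for b in bits) * random.uniform(0.8, 1.2)

def dominates(a, b):
    # a dominates b: no larger size, no lower accuracy, strictly better in one.
    return a[0] <= b[0] and a[1] >= b[1] and (a[0] < b[0] or a[1] > b[1])

def pareto_front(pop):
    scored = [(model_size(p), accuracy(p), p) for p in pop]
    return [s for s in scored
            if not any(dominates(t[:2], s[:2]) for t in scored)]

def evolve(generations=50, pop_size=40, p_mut=0.1):
    pop = [[random.choice(BIT_CHOICES) for _ in OPERATORS] for _ in range(pop_size)]
    for _ in range(generations):
        parents = [p for _, _, p in pareto_front(pop)]
        if len(parents) < 2:
            parents = pop
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(OPERATORS))   # one-point crossover
            child = [random.choice(BIT_CHOICES) if random.random() < p_mut else g
                     for g in a[:cut] + b[cut:]]        # per-gene mutation
            children.append(child)
        pop = children
    return pareto_front(pop)

for size, acc, bits in sorted(evolve()):
    print(f"size={size:3d} bits  acc={acc:.3f}  config={bits}")
```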
Related papers
- Toward Capturing Genetic Epistasis From Multivariate Genome-Wide Association Studies Using Mixed-Precision Kernel Ridge Regression [4.356528958652799]
We boost the performance of output-accuracy-preserving, mixed-precision computation for Genome-Wide Association Studies (GWAS) of 305K patients from the UK BioBank.
Tile-centric, adaptive-precision linear algebra techniques, motivated by reducing data motion, gain enhanced significance with low-precision GPU arithmetic.
We deploy a new four-precision Cholesky-based solver which, at 1.805 mixed-precision ExaOp/s on a nearly full Alps system, outperforms the state-of-the-art CPU-only REGENIE GWAS software by five orders of magnitude.
arXiv Detail & Related papers (2024-09-03T08:50:42Z)
- Free Bits: Latency Optimization of Mixed-Precision Quantized Neural Networks on the Edge [17.277918711842457]
Mixed-precision quantization offers the opportunity to optimize the trade-offs between model size, latency, and statistical accuracy.
This paper proposes a hybrid search methodology to navigate the search space of mixed-precision configurations for a given network.
It consists of a hardware-agnostic differentiable search algorithm followed by a hardware-aware optimization to find mixed-precision configurations latency-optimized for a specific hardware target.
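A hedged sketch of what the second, hardware-aware stage could look like: starting from the configuration found by the differentiable search, greedily lower per-layer precision using a measured latency table until a latency target is met. The layer names, latency numbers, and greedy rule below are hypothetical, not the paper's algorithm.

```python
# Hypothetical per-(layer, bits) latencies, e.g. measured on the target device.
LATENCY_US = {("conv1", 8): 120, ("conv1", 4): 70,
              ("conv2", 8): 200, ("conv2", 4): 110,
              ("fc", 8): 60, ("fc", 4): 40}

def total_latency(config):
    return sum(LATENCY_US[(layer, bits)] for layer, bits in config.items())

def refine(config, acc_drop, target_us):
    """Greedily demote 8-bit layers to 4 bits, cheapest accuracy cost per
    saved microsecond first, until the latency target is met."""
    while total_latency(config) > target_us:
        candidates = []
        for layer, bits in config.items():
            if bits == 8:  # only one demotion step (8 -> 4) in this toy
                saved = LATENCY_US[(layer, 8)] - LATENCY_US[(layer, 4)]
                cost = acc_drop[(layer, 4)] - acc_drop[(layer, 8)]
                candidates.append((cost / max(saved, 1), layer))
        if not candidates:
            break  # nothing left to demote; target unreachable
        config[min(candidates)[1]] = 4
    return config

# Start from the differentiable-search result (here: everything at 8 bits).
config = {"conv1": 8, "conv2": 8, "fc": 8}
drops = {(l, b): (0.0 if b == 8 else 0.01) for l in config for b in (4, 8)}
print(refine(config, drops, target_us=250))  # {'conv1': 4, 'conv2': 4, 'fc': 8}
```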
arXiv Detail & Related papers (2023-07-06T09:57:48Z)
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
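Such a guarantee rests on a worst-case bound: a dot product of length n between unsigned a-bit activations and signed w-bit weights can never exceed n * (2^a - 1) * 2^(w-1) in magnitude, so an accumulator sized for that bound cannot overflow. A small sketch of the arithmetic (a generic bound, not necessarily the paper's exact formulation):

```python
import math

def accumulator_bits(n, act_bits, weight_bits):
    """Smallest signed accumulator width that cannot overflow for a dot
    product of length n with unsigned act_bits activations and signed
    weight_bits weights (worst-case bound)."""
    max_act = (1 << act_bits) - 1          # largest unsigned activation
    max_w = 1 << (weight_bits - 1)         # largest weight magnitude
    max_mag = n * max_act * max_w          # worst-case |accumulated sum|
    return math.ceil(math.log2(max_mag + 1)) + 1  # +1 for the sign bit

# e.g. a 512-long dot product of 4-bit activations and 4-bit weights:
print(accumulator_bits(512, 4, 4))  # 17 bits suffice, vs. a default 32
```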
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
- Mixed Precision Quantization of Transformer Language Models for Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications.
Current low-bit quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of the system to quantization errors.
The optimal local precision settings are automatically learned using two techniques.
Experiments were conducted on the Penn Treebank (PTB) corpus and on an LF-MMI TDNN system trained on the Switchboard corpus.
arXiv Detail & Related papers (2021-11-29T09:57:00Z)
- BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization [32.770842274996774]
Mixed-precision quantization can potentially achieve the optimal tradeoff between performance and compression rate of deep neural networks.
Previous methods either examine only a small manually-designed search space or utilize a cumbersome neural architecture search to explore the vast search space.
This work proposes bit-level sparsity quantization (BSQ) to tackle the mixed-precision quantization from a new angle of inducing bit-level sparsity.
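The key object in this view is the bit-plane decomposition of integer weights: a layer's effective precision is the highest bit plane that still contains nonzero bits, so driving high planes to zero lowers the bit width. A minimal numpy sketch of the representation (not BSQ's training procedure):

```python
import numpy as np

def bit_decompose(w_int, num_bits=8):
    """Split integer weight magnitudes into bit planes: |w| = sum_b plane_b * 2^b."""
    mag = np.abs(w_int).astype(np.int64)
    planes = np.stack([(mag >> b) & 1 for b in range(num_bits)])
    return planes, np.sign(w_int)

def effective_bits(planes):
    """Precision after bit-level sparsity: index of the highest bit plane
    that still contains any nonzero bit, plus one."""
    nonzero = [b for b in range(planes.shape[0]) if planes[b].any()]
    return (max(nonzero) + 1) if nonzero else 0

w = np.array([3, -5, 0, 12, -1])           # toy 8-bit integer weights
planes, signs = bit_decompose(w)
# The decomposition is exact: reconstructing from bit planes recovers w.
assert np.array_equal((planes * (1 << np.arange(8))[:, None]).sum(0) * signs, w)
print(effective_bits(planes))              # 4: these weights fit in 4 bits
```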
arXiv Detail & Related papers (2021-02-20T22:37:41Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
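"Dyadic" here means every rescaling factor is expressed as an integer divided by a power of two, so requantization needs only an integer multiply and a shift and never touches floating point at inference. A minimal sketch of that trick, with illustrative bit widths rather than HAWQV3's exact choices:

```python
def dyadic_rescale(x_int, scale, shift_bits=16):
    """Approximate y = round(x * scale) using only integer multiply and
    shift, i.e. scale ~= m / 2^shift_bits with integer m."""
    m = round(scale * (1 << shift_bits))        # dyadic numerator
    return (x_int * m + (1 << (shift_bits - 1))) >> shift_bits  # round half up

# Requantize an int32 accumulator by scale = s_in * s_w / s_out:
print(dyadic_rescale(12345, 0.0173))  # 214, matching round(12345 * 0.0173)
```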
arXiv Detail & Related papers (2020-11-20T23:51:43Z)
- Searching for Low-Bit Weights in Quantized Neural Networks [129.8319019563356]
Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators.
We propose regarding the discrete weights in an arbitrary quantized neural network as searchable variables, and use a differentiable method to search them accurately.
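One common way to make such a search differentiable is to keep a learnable distribution over the allowed levels per weight and use its softmax-weighted mean during training; whether this matches the paper's exact method is an assumption. A toy sketch:

```python
import numpy as np

def soft_weight(logits, levels, temperature=1.0):
    """Differentiable surrogate for a discrete weight: a softmax-weighted
    mix of the allowed quantization levels. The logits are learned during
    training; at inference the argmax level is taken."""
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return (p * levels).sum()          # expected (soft) weight value

levels = np.array([-1.0, 0.0, 1.0])    # e.g. ternary weight candidates
logits = np.array([0.2, 0.1, 1.5])     # learned preference per level
print(soft_weight(logits, levels))     # soft value used during training
print(levels[np.argmax(logits)])       # hard value at inference: 1.0
```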
arXiv Detail & Related papers (2020-09-18T09:13:26Z)
- FracBits: Mixed Precision Quantization via Fractional Bit-Widths [29.72454879490227]
Mixed-precision quantization is favorable on customized hardware supporting arithmetic operations at multiple bit-widths.
We propose a novel learning-based algorithm to derive mixed precision models end-to-end under target computation constraints.
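The fractional relaxation can be read as linear interpolation between the two neighbouring integer-bit quantizers, which makes the bit width itself differentiable. A small sketch of that idea (the paper's constraint handling and gradient details are omitted):

```python
import math
import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization of x in [-1, 1] to an integer bit width."""
    levels = (1 << (bits - 1)) - 1
    return np.round(np.clip(x, -1, 1) * levels) / levels

def frac_quantize(x, b):
    """Fractional bit width b: interpolate between the floor(b)- and
    ceil(b)-bit quantizations, so gradients can flow into b itself."""
    lo, hi = math.floor(b), math.ceil(b)
    if lo == hi:
        return quantize(x, lo)
    return (hi - b) * quantize(x, lo) + (b - lo) * quantize(x, hi)

x = np.array([0.3, -0.7, 0.95])
print(frac_quantize(x, 3.4))   # between the 3-bit and 4-bit quantizations
```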
arXiv Detail & Related papers (2020-07-04T06:09:09Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
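The core trick, Quant-Noise, quantizes only a random subset of weights at each training step so that unbiased gradients still flow through the remainder. A minimal elementwise sketch (the paper applies the noise blockwise, e.g. for product quantization):

```python
import numpy as np

def int8_fake_quant(w):
    """Symmetric int8 fake-quantization (quantize and dequantize)."""
    s = float(np.abs(w).max()) / 127.0
    s = s if s > 0 else 1.0
    return np.round(w / s) * s

def quant_noise(w, quantize_fn, p=0.5, rng=None):
    """Quantize only a random fraction p of the weights; the rest stay in
    full precision so their gradients are exact during training."""
    rng = rng or np.random.default_rng()
    mask = rng.random(w.shape) < p
    return np.where(mask, quantize_fn(w), w)

w = np.random.default_rng(0).normal(size=5)
print(quant_noise(w, int8_fake_quant, p=0.5))
```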
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.