Sub-8-Bit Quantization Aware Training for 8-Bit Neural Network
Accelerator with On-Device Speech Recognition
- URL: http://arxiv.org/abs/2206.15408v1
- Date: Thu, 30 Jun 2022 16:52:07 GMT
- Title: Sub-8-Bit Quantization Aware Training for 8-Bit Neural Network
Accelerator with On-Device Speech Recognition
- Authors: Kai Zhen, Hieu Duy Nguyen, Raviteja Chinta, Nathan Susanj, Athanasios
Mouchtaris, Tariq Afzal, Ariya Rastrow
- Abstract summary: We present a novel sub-8-bit quantization-aware training (S8BQAT) scheme for 8-bit neural network accelerators.
We are able to increase the model parameter size to reduce the word error rate by a relative 4-16%, while still improving latency by 5%.
- Score: 19.949933989959682
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a novel sub-8-bit quantization-aware training (S8BQAT) scheme for
8-bit neural network accelerators. Our method is inspired by Lloyd-Max
compression theory with practical adaptations for a feasible computational
overhead during training. With the quantization centroids derived from a 32-bit
baseline, we augment training loss with a Multi-Regional Absolute Cosine
(MRACos) regularizer that aggregates weights towards their nearest centroid,
effectively acting as a pseudo compressor. Additionally, a periodically invoked
hard compressor is introduced to improve the convergence rate by emulating
runtime model weight quantization. We apply S8BQAT on speech recognition tasks
using the Recurrent Neural Network Transducer (RNN-T) architecture. With S8BQAT, we
are able to increase the model parameter size to reduce the word error rate by a
relative 4-16%, while still improving latency by 5%.
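The abstract describes the training recipe only at a high level, so below is a minimal PyTorch sketch of that flow under stated assumptions: centroids are derived from the 32-bit baseline with a simple 1-D k-means stand-in for Lloyd-Max, the MRACos regularizer (whose exact form the abstract does not give) is replaced by a plain nearest-centroid L1 pull, and the periodic hard compressor is a snap-to-nearest-centroid step. All function names are illustrative, not from the paper.

```python
import torch


def derive_centroids(weight: torch.Tensor, num_bits: int = 5, iters: int = 20) -> torch.Tensor:
    """1-D k-means over a layer's weights as a simple stand-in for Lloyd-Max."""
    k = 2 ** num_bits
    w = weight.detach().flatten()
    if w.numel() > 100_000:  # subsample very large layers to keep the sketch cheap
        w = w[torch.randperm(w.numel(), device=w.device)[:100_000]]
    centroids = torch.linspace(float(w.min()), float(w.max()), k, device=w.device, dtype=w.dtype)
    for _ in range(iters):
        idx = (w[:, None] - centroids[None, :]).abs().argmin(dim=1)
        for j in range(k):
            members = w[idx == j]
            if members.numel() > 0:
                centroids[j] = members.mean()
    return centroids


def soft_centroid_penalty(weight: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    """Soft pull toward the nearest centroid (an L1 surrogate for the MRACos term)."""
    dist = (weight.flatten()[:, None] - centroids[None, :]).abs()
    return dist.min(dim=1).values.mean()


@torch.no_grad()
def hard_compress(weight: torch.Tensor, centroids: torch.Tensor) -> None:
    """Periodic 'hard compressor': snap every weight to its nearest centroid in place."""
    dist = (weight.flatten()[:, None] - centroids[None, :]).abs()
    weight.copy_(centroids[dist.argmin(dim=1)].reshape(weight.shape))
```

In a training loop one would minimize the task loss plus a weighted sum of soft_centroid_penalty over the quantized layers, and call hard_compress on each layer every few hundred updates to emulate runtime weight quantization, mirroring the pseudo-compressor / hard-compressor split described above.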
Related papers
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for executing ultra-low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
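A minimal NumPy sketch of the lookup-table idea summarized above, assuming 2-bit weight codes: because a 2-bit weight takes only four values, each multiply can be replaced by a gather from a small per-activation product table. This illustrates the concept only, not DeepGEMM's SIMD kernels or its exact table layout.

```python
import numpy as np


def lut_matvec(codes: np.ndarray, x_q: np.ndarray, levels: np.ndarray) -> np.ndarray:
    """codes: (rows, k) weight codes in {0..3}; x_q: (k,) integer activations; levels: (4,) weight values."""
    # Build one tiny product table per activation: table[c, i] = levels[c] * x_q[i].
    table = np.outer(levels, x_q).astype(np.int32)    # shape (4, k)
    # Replace every multiply with a gather from the table, then accumulate.
    prods = table[codes, np.arange(x_q.shape[0])]     # shape (rows, k)
    return prods.sum(axis=1)


# Example with hypothetical 2-bit levels {-2, -1, 1, 2} and int8-range activations.
rng = np.random.default_rng(0)
codes = rng.integers(0, 4, size=(16, 64))
x_q = rng.integers(-128, 128, size=64)
y = lut_matvec(codes, x_q, np.array([-2, -1, 1, 2]))
```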
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
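For intuition about the overflow constraint mentioned above, the sketch below computes a conservative worst-case accumulator width for an integer dot product; this is the textbook bound under assumed unsigned activations and signed weights, not necessarily the exact guarantee or training rule used in the paper.

```python
import math


def min_accumulator_bits(k: int, act_bits: int, wt_bits: int) -> int:
    """Conservative signed accumulator width for a length-k dot product of
    unsigned act_bits-bit activations and signed wt_bits-bit weights."""
    max_magnitude = k * (2 ** act_bits - 1) * (2 ** (wt_bits - 1))
    return 1 + math.ceil(math.log2(max_magnitude))  # sign bit + magnitude bits


# min_accumulator_bits(512, 8, 8) == 25, so guaranteeing e.g. a 16-bit accumulator
# means shrinking the dot-product length and/or the weight/activation bit widths.
```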
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
- Sub-8-bit quantization for on-device speech recognition: a regularization-free approach [19.84792318335999]
General Quantizer (GQ) is a regularization-free, "soft-to-hard" compression mechanism with self-adjustable centroids.
GQ can compress both RNN-T and Conformer into sub-8-bit, and for some RNN-T layers, to 1-bit for fast and accurate inference.
arXiv Detail & Related papers (2022-10-17T15:42:26Z)
- Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets [7.5195830365852085]
We propose a novel sub 8-bit quantization aware training algorithm for all components of a 250K parameter feedforward, streaming, state-free keyword spotting model.
We conduct large scale experiments, training on 26,000 hours of de-identified production, far-field and near-field audio data.
arXiv Detail & Related papers (2022-07-13T17:46:08Z)
- 4-bit Conformer with Native Quantization Aware Training for Speech Recognition [13.997832593421577]
We propose to develop 4-bit ASR models with native quantization aware training, which leverages native integer operations to effectively optimize both training and inference.
We conducted two experiments on state-of-the-art Conformer-based ASR models to evaluate our proposed quantization technique.
For the first time, we investigate and reveal the viability of 4-bit quantization on a practical ASR system trained with large-scale datasets.
arXiv Detail & Related papers (2022-03-29T23:57:15Z)
- ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training [68.63354877166756]
ActNN is a memory-efficient training framework that stores randomly quantized activations for backpropagation.
ActNN reduces the memory footprint of activations by 12x and enables training with a 6.6x to 14x larger batch size.
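A minimal PyTorch sketch of the core idea as summarized above: quantize the activations saved for the backward pass to 2 bits with stochastic rounding and dequantize them when gradients are computed. ActNN's per-group statistics and mixed-precision machinery are omitted; the min/max scaling here is an illustrative assumption.

```python
import torch


def quantize_2bit(x: torch.Tensor):
    """Compress an activation tensor to 2-bit codes with stochastic rounding."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp(min=1e-8) / 3.0                   # 4 levels: codes 0..3
    q = (x - lo) / scale
    q = torch.floor(q + torch.rand_like(q)).clamp(0, 3).to(torch.uint8)  # unbiased stochastic rounding
    return q, lo, scale


def dequantize_2bit(q: torch.Tensor, lo: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reconstruct an approximate activation tensor for the backward pass."""
    return q.to(torch.float32) * scale + lo
```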
arXiv Detail & Related papers (2021-04-29T05:50:54Z)
- An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively.
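A hedged sketch of threshold-based sparsification with a statistically estimated threshold, assuming gradient magnitudes are roughly exponential so the threshold for a target tail probability has a closed form; the actual SIDCo scheme fits a family of sparsity-inducing distributions in multiple stages, so this single-stage version is illustrative only.

```python
import math
import torch


def sidco_like_sparsify(grad: torch.Tensor, ratio: float = 0.01):
    """Keep roughly a `ratio` fraction of gradient entries via an estimated threshold."""
    mean_mag = grad.abs().mean()                 # MLE scale of an assumed exponential fit
    threshold = -mean_mag * math.log(ratio)      # solve P(|g| > t) = exp(-t / mean) = ratio
    mask = grad.abs() >= threshold
    values = grad[mask]                          # sparse values to communicate
    indices = mask.nonzero(as_tuple=False)       # their coordinates
    return values, indices, threshold
```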
arXiv Detail & Related papers (2021-01-26T13:06:00Z)
- EasyQuant: Post-training Quantization via Scale Optimization [15.443708111143412]
8-bit quantization has been widely applied to accelerate network inference in various deep learning applications.
There are two kinds of quantization methods: training-based quantization and post-training quantization.
arXiv Detail & Related papers (2020-06-30T10:43:02Z)
- Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We replace conventional ReLU with Bounded ReLU and find that the performance decline is due to activation quantization.
Our integer networks achieve performance equivalent to the corresponding floating-point (FPN) networks, but have only 1/4 the memory cost and run 2x faster on modern GPUs.
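A small PyTorch sketch of the Bounded ReLU idea referenced above: capping activations at a fixed ceiling keeps them in a known range, so activation quantization can use a fixed scale. The ceiling of 8 and the uint8 mapping are assumptions for illustration, not the paper's settings.

```python
import torch


def bounded_relu(x: torch.Tensor, ceiling: float = 8.0) -> torch.Tensor:
    """ReLU capped at a fixed ceiling so activations live in a known range."""
    return x.clamp(min=0.0, max=ceiling)


def quantize_activation_uint8(x: torch.Tensor, ceiling: float = 8.0):
    """Fixed-scale uint8 activation quantization enabled by the known bound."""
    scale = ceiling / 255.0
    q = torch.round(bounded_relu(x, ceiling) / scale).to(torch.uint8)
    return q, scale                              # dequantize as q.float() * scale
```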
arXiv Detail & Related papers (2020-06-21T08:23:03Z)
- Shifted and Squeezed 8-bit Floating Point format for Low-Precision Training of Deep Neural Networks [13.929168096016957]
We introduce a novel methodology for training deep neural networks using 8-bit floating point (FP8) numbers.
Reduced bit precision allows for a larger effective memory and increased computational speed.
We show that, unlike previous 8-bit precision training methods, the proposed method works out-of-the-box for representative models.
arXiv Detail & Related papers (2020-01-16T06:38:27Z)
- Towards Unified INT8 Training for Convolutional Neural Network [83.15673050981624]
We build a unified 8-bit (INT8) training framework for common convolutional neural networks.
First, we empirically identify four distinctive characteristics of gradients, which provide insightful clues for gradient quantization.
We then propose two universal techniques, including Direction Sensitive Gradient Clipping, which reduces the direction deviation of gradients.
arXiv Detail & Related papers (2019-12-29T08:37:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.