Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded
Chipsets
- URL: http://arxiv.org/abs/2207.06920v1
- Date: Wed, 13 Jul 2022 17:46:08 GMT
- Title: Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded
Chipsets
- Authors: Lu Zeng, Sree Hari Krishnan Parthasarathi, Yuzong Liu, Alex Escott,
Santosh Cheekatmalla, Nikko Strom, Shiv Vitaladevuni
- Abstract summary: We propose a novel sub 8-bit quantization aware training algorithm for all components of a 250K parameter feedforward, streaming, state-free keyword spotting model.
We conduct large scale experiments, training on 26,000 hours of de-identified production, far-field and near-field audio data.
- Score: 7.5195830365852085
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel 2-stage sub 8-bit quantization aware training algorithm
for all components of a 250K parameter feedforward, streaming, state-free
keyword spotting model. For the 1st-stage, we adapt a recently proposed
quantization technique using a non-linear transformation with tanh(.) on dense
layer weights. In the 2nd-stage, we use linear quantization methods on the rest
of the network, including other parameters (bias, gain, batchnorm), inputs, and
activations. We conduct large scale experiments, training on 26,000 hours of
de-identified production, far-field and near-field audio data (evaluating on
4,000 hours of data). We organize our results in two embedded chipset settings:
a) with commodity ARM NEON instruction set and 8-bit containers, we present
accuracy, CPU, and memory results using sub 8-bit weights (4, 5, 8-bit) and
8-bit quantization of the rest of the network; b) with off-the-shelf neural network
accelerators, for a range of weight bit widths (1 and 5-bit), while presenting
accuracy results, we project reduction in memory utilization. In both
configurations, our results show that the proposed algorithm can achieve: a)
parity with a full floating point model's operating point on a detection error
tradeoff (DET) curve in terms of false detection rate (FDR) at false rejection
rate (FRR); b) significant reduction in compute and memory, yielding up to 3
times improvement in CPU consumption and more than 4 times improvement in
memory consumption.
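To make the two-stage scheme concrete, here is a minimal NumPy sketch, assuming a DoReFa-style tanh weight transform for the 1st stage and symmetric absmax ("fake quant") linear quantization for the 2nd stage; the function names, bit choices, and scaling details are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a 2-stage quantization scheme in the spirit of the
# abstract: a tanh(.) based non-linear transform for dense-layer weights
# (stage 1) and plain linear quantization for everything else (stage 2).
# The exact scaling and training details here are assumptions.
import numpy as np

def tanh_quantize_weights(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Stage 1: squash weights with tanh, normalize to [-1, 1],
    then quantize uniformly to 2**bits levels (DoReFa-style)."""
    t = np.tanh(w)
    t = t / (2.0 * np.max(np.abs(t)) + 1e-12) + 0.5      # map to [0, 1]
    levels = 2 ** bits - 1
    q = np.round(t * levels) / levels                      # uniform grid in [0, 1]
    return 2.0 * q - 1.0                                   # back to [-1, 1]

def linear_quantize(x: np.ndarray, bits: int = 8) -> np.ndarray:
    """Stage 2: symmetric linear quantization for activations, bias, gain, etc."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax + 1e-12
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                                       # dequantized ("fake quant") values

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.5, size=(64, 32))
    a = rng.normal(size=(1, 64))
    w_q = tanh_quantize_weights(w, bits=4)                 # sub 8-bit weights
    a_q = linear_quantize(a, bits=8)                       # 8-bit activations
    print("unique weight levels:", np.unique(w_q).size)    # <= 2**4
```

In an actual quantization-aware training loop, these fake-quantized values would typically be used in the forward pass with a straight-through estimator for the gradients.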
Related papers
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
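The lookup-table idea can be illustrated with a toy ultra-low-precision dot product: with 2-bit operands there are only 16 possible products, so multiplications become table lookups. This is a sketch of the general idea under assumed code books, not DeepGEMM's packed SIMD kernels.

```python
# Toy lookup-table dot product for ultra low-precision operands: with 2-bit
# weights and 2-bit activations there are only 4 x 4 possible products, so
# multiplications can be replaced by table lookups. Code books are assumed.
import numpy as np

W_LEVELS = np.array([-2, -1, 0, 1])        # assumed 2-bit signed weight codes
A_LEVELS = np.array([0, 1, 2, 3])          # assumed 2-bit unsigned activation codes

# Precompute every possible product once.
LUT = W_LEVELS[:, None] * A_LEVELS[None, :]          # shape (4, 4)

def lut_dot(w_codes: np.ndarray, a_codes: np.ndarray) -> int:
    """Dot product using only table lookups and additions."""
    return int(LUT[w_codes, a_codes].sum())

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    w = rng.integers(0, 4, size=128)                 # indices into W_LEVELS
    a = rng.integers(0, 4, size=128)                 # indices into A_LEVELS
    assert lut_dot(w, a) == int((W_LEVELS[w] * A_LEVELS[a]).sum())
    print("LUT dot product:", lut_dot(w, a))
```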
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - Quantized Neural Networks for Low-Precision Accumulation with Guaranteed
Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
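One standard worst-case bound illustrates why accumulator precision can be reduced safely; this is a generic bound for a fixed-length dot product, not necessarily the paper's exact criterion.

```python
# Worst-case accumulator width: for a dot product of length k with unsigned
# a-bit activations and signed w-bit weights, the accumulator never overflows
# if it can hold the largest possible partial-sum magnitude. Generic bound
# for illustration only.
import math

def min_accumulator_bits(k: int, act_bits: int, weight_bits: int) -> int:
    max_act = (1 << act_bits) - 1                  # largest unsigned activation
    max_w = 1 << (weight_bits - 1)                 # |most negative signed weight|
    max_mag = k * max_act * max_w                  # worst-case |sum|
    return math.ceil(math.log2(max_mag)) + 1       # +1 for the sign bit

if __name__ == "__main__":
    # e.g. a 512-long dot product with 8-bit activations and 4-bit weights
    print(min_accumulator_bits(512, act_bits=8, weight_bits=4))   # -> 21
```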
arXiv Detail & Related papers (2023-01-31T02:46:57Z) - Sub-8-Bit Quantization Aware Training for 8-Bit Neural Network
Accelerator with On-Device Speech Recognition [19.949933989959682]
We present a novel sub-8-bit quantization-aware training (S8BQAT) scheme for 8-bit neural network accelerators.
We are able to increase the model parameter size to reduce the word error rate by 4-16% relative, while still improving latency by 5%.
arXiv Detail & Related papers (2022-06-30T16:52:07Z) - 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
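Block-wise quantization of an optimizer state can be sketched as follows; the block size and the symmetric int8 absmax mapping are assumptions in the spirit of the abstract, not the authors' kernels.

```python
# Sketch of block-wise 8-bit quantization of an optimizer state tensor:
# the state is split into fixed-size blocks and each block gets its own
# absmax scale, which localizes the effect of outliers.
import numpy as np

def blockwise_quantize(state: np.ndarray, block: int = 2048):
    flat = state.ravel()
    pad = (-flat.size) % block                              # pad to a whole number of blocks
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0 + 1e-12
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales, state.shape, pad

def blockwise_dequantize(q, scales, shape, pad):
    flat = (q.astype(np.float32) * scales).ravel()
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)

if __name__ == "__main__":
    m = np.random.default_rng(2).normal(size=(1000, 37)).astype(np.float32)  # e.g. Adam's first moment
    q, s, shape, pad = blockwise_quantize(m)
    err = np.abs(blockwise_dequantize(q, s, shape, pad) - m).max()
    print("int8 state bytes:", q.nbytes, "max abs error:", err)
```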
arXiv Detail & Related papers (2021-10-06T15:43:20Z) - Post-Training Sparsity-Aware Quantization [2.2530496464901106]
Quantization is a technique used in deep neural networks (DNNs) to increase execution performance and hardware efficiency.
We propose a sparsity-aware quantization (SPARQ) method, in which the unstructured and dynamic activation sparsity is leveraged in different representation granularities.
SPARQ achieves minor accuracy degradation, 2x speedup over widely used hardware architectures, and a practical hardware implementation.
arXiv Detail & Related papers (2021-05-23T20:12:35Z) - VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision
Neural Network Inference [7.886868529510128]
Quantization maps floating-point weights and activations in a trained model to low-bitwidth integer values using scale factors.
Excessive quantization, reducing precision too aggressively, results in accuracy degradation.
Per-vector scale factors can be implemented with low-bitwidth integers when using a two-level quantization scheme.
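A toy version of the two-level, per-vector scheme: each small vector gets its own scale, and those scales are themselves quantized to low-bit integers against a single per-tensor float scale. The vector size and bit widths below are assumptions.

```python
# Toy two-level, per-vector scaled quantization: each 16-element vector gets
# its own scale, and the per-vector scales are quantized to low-bit integers
# against one per-tensor float scale.
import numpy as np

def per_vector_quantize(x, vec=16, data_bits=4, scale_bits=4):
    qmax = 2 ** (data_bits - 1) - 1                # e.g. 7 for 4-bit data
    smax = 2 ** scale_bits - 1                     # e.g. 15 for 4-bit scales
    v = x.reshape(-1, vec)
    fine = np.abs(v).max(axis=1, keepdims=True) / qmax + 1e-12   # ideal per-vector scales
    coarse = fine.max() / smax                                    # one float per tensor
    s_int = np.clip(np.round(fine / coarse), 1, smax)             # integer per-vector scales
    q = np.clip(np.round(v / (s_int * coarse)), -qmax - 1, qmax)  # low-bit data
    return q * s_int * coarse                                     # dequantized values

if __name__ == "__main__":
    w = np.random.default_rng(3).normal(size=(64, 64))
    w_hat = per_vector_quantize(w).reshape(64, 64)
    print("mean abs error:", np.abs(w_hat - w).mean())
```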
arXiv Detail & Related papers (2021-02-08T19:56:04Z) - HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
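The dyadic idea can be sketched directly: a real-valued rescale factor is approximated as b / 2^c, so applying it at inference needs only an integer multiply and a right shift. This is a simplified illustration, not the HAWQV3 implementation.

```python
# Dyadic requantization sketch: approximate a float rescale factor as
# b / 2**c (b, c integers), so the rescale is an integer multiply plus an
# arithmetic right shift -- no floating point at inference time.
import numpy as np

def to_dyadic(scale: float, c: int = 16):
    """Approximate `scale` as b / 2**c with integer b."""
    return int(round(scale * (1 << c))), c

def dyadic_rescale(acc: np.ndarray, b: int, c: int) -> np.ndarray:
    """Integer-only rescale of an int32 accumulator: (acc * b) >> c."""
    return (acc.astype(np.int64) * b) >> c

if __name__ == "__main__":
    scale = 0.0123                                   # e.g. s_in * s_w / s_out
    b, c = to_dyadic(scale)
    acc = np.random.default_rng(4).integers(-50_000, 50_000, size=8, dtype=np.int32)
    print(dyadic_rescale(acc, b, c))                 # integer-only path
    print(np.floor(acc * scale).astype(np.int64))    # close to the float reference
```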
arXiv Detail & Related papers (2020-11-20T23:51:43Z) - Fast Implementation of 4-bit Convolutional Neural Networks for Mobile
Devices [0.8362190332905524]
We show an efficient implementation of 4-bit matrix multiplication for quantized neural networks.
We also demonstrate a 4-bit quantized neural network for OCR recognition on the MIDV-500 dataset.
The results show that 4-bit quantization is well suited to mobile devices, yielding acceptable accuracy and low inference time.
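The storage side of 4-bit inference can be sketched by packing two 4-bit codes per byte, halving memory versus int8; the layout below is an assumption, and real mobile kernels unpack inside SIMD registers.

```python
# Pack two unsigned 4-bit weight codes into each byte, halving memory
# versus int8 storage. The interleaved layout is illustrative.
import numpy as np

def pack_nibbles(codes: np.ndarray) -> np.ndarray:
    """Pack unsigned 4-bit codes (0..15), two per byte."""
    assert codes.size % 2 == 0 and codes.max() < 16
    lo, hi = codes[0::2], codes[1::2]
    return (lo | (hi << 4)).astype(np.uint8)

def unpack_nibbles(packed: np.ndarray) -> np.ndarray:
    lo = packed & 0x0F
    hi = packed >> 4
    return np.stack([lo, hi], axis=1).ravel()

if __name__ == "__main__":
    codes = np.random.default_rng(5).integers(0, 16, size=1024, dtype=np.uint8)
    packed = pack_nibbles(codes)
    assert np.array_equal(unpack_nibbles(packed), codes)
    print("packed bytes:", packed.nbytes, "vs int8 storage:", codes.nbytes)
```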
arXiv Detail & Related papers (2020-09-14T14:48:40Z) - Leveraging Automated Mixed-Low-Precision Quantization for tiny edge
microcontrollers [76.30674794049293]
This paper presents an automated mixed-precision quantization flow based on the HAQ framework but tailored for the memory and computational characteristics of MCU devices.
Specifically, a Reinforcement Learning agent searches for the best uniform quantization levels, among 2, 4, 8 bits, of individual weight and activation tensors.
Given an MCU-class memory bound of 2MB for weight-only quantization, the compressed models produced by the mixed-precision engine are as accurate as state-of-the-art solutions.
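The memory constraint driving that search can be checked directly: given a candidate bit width per weight tensor, the packed weight footprint must fit the 2MB budget. The layer sizes below are hypothetical.

```python
# Check whether a candidate per-tensor bit-width assignment fits an MCU
# weight-memory budget (2 MB here). Layer sizes are made up for illustration.
def weight_bytes(layer_params: dict[str, int], bits: dict[str, int]) -> int:
    """Total packed weight storage in bytes for a bit-width assignment."""
    return sum((n * bits[name] + 7) // 8 for name, n in layer_params.items())

if __name__ == "__main__":
    layers = {"conv1": 200_000, "conv2": 1_200_000, "fc": 600_000}   # hypothetical sizes
    assignment = {"conv1": 8, "conv2": 4, "fc": 2}                   # candidate from the search
    total = weight_bytes(layers, assignment)
    print(total, "bytes, fits 2MB:", total <= 2 * 1024 * 1024)
```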
arXiv Detail & Related papers (2020-08-12T06:09:58Z) - Bayesian Bits: Unifying Quantization and Pruning [73.27732135853243]
We introduce Bayesian Bits, a practical method for joint mixed precision quantization and pruning through gradient based optimization.
We experimentally validate our proposed method on several benchmark datasets and show that we can learn pruned, mixed precision networks.
arXiv Detail & Related papers (2020-05-14T16:00:34Z) - Shifted and Squeezed 8-bit Floating Point format for Low-Precision
Training of Deep Neural Networks [13.929168096016957]
We introduce a novel methodology for training deep neural networks using 8-bit floating point (FP8) numbers.
Reduced bit precision allows for a larger effective memory and increased computational speed.
We show that, unlike previous 8-bit precision training methods, the proposed method works out-of-the-box for representative models.
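A generic FP8 decode with an adjustable exponent bias ("shift") illustrates why a movable exponent range matters for 8-bit training; the paper's shifted-and-squeezed transform is a different remapping, so this is only a sketch.

```python
# Generic FP8 decode with an adjustable exponent bias ("shift"): 1 sign bit,
# 5 exponent bits, 2 mantissa bits. Illustrative only; not the paper's exact
# shifted-and-squeezed format.
def fp8_decode(byte: int, shift: int = 15) -> float:
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 2) & 0x1F                      # 5 exponent bits
    man = byte & 0x03                             # 2 mantissa bits
    if exp == 0:                                  # subnormals
        return sign * (man / 4.0) * 2.0 ** (1 - shift)
    return sign * (1.0 + man / 4.0) * 2.0 ** (exp - shift)

if __name__ == "__main__":
    for b in (0b0_01111_00, 0b0_01111_10, 0b1_10000_01):
        print(f"{b:08b} -> {fp8_decode(b)}")       # 1.0, 1.5, -2.5
```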
arXiv Detail & Related papers (2020-01-16T06:38:27Z)