Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed
FP16-INT8 Post-Training Quantization
- URL: http://arxiv.org/abs/2210.07692v1
- Date: Fri, 14 Oct 2022 10:32:05 GMT
- Title: Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed
FP16-INT8 Post-Training Quantization
- Authors: Manuele Rusci, Marco Fariselli, Martin Croome, Francesco Paci, Eric
Flamand
- Abstract summary: Speech Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) are deployed on a state-of-the-art MicroController Unit (MCU).
We propose an optimized software pipeline interleaving parallel computation of LSTM or GRU recurrent blocks with manually-managed memory transfers.
Experiments are conducted on multiple LSTM- and GRU-based SE models trained on the Valentini dataset, featuring up to 1.24M parameters.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an optimized methodology to design and deploy Speech
Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) on a
state-of-the-art MicroController Unit (MCU) with 1+8 general-purpose RISC-V
cores. To achieve low-latency execution, we propose an optimized software
pipeline that interleaves the parallel computation of LSTM or GRU recurrent
blocks, running on vectorized 8-bit integer (INT8) and 16-bit floating-point
(FP16) compute units, with manually-managed memory transfers of the model
parameters. To ensure minimal accuracy degradation with respect to the
full-precision models, we propose a novel FP16-INT8 Mixed-Precision
Post-Training Quantization (PTQ) scheme that compresses the recurrent layers
to 8-bit integers while keeping the remaining layers in FP16. Experiments are
conducted on multiple LSTM- and GRU-based SE models trained on the Valentini
dataset, featuring up to 1.24M parameters. Thanks to the proposed approaches,
we speed up the computation by up to 4x with respect to the lossless FP16
baselines. Unlike uniform 8-bit quantization, which degrades the PESQ score by
0.3 on average, the Mixed-Precision PTQ scheme limits the degradation to only
0.06 while achieving a 1.4-1.7x memory saving. Thanks to this compression, we
cut the power cost of the external memory by fitting the large models in the
limited on-chip non-volatile memory, and we gain an MCU power saving of up to
2.5x by reducing the supply voltage from 0.8V to 0.65V while still meeting the
real-time constraints. Our design is 10x more energy efficient than
state-of-the-art SE solutions deployed on single-core MCUs that rely on
smaller models and quantization-aware training.
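To make the Mixed-Precision PTQ scheme concrete, the NumPy sketch below applies symmetric per-tensor INT8 quantization to the recurrent weight matrices of a toy LSTM-based SE model while leaving the other tensors in FP16. The layer names, tensor shapes, and absmax scaling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 PTQ: the scale comes from the absolute maximum."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def mixed_precision_ptq(params, recurrent_keys):
    """Compress recurrent weights to INT8; keep every other tensor in FP16."""
    compressed = {}
    for name, w in params.items():
        if name in recurrent_keys:
            q, scale = quantize_int8(w)
            compressed[name] = ("int8", q, scale)                    # 1 byte per weight + a scale
        else:
            compressed[name] = ("fp16", w.astype(np.float16), None)  # 2 bytes per weight
    return compressed

# Toy model: one LSTM layer (input and recurrent kernels) plus an FP16 output layer.
rng = np.random.default_rng(0)
params = {
    "lstm.W_ih": rng.standard_normal((4 * 128, 257)).astype(np.float32),
    "lstm.W_hh": rng.standard_normal((4 * 128, 128)).astype(np.float32),
    "fc_out.W":  rng.standard_normal((257, 128)).astype(np.float32),
}
compressed = mixed_precision_ptq(params, recurrent_keys={"lstm.W_ih", "lstm.W_hh"})
for name, (kind, tensor, _) in compressed.items():
    print(f"{name}: {kind}, {tensor.nbytes} bytes")
```

Because only the recurrent tensors drop from 16 to 8 bits per weight, the overall saving depends on how much of the model sits in the recurrent blocks, which is consistent with the 1.4-1.7x memory reduction reported above.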
Related papers
- MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization [16.83403134551842]
Recent few-step diffusion models reduce inference time by cutting the number of denoising steps.
Post-Training Quantization (PTQ) replaces high-bit-width floating-point representations with low-bit integer values.
However, when applied to few-step diffusion models, existing quantization methods struggle to preserve both image quality and text alignment.
arXiv Detail & Related papers (2024-05-28T06:50:58Z)
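As a generic illustration of the PTQ step described in this entry (not MixDQ's metric-decoupled scheme), the sketch below rounds floating-point weights onto low-bit integer grids and measures the reconstruction error that few-step diffusion models are sensitive to; the tensor and bit widths are arbitrary.

```python
import numpy as np

def uniform_ptq(x, bits):
    """Generic symmetric uniform PTQ: FP values -> low-bit integer codes -> dequantized FP."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    codes = np.clip(np.round(x / scale), -qmax, qmax)   # integer grid
    return codes * scale                                # dequantized approximation

w = np.random.default_rng(1).standard_normal(10_000).astype(np.float32)
for bits in (8, 4):
    err = np.abs(uniform_ptq(w, bits) - w).mean()
    print(f"{bits}-bit PTQ, mean absolute reconstruction error: {err:.4f}")
```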
- Optimizing the Deployment of Tiny Transformers on Low-Power MCUs [12.905978154498499]
This work aims to enable and optimize the flexible, multi-platform deployment of encoder Tiny Transformers on commercial MCUs.
Our framework provides an optimized library of kernels to maximize data reuse and avoid data marshaling operations in the critical attention block.
We show that our MHSA depth-first tiling scheme reduces the peak memory by up to 6.19x, while fused-weight attention reduces the runtime by 1.53x and the number of parameters by 25%.
arXiv Detail & Related papers (2024-04-03T14:14:08Z)
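A minimal sketch of the fused-weight attention idea mentioned in this entry, assuming it refers to folding the query and key projections into a single matrix offline (which matches a ~25% parameter reduction for a four-matrix MHSA block); the single-head setting and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 16                                  # embedding size, sequence length
X = rng.standard_normal((n, d))
W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))

# Standard attention scores: (X W_q)(X W_k)^T, with W_q and W_k both stored.
scores_standard = (X @ W_q) @ (X @ W_k).T

# Fused-weight variant: fold W_q W_k^T into one matrix offline, so only a single
# d x d matrix is stored and multiplied at run time.
W_qk = W_q @ W_k.T
scores_fused = X @ W_qk @ X.T

print(np.allclose(scores_standard, scores_fused))   # True: identical scores
```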
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [67.74400574357472]
LLMs are increasingly used for applications that require large context windows, where KV cache activations surface as the dominant contributor to memory consumption during inference.
Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision.
Our work, KVQuant, facilitates low-precision KV cache quantization by incorporating several novel methods.
arXiv Detail & Related papers (2024-01-31T18:58:14Z)
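The sketch below shows a simplified form of KV cache quantization: uniform 4-bit codes with per-channel scales for keys and per-token scales for values. It only illustrates the general idea; KVQuant's specific methods are not reproduced here, and the cache shapes are arbitrary.

```python
import numpy as np

def quantize_along(x, bits, axis):
    """Uniform signed quantization with one scale per slice along `axis`."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    codes = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return codes, scale

rng = np.random.default_rng(0)
keys = rng.standard_normal((128, 64)).astype(np.float32)    # cached keys (tokens, head_dim)
values = rng.standard_normal((128, 64)).astype(np.float32)  # cached values

k_codes, k_scale = quantize_along(keys, bits=4, axis=0)     # one scale per channel
v_codes, v_scale = quantize_along(values, bits=4, axis=1)   # one scale per token

print("key reconstruction error:  ", np.abs(k_codes * k_scale - keys).mean())
print("value reconstruction error:", np.abs(v_codes * v_scale - values).mean())
```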
- Reduced Precision Floating-Point Optimization for Deep Neural Network On-Device Learning on MicroControllers [15.37318446043671]
This paper introduces a novel reduced precision optimization technique for On-Device Learning (ODL) primitives on MCU-class devices.
Our approach is more than two orders of magnitude faster than existing ODL software frameworks for single-core MCUs.
arXiv Detail & Related papers (2023-05-30T16:14:16Z)
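As a toy illustration of reduced-precision on-device learning (not the paper's MCU kernels), the sketch below runs one forward/backward/update step of a single linear layer entirely in FP16; the layer, loss, and learning rate are arbitrary.

```python
import numpy as np

# One on-device learning step for a single FP16 linear layer under an L2 loss.
rng = np.random.default_rng(0)
W = rng.standard_normal((10, 32)).astype(np.float16)   # weights kept in FP16
x = rng.standard_normal(32).astype(np.float16)         # input activation
target = rng.standard_normal(10).astype(np.float16)
lr = np.float16(0.01)

y = W @ x                     # FP16 forward pass
err = y - target              # gradient of 0.5 * ||y - target||^2 w.r.t. y
grad_W = np.outer(err, x)     # FP16 backward pass for the weights
W -= lr * grad_W              # FP16 weight update

print(W.dtype, float((err ** 2).mean()))
```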
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for executing ultra-low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
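A minimal sketch of the lookup-table idea: with 2-bit weights there are only four possible weight values, so the product of each activation with every possible weight can be tabulated once and gathered instead of multiplied. This is a scalar NumPy illustration, not DeepGEMM's SIMD kernels, and the codebook values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = np.array([-1.5, -0.5, 0.5, 1.5], dtype=np.float32)   # assumed 2-bit weight values
w_codes = rng.integers(0, 4, size=256)                # weights stored as 2-bit codes
x = rng.standard_normal(256).astype(np.float32)       # activations

# Build a per-activation lookup table of all possible products, then replace the
# multiply-accumulate by a gather-accumulate.
lut = np.outer(x, codebook)                           # shape (256, 4)
y_lut = lut[np.arange(x.size), w_codes].sum()
y_ref = (x * codebook[w_codes]).sum()                 # reference dot product

print(np.isclose(y_lut, y_ref))                       # True
```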
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees the avoidance of numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
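For intuition on why accumulator precision matters, the sketch below computes the worst-case accumulator width for a dot product of signed weights and unsigned activations; the paper enforces such a constraint during training rather than deriving a post-hoc bound like this, so the example is only illustrative.

```python
import math

def accumulator_bits(k, weight_bits, act_bits):
    """Smallest signed accumulator width that provably cannot overflow for a
    k-term dot product of signed weights and unsigned activations."""
    max_w = 2 ** (weight_bits - 1)        # largest signed-weight magnitude
    max_a = 2 ** act_bits - 1             # largest unsigned activation
    worst_case = k * max_w * max_a        # worst-case |accumulated sum|
    return math.ceil(math.log2(worst_case)) + 1    # +1 for the sign bit

# Example: a 3x3 convolution over 64 input channels (k = 576).
print(accumulator_bits(576, 8, 8))   # 26 -> INT8 operands need a 32-bit accumulator
print(accumulator_bits(576, 4, 4))   # 18 -> even 4-bit operands exceed 16 bits here
```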
- Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization [35.198615417316056]
We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T).
We use a 4-bit integer representation for both weights and activations and apply Quantization-Aware Training (QAT) to retrain the full model.
We show that customized quantization schemes that are tailored to the local properties of the network are essential to achieve good performance.
arXiv Detail & Related papers (2022-06-16T02:17:49Z)
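A minimal sketch of the fake-quantization step used inside QAT forward passes, here with a symmetric 4-bit scheme for both weights and activations; the paper's customized per-layer schemes and the training loop itself are omitted.

```python
import numpy as np

def fake_quant(x, bits=4):
    """Quantize-dequantize pass used in QAT forward computation: the network is
    trained against the rounding error of a symmetric `bits`-bit grid while the
    master weights stay in floating point."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
a = np.maximum(rng.standard_normal(64), 0).astype(np.float32)   # ReLU-like activations

y = fake_quant(w) @ fake_quant(a)       # both weights and activations see 4-bit error
print(np.abs(y - w @ a).mean())         # accuracy impact that QAT learns to absorb
```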
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
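The sketch below illustrates block-wise quantization of an optimizer state tensor with per-block absmax scales, using a plain linear 8-bit code for brevity rather than the paper's non-linear code; the block size and tensor are illustrative.

```python
import numpy as np

def blockwise_quantize(state, block_size=2048, bits=8):
    """Block-wise absmax quantization of an optimizer state: each block keeps its
    own scale, so a single large value only degrades the precision of its block."""
    qmax = 2 ** (bits - 1) - 1
    flat = state.ravel()
    pad = (-flat.size) % block_size
    blocks = np.pad(flat, (0, pad)).reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    codes = np.clip(np.round(blocks / scales), -qmax, qmax).astype(np.int8)
    return codes, scales, state.shape, pad

def blockwise_dequantize(codes, scales, shape, pad):
    flat = (codes * scales).ravel()
    return flat[:flat.size - pad].reshape(shape)

momentum = np.random.default_rng(0).standard_normal((1000, 300)).astype(np.float32)
codes, scales, shape, pad = blockwise_quantize(momentum)
restored = blockwise_dequantize(codes, scales, shape, pad)
print("max dequantization error:", np.abs(restored - momentum).max())
```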
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
arXiv Detail & Related papers (2020-11-20T23:51:43Z)
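A small sketch of the dyadic idea behind integer-only quantization: approximating a real-valued requantization scale by b / 2^c so that rescaling an integer accumulator needs only an integer multiply and a shift. The scale value and shift amount below are arbitrary, not taken from HAWQV3.

```python
def dyadic_approx(scale, shift=24):
    """Approximate a real-valued requantization scale by b / 2**shift, so that
    rescaling an integer accumulator needs only an integer multiply and a shift."""
    b = round(scale * (1 << shift))
    return b, shift

acc = 12_345                        # INT32 accumulator of an INT8 x INT8 layer
scale = 0.00123                     # hypothetical real-valued requantization factor
b, c = dyadic_approx(scale)
requantized = (acc * b) >> c        # integer-only rescaling, no floating point
print(requantized, round(acc * scale))   # both print 15
```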
- Leveraging Automated Mixed-Low-Precision Quantization for tiny edge microcontrollers [76.30674794049293]
This paper presents an automated mixed-precision quantization flow based on the HAQ framework but tailored for the memory and computational characteristics of MCU devices.
Specifically, a Reinforcement Learning agent searches for the best uniform quantization levels, among 2, 4, 8 bits, of individual weight and activation tensors.
Given an MCU-class memory bound of 2MB for weight-only quantization, the compressed models produced by the mixed-precision engine are as accurate as the state-of-the-art solutions.
arXiv Detail & Related papers (2020-08-12T06:09:58Z)
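The sketch below only checks the 2MB weight-memory constraint that the Reinforcement Learning agent searches under, for a hypothetical per-layer bit-width assignment drawn from {2, 4, 8}; the layer sizes are invented, and the agent and its accuracy feedback are not modeled.

```python
MEMORY_BOUND_BYTES = 2 * 1024 * 1024      # the 2MB weight budget quoted above

def weight_memory_bytes(layer_sizes, bitwidths):
    """Weight storage for a per-layer bit-width assignment drawn from {2, 4, 8}."""
    return sum(n * b / 8 for n, b in zip(layer_sizes, bitwidths))

# Hypothetical CNN: per-layer parameter counts (invented for illustration).
layer_sizes = [3 * 3 * 3 * 32, 3 * 3 * 32 * 64, 3 * 3 * 64 * 128, 128 * 1000]
layer_sizes += [3 * 3 * 128 * 128] * 20

uniform_8bit = [8] * len(layer_sizes)
mixed_policy = [8, 8, 4, 4] + [4] * 20    # e.g. keep the early layers at 8 bits

for name, bits in (("uniform INT8", uniform_8bit), ("mixed 8/4-bit", mixed_policy)):
    mem = weight_memory_bytes(layer_sizes, bits)
    print(f"{name}: {mem / 1024:.0f} KiB, fits the 2MB bound: {mem <= MEMORY_BOUND_BYTES}")
```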