Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed
FP16-INT8 Post-Training Quantization
- URL: http://arxiv.org/abs/2210.07692v1
- Date: Fri, 14 Oct 2022 10:32:05 GMT
- Title: Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed
FP16-INT8 Post-Training Quantization
- Authors: Manuele Rusci, Marco Fariselli, Martin Croome, Francesco Paci, Eric
Flamand
- Abstract summary: Speech Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) are deployed on a state-of-the-art MicroController Unit (MCU).
We propose an optimized software pipeline interleaving parallel computation of LSTM or GRU recurrent blocks with manually-managed memory transfers.
Experiments are conducted on multiple LSTM and GRU based SE models trained on the Valentini dataset, featuring up to 1.24M parameters.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an optimized methodology to design and deploy Speech
Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) on a
state-of-the-art MicroController Unit (MCU), with 1+8 general-purpose RISC-V
cores. To achieve low-latency execution, we propose an optimized software
pipeline interleaving parallel computation of LSTM or GRU recurrent blocks,
featuring vectorized 8-bit integer (INT8) and 16-bit floating-point (FP16)
compute units, with manually-managed memory transfers of model parameters. To
ensure minimal accuracy degradation with respect to the full-precision models,
we propose a novel FP16-INT8 Mixed-Precision Post-Training Quantization (PTQ)
scheme that compresses the recurrent layers to 8-bit while the bit precision of
remaining layers is kept to FP16. Experiments are conducted on multiple LSTM
and GRU based SE models trained on the Valentini dataset, featuring up to 1.24M
parameters. Thanks to the proposed approaches, we speed up the computation by
up to 4x with respect to the lossless FP16 baselines. Unlike a
uniform 8-bit quantization, which degrades the PESQ score by 0.3 on average, the
Mixed-Precision PTQ scheme leads to a low degradation of only 0.06, while
achieving a 1.4-1.7x memory saving. Thanks to this compression, we cut the
power cost of the external memory by fitting the large models on the limited
on-chip non-volatile memory and we gain an MCU power saving of up to 2.5x by
reducing the supply voltage from 0.8V to 0.65V while still matching the
real-time constraints. Our design is 10x more energy efficient than
state-of-the-art SE solutions deployed on single-core MCUs that make use of
smaller models and quantization-aware training.
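
The mixed-precision idea in the abstract (recurrent layers compressed to 8-bit, remaining layers kept in FP16) can be illustrated with a minimal sketch. This is not the authors' implementation: the symmetric per-tensor INT8 scheme, the function names, and the layer sizes below are all assumptions chosen for illustration, and the abstract does not specify the quantizer that the paper uses.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization (an assumed scheme).

    Returns the integer codes in [-127, 127] and the scale such that
    weight ~= code * scale.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map INT8 codes back to approximate real-valued weights."""
    return [c * scale for c in codes]


def model_bytes(layers, recurrent_bits, other_bits=16):
    """Weight storage in bytes.

    `layers` maps a layer name to (parameter_count, is_recurrent).
    Recurrent layers use `recurrent_bits` per weight, the rest `other_bits`.
    """
    total_bits = 0
    for count, is_recurrent in layers.values():
        total_bits += count * (recurrent_bits if is_recurrent else other_bits)
    return total_bits // 8


# Hypothetical model layout, NOT taken from the paper.
layers = {
    "fc_in": (100_000, False),
    "lstm": (800_000, True),
    "fc_out": (100_000, False),
}

fp16_size = model_bytes(layers, recurrent_bits=16)   # everything in FP16
mixed_size = model_bytes(layers, recurrent_bits=8)   # recurrent part in INT8
print("FP16 bytes:", fp16_size)
print("Mixed FP16-INT8 bytes:", mixed_size)
print("Memory saving: %.2fx" % (fp16_size / mixed_size))
```

The saving depends on how much of the parameter count sits in the recurrent layers: the larger that share, the closer the ratio gets to 2x. Per-channel scales or calibrated activation ranges would be natural refinements, but they are outside what the abstract describes.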
Related papers
- BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models [56.504879072674015]
We propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients.
BPDQ enables serving Qwen2.5-72B on a single GTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit).
arXiv Detail & Related papers (2026-02-04T02:54:37Z) - BAPS: A Fine-Grained Low-Precision Scheme for Softmax in Attention via Block-Aware Precision reScaling [12.43240392025487]
We introduce a novel low-precision workflow that employs a specific 8-bit floating-point format (HiF8) and block-aware precision rescaling for softmax.
Our algorithmic innovations make low-precision softmax feasible without the significant model accuracy loss.
Our work paves the way for doubling end-to-end inference throughput without increasing chip area.
arXiv Detail & Related papers (2026-02-02T13:12:18Z) - ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs [4.431548809730958]
ARCQuant is a framework that boosts NVFP4 performance via Augmented Residual Channels.
We show that ARCQuant achieves state-of-the-art accuracy, comparable to full-precision baselines in perplexity and downstream tasks.
arXiv Detail & Related papers (2026-01-12T12:27:22Z) - EdgeFlex-Transformer: Transformer Inference for Edge Devices [2.1130318406254074]
We propose a lightweight yet effective multi-stage optimization pipeline designed to compress and accelerate Vision Transformers (ViTs).
Our methodology combines activation profiling, memory-aware pruning, selective mixed-precision execution, and activation-aware quantization (AWQ) to reduce the model's memory footprint without requiring costly retraining or task-specific fine-tuning.
Experiments on CIFAR-10 demonstrate that the fully optimized model achieves a 76% reduction in peak memory usage and over 6x lower latency, while retaining or even improving accuracy compared to the original FP32 baseline.
arXiv Detail & Related papers (2025-12-17T21:45:12Z) - FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error [3.281844093101284]
Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands.
We propose FP8-Flow-MoE, a training recipe featuring a quantization-consistent FP8-centric dataflow with a scaling-aware computation and fused FP8 operators.
arXiv Detail & Related papers (2025-11-04T06:36:59Z) - Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization [77.67818998672516]
We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization.
We introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm.
We show that MR-GPTQ matches or outperforms state-of-the-art accuracy.
arXiv Detail & Related papers (2025-09-27T09:22:21Z) - MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization [16.83403134551842]
Recent few-step diffusion models reduce the inference time by reducing the denoising steps.
The Post Training Quantization (PTQ) replaces high bit-width FP representation with low-bit integer values.
However, when applying to few-step diffusion models, existing quantization methods face challenges in preserving both the image quality and text alignment.
arXiv Detail & Related papers (2024-05-28T06:50:58Z) - Optimizing the Deployment of Tiny Transformers on Low-Power MCUs [12.905978154498499]
This work aims to enable and optimize the flexible, multi-platform deployment of encoder Tiny Transformers on commercial MCUs.
Our framework provides an optimized library of kernels to maximize data reuse and avoid data marshaling operations into the crucial attention block.
We show that our MHSA depth-first tiling scheme reduces the memory peak by up to 6.19x, while the fused-weight attention can reduce the runtime by 1.53x, and number of parameters by 25%.
arXiv Detail & Related papers (2024-04-03T14:14:08Z) - KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [67.74400574357472]
LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference.
Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision.
Our work, KVQuant, facilitates low precision KV cache quantization by incorporating several novel methods.
arXiv Detail & Related papers (2024-01-31T18:58:14Z) - Reduced Precision Floating-Point Optimization for Deep Neural Network
On-Device Learning on MicroControllers [15.37318446043671]
This paper introduces a novel reduced precision optimization technique for On-Device Learning (ODL) primitives on MCU-class devices.
Our approach is more than two orders of magnitude faster than existing ODL software frameworks for single-core MCUs.
arXiv Detail & Related papers (2023-05-30T16:14:16Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures
using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - Quantized Neural Networks for Low-Precision Accumulation with Guaranteed
Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
arXiv Detail & Related papers (2023-01-31T02:46:57Z) - Accelerating Inference and Language Model Fusion of Recurrent Neural
Network Transducers via End-to-End 4-bit Quantization [35.198615417316056]
We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T).
We use a 4 bit integer representation for both weights and activations and apply Quantization Aware Training (QAT) to retrain the full model.
We show that customized quantization schemes that are tailored to the local properties of the network are essential to achieve good performance.
arXiv Detail & Related papers (2022-06-16T02:17:49Z) - 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z) - HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
arXiv Detail & Related papers (2020-11-20T23:51:43Z) - Leveraging Automated Mixed-Low-Precision Quantization for tiny edge
microcontrollers [76.30674794049293]
This paper presents an automated mixed-precision quantization flow based on the HAQ framework but tailored for the memory and computational characteristics of MCU devices.
Specifically, a Reinforcement Learning agent searches for the best uniform quantization levels, among 2, 4, 8 bits, of individual weight and activation tensors.
Given an MCU-class memory bound of 2MB for weight-only quantization, the compressed models produced by the mixed-precision engine are as accurate as the state-of-the-art solutions.
arXiv Detail & Related papers (2020-08-12T06:09:58Z)