Leveraging Automated Mixed-Low-Precision Quantization for tiny edge
microcontrollers
- URL: http://arxiv.org/abs/2008.05124v1
- Date: Wed, 12 Aug 2020 06:09:58 GMT
- Title: Leveraging Automated Mixed-Low-Precision Quantization for tiny edge
microcontrollers
- Authors: Manuele Rusci, Marco Fariselli, Alessandro Capotondi, Luca Benini
- Abstract summary: This paper presents an automated mixed-precision quantization flow based on the HAQ framework but tailored for the memory and computational characteristics of MCU devices.
Specifically, a Reinforcement Learning agent searches for the best uniform quantization levels, among 2, 4, 8 bits, of individual weight and activation tensors.
Given an MCU-class memory bound of 2MB for weight-only quantization, the compressed models produced by the mixed-precision engine are as accurate as the state-of-the-art solutions.
- Score: 76.30674794049293
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The severe on-chip memory limitations are currently preventing the deployment
of the most accurate Deep Neural Network (DNN) models on tiny MicroController
Units (MCUs), even if leveraging an effective 8-bit quantization scheme. To
tackle this issue, in this paper we present an automated mixed-precision
quantization flow based on the HAQ framework but tailored for the memory and
computational characteristics of MCU devices. Specifically, a Reinforcement
Learning agent searches for the best uniform quantization levels, among 2, 4, 8
bits, of individual weight and activation tensors, under the tight constraints
on RAM and FLASH embedded memory sizes. We conduct an experimental analysis on
MobileNetV1, MobileNetV2 and MNasNet models for ImageNet classification.
Concerning the quantization policy search, the RL agent selects quantization
policies that maximize memory utilization. Given an MCU-class memory bound
of 2MB for weight-only quantization, the compressed models produced by the
mixed-precision engine are as accurate as state-of-the-art solutions
quantized with a non-uniform function, which is not tailored for CPUs featuring
integer-only arithmetic. This demonstrates the viability of uniform
quantization, required for MCU deployment, for compressing deep network
weights. When the activation memory budget is also limited to 512kB, the best
MobileNetV1 model scores up to 68.4% on ImageNet thanks to the discovered
quantization policy, making it 4% more accurate than other 8-bit networks
fitting the same memory constraints.
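To make these constraints concrete, here is a minimal sketch (not the authors'
code; the layer sizes and the peak-RAM estimate below are illustrative
assumptions) of per-tensor uniform quantization at 2, 4, or 8 bits together
with a feasibility check against the 2MB FLASH and 512kB RAM budgets described
above.

```python
import numpy as np

def quantize_uniform(x, n_bits):
    """Symmetric uniform quantization of a tensor to n_bits integers.
    Returns the integer codes and the scale needed to dequantize."""
    qmax = 2 ** (n_bits - 1) - 1              # 127, 7 or 1 for 8/4/2 bits
    scale = max(np.abs(x).max(), 1e-12) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def fits_mcu_budget(weight_elems, act_elems, w_bits, a_bits,
                    flash_limit=2 * 1024 * 1024, ram_limit=512 * 1024):
    """Check a per-tensor bit-width assignment against FLASH/RAM budgets.
    weight_elems / act_elems: element counts per tensor (hypothetical values).
    w_bits / a_bits: chosen bit-widths (2, 4 or 8) per tensor."""
    flash_bytes = sum(n * b for n, b in zip(weight_elems, w_bits)) / 8
    # Peak RAM is crudely approximated as twice the largest activation tensor
    # (input + output buffers); a real deployment flow computes the true peak.
    ram_bytes = 2 * max(n * b for n, b in zip(act_elems, a_bits)) / 8
    return flash_bytes <= flash_limit and ram_bytes <= ram_limit

# Hypothetical two-layer example: 8-bit and 4-bit weights, 4-bit activations
weights = [np.random.randn(32, 3, 3, 3), np.random.randn(64, 32, 1, 1)]
codes = [quantize_uniform(w, b) for w, b in zip(weights, [8, 4])]
print(fits_mcu_budget([w.size for w in weights],
                      [32 * 112 * 112, 64 * 56 * 56], [8, 4], [4, 4]))
```

A search such as the RL agent described above would repeatedly propose
per-tensor bit-width assignments, discard those that fail a check like
fits_mcu_budget, and fine-tune the surviving candidates to measure accuracy.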
Related papers
- FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search [50.07268323597872]
We propose the first one-shot mixed-precision quantization search that eliminates the need for retraining in both integer and low-precision floating point models.
With integer models, we increase the accuracy of ResNet-18 on ImageNet by 1.31% and ResNet-50 by 0.90% with equivalent model cost over previous methods.
For the first time, we explore a novel mixed-precision floating-point search and improve MobileNetV2 by up to 0.98% compared to prior state-of-the-art FP8 models.
arXiv Detail & Related papers (2023-08-07T04:17:19Z)
- MINT: Multiplier-less INTeger Quantization for Energy Efficient Spiking Neural Networks [20.473852621915956]
We propose a uniform quantization scheme that efficiently compresses weights and membrane potentials in spiking neural networks (SNNs).
MINT quantizes membrane potentials to an extremely low precision (2-bit), significantly reducing the memory footprint.
Experimental results show that our method matches the accuracy of full-precision models and other state-of-the-art SNN quantization techniques.
arXiv Detail & Related papers (2023-05-16T23:38:35Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed FP16-INT8 Post-Training Quantization [0.0]
Speech Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) are deployed on a state-of-the-art MicroController Unit (MCU).
We propose an optimized software pipeline interleaving parallel computation of LSTM or GRU recurrent blocks with manually-managed memory transfers.
Experiments are conducted on multiple LSTM and GRU based SE models trained on the Valentini dataset, featuring up to 1.24M parameters.
arXiv Detail & Related papers (2022-10-14T10:32:05Z)
- Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets [7.5195830365852085]
We propose a novel sub 8-bit quantization aware training algorithm for all components of a 250K parameter feedforward, streaming, state-free keyword spotting model.
We conduct large scale experiments, training on 26,000 hours of de-identified production, far-field and near-field audio data.
arXiv Detail & Related papers (2022-07-13T17:46:08Z)
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory.
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states (a simplified block-wise quantization sketch is given after this list).
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
arXiv Detail & Related papers (2020-11-20T23:51:43Z)
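As noted in the "8-bit Optimizers via Block-wise Quantization" entry above,
here is a minimal sketch of block-wise quantization of an optimizer state
tensor. This is not that paper's implementation: the block size is an arbitrary
assumption and the paper's dynamic quantization map is replaced by plain linear
scaling. Giving each block its own scale keeps a single outlier from degrading
the precision of the whole tensor.

```python
import numpy as np

BLOCK = 2048  # block size is an assumption, not a value taken from the paper

def blockwise_quantize(state, block=BLOCK):
    """Quantize a float optimizer-state tensor to int8 in independent blocks.
    Each block stores its own scale, so an outlier only degrades the block
    it lives in (linear scaling stands in for the paper's dynamic map)."""
    flat = state.ravel().astype(np.float32)
    pad = (-flat.size) % block
    flat = np.concatenate([flat, np.zeros(pad, dtype=np.float32)])
    blocks = flat.reshape(-1, block)
    scales = np.maximum(np.abs(blocks).max(axis=1, keepdims=True), 1e-12) / 127.0
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales, state.shape

def blockwise_dequantize(q, scales, shape):
    """Reconstruct the float tensor from int8 codes and per-block scales."""
    flat = (q.astype(np.float32) * scales).ravel()
    return flat[: int(np.prod(shape))].reshape(shape)

# Hypothetical usage: Adam's second-moment estimate for one weight tensor
v = (np.random.randn(64, 128).astype(np.float32) ** 2) * 1e-3
q, s, shape = blockwise_quantize(v)
v_hat = blockwise_dequantize(q, s, shape)
print("max abs reconstruction error:", np.abs(v - v_hat).max())
```

The original work additionally replaces the linear grid used here with a
non-linear dynamic quantization map per block, which the paper credits for
matching 32-bit optimizer behaviour.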
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.