Value-Driven Mixed-Precision Quantization for Patch-Based Inference on
Microcontrollers
- URL: http://arxiv.org/abs/2401.13714v1
- Date: Wed, 24 Jan 2024 04:21:41 GMT
- Title: Value-Driven Mixed-Precision Quantization for Patch-Based Inference on
Microcontrollers
- Authors: Wei Tao, Shenglin He, Kai Lu, Xiaoyang Qu, Guokuan Li, Jiguang Wan,
Jianzong Wang, Jing Xiao
- Abstract summary: QuantMCU is a novel patch-based inference method that utilizes value-driven mixed-precision quantization to reduce redundant computation.
We show that QuantMCU can reduce computation by 2.2x on average while maintaining comparable model accuracy.
- Score: 35.666772630923234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deploying neural networks on microcontroller units (MCUs) presents
substantial challenges due to their constrained computation and memory
resources. Previous research has explored patch-based inference as a
strategy to conserve memory without sacrificing model accuracy. However, this
technique suffers from severe redundant computation overhead, leading to a
substantial increase in execution latency. A feasible solution to address this
issue is mixed-precision quantization, but it faces the challenges of accuracy
degradation and a time-consuming search process. In this paper, we propose
QuantMCU, a novel patch-based inference method that utilizes value-driven
mixed-precision quantization to reduce redundant computation. We first utilize
value-driven patch classification (VDPC) to maintain the model accuracy. VDPC
classifies patches into two classes based on whether they contain outlier
values. For patches containing outlier values, we apply 8-bit quantization to
the feature maps on the dataflow branches that follow. In addition, for patches
without outlier values, we utilize value-driven quantization search (VDQS) on
the feature maps of their following dataflow branches to reduce search time.
Specifically, VDQS introduces a novel quantization search metric that takes
into account both computation and accuracy, and it employs entropy as an
accuracy representation to avoid additional training. VDQS also adopts an
iterative approach to determine the bitwidth of each feature map to further
accelerate the search process. Experimental results on real-world MCU devices
show that QuantMCU reduces computation by 2.2x on average while maintaining
accuracy comparable to state-of-the-art patch-based inference methods.
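Below is a minimal sketch, in Python/NumPy, of how the two mechanisms described in the abstract could fit together. Everything beyond what the abstract states is an assumption: the outlier rule and its threshold, the candidate bitwidths, the uniform quantizer, and the exact cost that weighs entropy (the accuracy proxy) against bitwidth (the computation proxy) are all illustrative, not the paper's published design.

```python
import numpy as np

BITWIDTHS = (4, 6, 8)  # assumed candidate bitwidths for the search

def has_outliers(patch: np.ndarray, k: float = 3.0) -> bool:
    """VDPC: flag a patch if any value lies more than k standard deviations
    from the patch mean (both the rule and k are assumptions)."""
    mu, sigma = patch.mean(), patch.std() + 1e-8
    return bool(np.any(np.abs(patch - mu) > k * sigma))

def entropy(x: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy of the value histogram: the training-free accuracy
    proxy the abstract says VDQS uses instead of extra training."""
    hist, _ = np.histogram(x, bins=bins)
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Uniform symmetric fake-quantization (a common choice; the paper's
    exact scheme may differ)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax + 1e-12
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def vdqs_bitwidth(fmap: np.ndarray, alpha: float = 1.0, beta: float = 0.05) -> int:
    """VDQS-style search for one feature map: minimize an assumed cost that
    trades entropy lost to quantization against the bitwidth itself."""
    h_full = entropy(fmap)
    best_bits, best_cost = max(BITWIDTHS), float("inf")
    for bits in BITWIDTHS:
        info_loss = abs(h_full - entropy(quantize(fmap, bits)))
        cost = alpha * info_loss + beta * bits
        if cost < best_cost:
            best_bits, best_cost = bits, cost
    return best_bits

def assign_bitwidth(patch: np.ndarray) -> int:
    """Patches with outlier values keep 8-bit feature maps (VDPC);
    the rest get a searched, possibly lower, bitwidth (VDQS)."""
    return 8 if has_outliers(patch) else vdqs_bitwidth(patch)

# A patch with an injected outlier is always pinned to 8 bits by VDPC;
# an outlier-free patch gets whatever bitwidth the VDQS cost selects.
rng = np.random.default_rng(0)
clean = rng.uniform(-1.0, 1.0, size=(3, 16, 16)).astype(np.float32)
spiky = clean.copy()
spiky[0, 0, 0] = 50.0
print(assign_bitwidth(clean), assign_bitwidth(spiky))
```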
Related papers
- FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search [50.07268323597872]
We propose the first one-shot mixed-precision quantization search that eliminates the need for retraining in both integer and low-precision floating point models.
With integer models, we increase the accuracy of ResNet-18 on ImageNet by 1.31% and ResNet-50 by 0.90% at equivalent model cost relative to previous methods.
For the first time, we explore a novel mixed-precision floating-point search and improve MobileNetV2 by up to 0.98% compared to prior state-of-the-art FP8 models.
arXiv Detail & Related papers (2023-08-07T04:17:19Z)
- Towards Model-Size Agnostic, Compute-Free, Memorization-based Inference of Deep Learning [5.41530201129053]
This paper proposes a novel memorization-based inference (MBI) that is compute-free and requires only lookups.
Specifically, our work capitalizes on the inference mechanism of the recurrent attention model (RAM).
By leveraging the low dimensionality of glimpses, our inference procedure stores key-value pairs comprising glimpse location, patch vector, etc. in a table.
Computation is obviated during inference by using the table to read out stored key-value pairs, performing compute-free inference by memorization.
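As a rough illustration of that lookup mechanism, the sketch below (Python) memorizes outputs keyed on a quantized (glimpse location, glimpse vector) pair and answers queries with a single dictionary read. The class and method names (`MemorizationTable`, `memorize`, `infer`) and the key construction are hypothetical, not the paper's interface.

```python
import numpy as np

class MemorizationTable:
    """Hypothetical MBI-style key-value store: all compute happens while the
    table is populated; inference itself is one lookup."""

    def __init__(self, levels: int = 16):
        self.levels = levels  # how coarsely glimpse vectors are bucketed into keys
        self.table = {}

    def _key(self, location, glimpse):
        # Quantize the low-dimensional glimpse so nearby glimpses collide
        # onto the same key and share a memorized output.
        bucketed = tuple(np.floor(np.asarray(glimpse) * self.levels).astype(int).tolist())
        return (tuple(location), bucketed)

    def memorize(self, location, glimpse, output):
        """Offline phase: run the real model once and store its output."""
        self.table[self._key(location, glimpse)] = output

    def infer(self, location, glimpse, default=None):
        """Online phase: compute-free inference by memorization."""
        return self.table.get(self._key(location, glimpse), default)

# Memorize one glimpse, then look up a nearby glimpse that lands in the same bucket.
mbi = MemorizationTable(levels=8)
mbi.memorize((2, 3), np.array([0.10, 0.50]), output="class_7")
print(mbi.infer((2, 3), np.array([0.11, 0.52])))  # -> class_7
```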
arXiv Detail & Related papers (2023-07-14T21:01:59Z)
- Augmenting Hessians with Inter-Layer Dependencies for Mixed-Precision Post-Training Quantization [7.392278887917975]
We propose a mixed-precision post-training quantization approach that assigns different numerical precisions to tensors in a network based on their specific needs.
Our experiments demonstrate latency reductions of 25.48%, 21.69%, and 33.28%, respectively, compared to a 16-bit baseline.
arXiv Detail & Related papers (2023-06-08T02:18:58Z)
- AMED: Automatic Mixed-Precision Quantization for Edge Devices [3.5223695602582614]
Quantized neural networks are well known for reducing latency, power consumption, and model size without significant harm to performance.
Mixed-precision quantization offers better utilization of customized hardware that supports arithmetic operations at different bitwidths.
arXiv Detail & Related papers (2022-05-30T21:23:22Z)
- Mixed-Precision Neural Network Quantization via Learned Layer-wise Importance [50.00102219630088]
In mixed-precision quantization (MPQ), it is hard to determine the optimal bit-width for each layer.
We propose a joint training scheme that can obtain all indicators at once.
For example, MPQ search on ResNet18 with our indicators takes only 0.06 seconds.
arXiv Detail & Related papers (2022-03-16T03:23:50Z)
- StreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of KRR require that all the data be stored in main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
arXiv Detail & Related papers (2021-08-23T21:03:09Z)
- Effective and Fast: A Novel Sequential Single Path Search for Mixed-Precision Quantization [45.22093693422085]
Mixed-precision quantization models can match different bit-precisions to different layers according to their sensitivity, achieving strong performance.
Quickly determining the quantization bit-precision of each layer of a deep neural network under given constraints is a difficult problem.
We propose a novel sequential single path search (SSPS) method for mixed-precision quantization.
arXiv Detail & Related papers (2021-03-04T09:15:08Z)
- DAQ: Distribution-Aware Quantization for Deep Image Super-Resolution Networks [49.191062785007006]
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational costs.
Existing works either suffer from a severe performance drop at ultra-low precisions of 4 bits or fewer, or require a heavy fine-tuning process to recover performance.
We propose a novel distribution-aware quantization scheme (DAQ) which facilitates accurate training-free quantization in ultra-low precision.
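One plausible reading of "distribution-aware" is that each channel is normalized by its own statistics before a shared low-bit grid is applied, so channels with very different value ranges are not quantized on one global scale. The Python sketch below follows that reading; the per-channel mean/std normalization and the assumed +/-3-sigma clipping range are illustrative guesses, not DAQ's published formulation.

```python
import numpy as np

def distribution_aware_quantize(fmap: np.ndarray, bits: int = 4) -> np.ndarray:
    """Quantize a (C, H, W) feature map channel by channel in a normalized
    space, then map values back to the original scale (details assumed)."""
    qmax = 2 ** (bits - 1) - 1
    out = np.empty_like(fmap, dtype=float)
    for c in range(fmap.shape[0]):
        mu, sigma = fmap[c].mean(), fmap[c].std() + 1e-8
        z = (fmap[c] - mu) / sigma                    # distribution-aware normalization
        z = np.clip(z, -3.0, 3.0)                     # assumed +/-3-sigma range
        zq = np.round(z / 3.0 * qmax) * 3.0 / qmax    # uniform low-bit grid in z-space
        out[c] = zq * sigma + mu                      # de-normalize
    return out
```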
arXiv Detail & Related papers (2020-12-21T10:19:42Z)
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to eliminate floating-point computation.
Our AQD achieves performance comparable to or even better than its full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.