Multi-objective Recurrent Neural Networks Optimization for the Edge -- a
Quantization-based Approach
- URL: http://arxiv.org/abs/2108.01192v1
- Date: Mon, 2 Aug 2021 22:09:12 GMT
- Authors: Nesma M. Rezk, Tomas Nordström, Dimitrios Stathis, Zain Ul-Abdin,
Eren Erdal Aksoy, Ahmed Hemani
- Abstract summary: This article introduces a Multi-Objective Hardware-Aware Quantization (MOHAQ) method, which considers both hardware efficiency and inference error as objectives for mixed-precision quantization.
We propose a search technique named "beacon-based search" that retrains only selected solutions in the search space and uses them as beacons to estimate the effect of retraining on other solutions.
- Score: 2.1987431057890467
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The compression of deep learning models is of fundamental importance for deploying such models to edge devices. Incorporating the hardware model and application constraints during compression maximizes the benefits but ties the result to one specific case, so the compression needs to be automated. Searching for the optimal compression-method parameters is treated as an optimization problem. This article introduces a Multi-Objective
Hardware-Aware Quantization (MOHAQ) method, which considers both hardware
efficiency and inference error as objectives for mixed-precision quantization.
The proposed method makes the evaluation of candidate solutions in a large
search space feasible by relying on two steps. First, post-training
quantization is applied for fast solution evaluation. Second, we propose a
search technique named "beacon-based search" that retrains only selected solutions in the search space and uses them as beacons to estimate the effect of retraining on
other solutions. To evaluate the optimization potential, we chose a speech
recognition model using the TIMIT dataset. The model is based on the Simple Recurrent Unit (SRU) because of its considerable speedup over other recurrent units. We applied our method to two platforms: SiLago and Bitfusion.
Experimental evaluations showed that SRU can be compressed up to 8x by
post-training quantization without any significant increase in the error and up
to 12x with only a 1.5 percentage point increase in error. On SiLago, the
inference-only search found solutions that achieve 80% and 64% of the maximum possible speedup and energy saving, respectively, with a 0.5 percentage point increase in the error. On Bitfusion, under a small-SRAM constraint, beacon-based search reduced the error increase of the inference-only search by 4 percentage points and raised the achievable speedup to 47x over the Bitfusion baseline.
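To make the two-step evaluation concrete, here is a minimal Python sketch of the idea rather than the authors' implementation: candidate per-layer bit-widths are scored by post-training quantization (objective 1) and a hardware cost model (objective 2), and a few retrained "beacon" solutions correct the PTQ error estimates of nearby candidates. The `evaluate` callback, the bit-count cost model standing in for the real SiLago/Bitfusion models, and the toy random search standing in for a genetic multi-objective optimizer are all assumptions.

```python
import numpy as np

def quantize(w, bits):
    """Uniform symmetric post-training quantization of one weight tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax if np.abs(w).max() > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def inference_error(weights, bit_widths, evaluate):
    """Objective 1: task error after PTQ (fast, no retraining)."""
    quantized = [quantize(w, b) for w, b in zip(weights, bit_widths)]
    return evaluate(quantized)  # e.g. phone error rate on a validation set

def hardware_cost(weights, bit_widths):
    """Objective 2: stand-in hardware model -- total weight bits. A real
    platform model would supply predicted latency or energy instead."""
    return sum(w.size * b for w, b in zip(weights, bit_widths))

def beacon_corrected_error(solution, ptq_error, beacons):
    """Beacon-based search: retrain only a few 'beacon' solutions, then use
    the nearest beacon's (retrained - PTQ) error gap to estimate the
    retrained error of any other candidate."""
    nearest = min(beacons, key=lambda b: sum(abs(x - y)
                  for x, y in zip(solution, b["bits"])))
    return ptq_error + (nearest["retrained_error"] - nearest["ptq_error"])

def random_pareto_search(weights, evaluate, candidates=(2, 4, 8),
                         iters=200, seed=0):
    """Toy random multi-objective search keeping a non-dominated front."""
    rng = np.random.default_rng(seed)
    front = []  # list of (bit_widths, error, cost)
    for _ in range(iters):
        bits = tuple(rng.choice(candidates) for _ in weights)
        err = inference_error(weights, bits, evaluate)
        cost = hardware_cost(weights, bits)
        if not any(e <= err and c <= cost for _, e, c in front):
            # drop points the new candidate dominates, then keep it
            front = [(b, e, c) for b, e, c in front
                     if not (err <= e and cost <= c)]
            front.append((bits, err, cost))
    return front
```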
Related papers
- Joint Pruning and Channel-wise Mixed-Precision Quantization for Efficient Deep Neural Networks [10.229120811024162]
The computational demands of deep neural networks (DNNs) pose significant challenges to their deployment on edge devices.
Common approaches to address this issue are pruning and mixed-precision quantization.
We propose a novel methodology to apply them jointly via a lightweight gradient-based search.
arXiv Detail & Related papers (2024-07-01T08:07:02Z)
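One common realization of a lightweight gradient-based bit-width search is differentiable precision selection: each layer mixes quantized copies of its weights using softmax weights over learnable logits, so the precision choice is trained jointly with the network. This hedged PyTorch sketch illustrates that general pattern, not necessarily the paper's exact formulation; the 2/4/8-bit candidate set and all names are assumptions.

```python
import torch
import torch.nn as nn

def fake_quant(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform fake quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (q - w).detach()  # STE: identity gradient through rounding

class MixedPrecisionLinear(nn.Module):
    """Linear layer whose effective weight is a softmax-weighted mix of
    quantized candidates; the logits alpha are learned with the weights."""
    def __init__(self, in_f, out_f, candidates=(2, 4, 8)):
        super().__init__()
        self.linear = nn.Linear(in_f, out_f)
        self.candidates = candidates
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))

    def forward(self, x):
        probs = torch.softmax(self.alpha, dim=0)
        w = sum(p * fake_quant(self.linear.weight, b)
                for p, b in zip(probs, self.candidates))
        return nn.functional.linear(x, w, self.linear.bias)

    def expected_bits(self):
        """Differentiable model-size term to penalize in the loss."""
        probs = torch.softmax(self.alpha, dim=0)
        return sum(p * b for p, b in zip(probs, self.candidates)) \
            * self.linear.weight.numel()
```

In training one would add a penalty such as `lambda_cost * sum(m.expected_bits() for m in model.modules() if isinstance(m, MixedPrecisionLinear))` to the task loss, then freeze each layer to its highest-probability bit-width after convergence.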
- Improved Sparse Ising Optimization [0.0]
This report presents new data demonstrating significantly higher performance on some longstanding benchmark problems with up to 20,000 variables.
Relative to leading reported combinations of speed and accuracy, a proof-of-concept implementation reached targets 2-4 orders of magnitude faster.
The data suggest exciting possibilities for pushing the sparse Ising performance frontier to potentially strengthen algorithm portfolios, AI toolkits and decision-making systems.
arXiv Detail & Related papers (2023-11-15T17:59:06Z)
- FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search [50.07268323597872]
We propose the first one-shot mixed-precision quantization search that eliminates the need for retraining in both integer and low-precision floating point models.
With integer models, we increase the accuracy of ResNet-18 on ImageNet by 1.31% and ResNet-50 by 0.90% with equivalent model cost over previous methods.
For the first time, we explore a novel mixed-precision floating-point search and improve MobileNetV2 by up to 0.98% compared to prior state-of-the-art FP8 models.
arXiv Detail & Related papers (2023-08-07T04:17:19Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth rather than compute for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
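The Dense-and-Sparse decomposition in (ii) can be illustrated with a short NumPy sketch: a small fraction of large-magnitude outlier weights is kept at full precision in a sparse matrix, and the remaining dense part is quantized to a very low bit-width. Uniform quantization stands in here for SqueezeLLM's sensitivity-based non-uniform scheme, and the 0.5% outlier fraction is an assumption.

```python
import numpy as np
from scipy.sparse import coo_matrix

def dense_and_sparse(W, bits=3, outlier_frac=0.005):
    """Split W into a low-bit dense part and a full-precision sparse part."""
    k = max(1, int(W.size * outlier_frac))
    threshold = np.partition(np.abs(W).ravel(), -k)[-k]
    mask = np.abs(W) >= threshold                 # outlier positions
    sparse = coo_matrix(np.where(mask, W, 0.0))   # full-precision outliers
    dense = np.where(mask, 0.0, W)                # remainder to quantize
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(dense).max() / qmax if np.abs(dense).max() > 0 else 1.0
    dense_q = np.clip(np.round(dense / scale), -qmax - 1, qmax).astype(np.int8)
    return dense_q, scale, sparse

def reconstruct(dense_q, scale, sparse):
    """Approximate the original weights from the two components."""
    return dense_q.astype(np.float32) * scale + sparse.toarray()
```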
- SPDY: Accurate Pruning with Speedup Guarantees [29.284147465251685]
SPDY is a new compression method which automatically determines layer-wise sparsity targets that achieve a desired inference speedup.
We show that SPDY guarantees speedups while recovering higher accuracy relative to existing strategies, both for one-shot and gradual pruning scenarios.
We also extend our approach to the recently-proposed task of pruning with very little data, where we achieve the best known accuracy recovery when pruning to the GPU-supported 2:4 sparsity pattern.
arXiv Detail & Related papers (2022-01-31T10:14:31Z)
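At its core, this family of methods solves an allocation problem: pick a per-layer sparsity level so that a runtime model predicts the desired speedup while an accuracy proxy stays small. The dynamic-programming sketch below is a generic formulation under assumed profiling inputs, not SPDY's exact solver.

```python
def allocate_sparsity(layer_profiles, time_budget, resolution=100):
    """layer_profiles: for each layer, a list of (sparsity, time, error_proxy)
    options measured by profiling. Returns one chosen option index per layer
    such that total time <= time_budget and the summed error proxy is minimal."""
    # DP over a discretized time budget: best[t] = (error sum, picks so far)
    best = {0: (0.0, [])}
    for options in layer_profiles:
        new_best = {}
        for t, (err, picks) in best.items():
            for i, (_sparsity, time, e) in enumerate(options):
                t2 = t + int(round(time / time_budget * resolution))
                if t2 > resolution:
                    continue  # this choice would exceed the runtime budget
                cand = (err + e, picks + [i])
                if t2 not in new_best or cand[0] < new_best[t2][0]:
                    new_best[t2] = cand
        best = new_best
    if not best:
        raise ValueError("no allocation meets the time budget")
    return min(best.values(), key=lambda v: v[0])[1]
```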
- An Information Theory-inspired Strategy for Automatic Network Pruning [88.51235160841377]
Deep convolutional neural networks typically need to be compressed for deployment on devices with resource constraints.
Most existing network pruning methods require laborious human efforts and prohibitive computation resources.
We propose an information theory-inspired strategy for automatic model compression.
arXiv Detail & Related papers (2021-08-19T07:03:22Z)
- Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which jointly applies channel pruning and tensor decomposition to compress CNN models.
We achieve 52.9% FLOPs reduction by removing 48.4% parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z)
- Effective and Fast: A Novel Sequential Single Path Search for Mixed-Precision Quantization [45.22093693422085]
Mixed-precision quantization assigns different bit-precisions to different layers according to their sensitivity, which can yield strong performance.
Quickly determining the quantization bit-precision of each layer in a deep neural network under given constraints is a difficult problem.
We propose a novel sequential single path search (SSPS) method for mixed-precision quantization.
arXiv Detail & Related papers (2021-03-04T09:15:08Z)
- Single-path Bit Sharing for Automatic Loss-aware Model Compression [126.98903867768732]
Single-path Bit Sharing (SBS) is able to significantly reduce computational cost while achieving promising performance.
Our SBS-compressed MobileNetV2 achieves a 22.6x Bit-Operation (BOP) reduction with only a 0.1% drop in Top-1 accuracy.
arXiv Detail & Related papers (2021-01-13T08:28:21Z)
- ISTA-NAS: Efficient and Consistent Neural Architecture Search by Sparse Coding [86.40042104698792]
We formulate neural architecture search as a sparse coding problem.
In experiments, our two-stage method on CIFAR-10 requires only 0.05 GPU-days for the search.
Our one-stage method produces state-of-the-art performance on both CIFAR-10 and ImageNet at the cost of only evaluation time.
arXiv Detail & Related papers (2020-10-13T04:34:24Z)
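For readers unfamiliar with the sparse coding machinery ISTA-NAS builds on, the classic ISTA iteration it is named after solves min_x 0.5*||Ax - b||^2 + lam*||x||_1 by alternating a gradient step with soft-thresholding. This is the generic algorithm, not the paper's NAS-specific formulation.

```python
import numpy as np

def ista(A, b, lam, iters=500):
    """Iterative Shrinkage-Thresholding Algorithm for sparse coding."""
    # Step size from the Lipschitz constant of the gradient, L = ||A||_2^2.
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)     # gradient of the smooth data term
        z = x - grad / L             # plain gradient step
        # Proximal step for the l1 penalty: soft-thresholding.
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return x
```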
- Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors [5.609098985493794]
We introduce a method for designing optimally heterogeneously quantized versions of deep neural network models for minimum-energy, high-accuracy, nanosecond inference and fully automated deployment on chip.
This is crucial for the event selection procedure in proton-proton collisions at the CERN Large Hadron Collider, where resources are strictly limited and a latency of $\mathcal{O}(1)\,\mu$s is required.
arXiv Detail & Related papers (2020-06-15T15:07:49Z)