Hardware-Centric AutoML for Mixed-Precision Quantization
- URL: http://arxiv.org/abs/2008.04878v1
- Date: Tue, 11 Aug 2020 17:30:22 GMT
- Title: Hardware-Centric AutoML for Mixed-Precision Quantization
- Authors: Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, Song Han
- Abstract summary: Conventional quantization algorithms ignore differences among hardware architectures and quantize all layers in a uniform way.
In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ) framework, which leverages reinforcement learning to automatically determine the quantization policy.
Our framework reduces latency by 1.4-1.95x and energy consumption by 1.9x with negligible loss of accuracy compared with fixed-bitwidth (8-bit) quantization.
- Score: 34.39845532939529
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model quantization is a widely used technique to compress and accelerate deep
neural network (DNN) inference. Emerging DNN hardware accelerators are beginning to
support mixed precision (1-8 bits) to further improve computation efficiency, which
raises a great challenge: finding the optimal bitwidth for each layer requires domain
experts to explore a vast design space, trading off accuracy, latency, energy, and
model size, which is both time-consuming and sub-optimal. Conventional quantization
algorithms ignore the differences among hardware architectures and quantize all layers
in a uniform way. In this paper, we introduce the Hardware-Aware Automated Quantization
(HAQ) framework, which leverages reinforcement learning to automatically determine the
quantization policy and takes the hardware accelerator's feedback into the design loop.
Rather than relying on proxy signals such as FLOPs and model size, we employ a hardware
simulator to generate direct feedback signals (latency and energy) for the RL agent.
Compared with conventional methods, our framework is fully automated and can specialize
the quantization policy for different neural network and hardware architectures. It
reduces latency by 1.4-1.95x and energy consumption by 1.9x with negligible loss of
accuracy compared with fixed-bitwidth (8-bit) quantization. Our framework reveals that
the optimal policies on different hardware architectures (i.e., edge and cloud
architectures) under different resource constraints (i.e., latency, energy, and model
size) are drastically different. We interpret the implications of different quantization
policies, offering insights for both neural network architecture design and hardware
architecture design.
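To make the search loop concrete, the following is a minimal sketch of the feedback structure described in the abstract: a policy of per-layer bitwidths (1-8 bits) is proposed, a hardware simulator returns latency and energy directly (rather than proxies such as FLOPs or model size), the policy is trimmed to fit a resource budget, and accuracy drives the reward. The cost model, accuracy estimate, and random-search "agent" here are hypothetical stand-ins, not the paper's actual RL agent or hardware simulator.

```python
# Minimal sketch of an HAQ-style search loop (not the authors' code).
# The simulator, accuracy model, and agent below are hypothetical stand-ins
# that only illustrate the feedback structure described in the abstract.
import random

NUM_LAYERS = 10           # assumed toy network depth
LATENCY_BUDGET_MS = 25.0  # assumed resource constraint

def simulate_hardware(bitwidths):
    """Hypothetical hardware simulator: returns (latency_ms, energy_mj)."""
    latency = sum(0.4 * b for b in bitwidths)      # toy cost model
    energy = sum(0.15 * b * b for b in bitwidths)  # toy cost model
    return latency, energy

def evaluate_accuracy(bitwidths):
    """Hypothetical stand-in for quantizing and evaluating the network."""
    return 0.76 - sum(0.01 * max(0, 4 - b) for b in bitwidths)

def propose_policy():
    """Stand-in 'agent': random per-layer bitwidths in the 1-8 bit range."""
    return [random.randint(1, 8) for _ in range(NUM_LAYERS)]

best_reward, best_policy = float("-inf"), None
for episode in range(200):
    policy = propose_policy()
    latency, energy = simulate_hardware(policy)
    # Enforce the latency budget by greedily lowering the widest layer,
    # mirroring HAQ's constrained search over bitwidths.
    while latency > LATENCY_BUDGET_MS and max(policy) > 1:
        widest = policy.index(max(policy))
        policy[widest] -= 1
        latency, energy = simulate_hardware(policy)
    reward = evaluate_accuracy(policy)  # accuracy-driven reward
    if reward > best_reward:
        best_reward, best_policy = reward, policy

print("best per-layer bitwidths:", best_policy)
```

In the real framework, the random proposal step is replaced by an RL agent whose policy improves from the simulator's latency/energy feedback and the measured accuracy.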
Related papers
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this challenge by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
- Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers [56.37495946212932]
Vision transformers (ViTs) have demonstrated superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs).
This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs.
arXiv Detail & Related papers (2024-07-25T16:35:46Z)
- Quantization of Deep Neural Networks to facilitate self-correction of weights on Phase Change Memory-based analog hardware [0.0]
We develop an algorithm to approximate a set of multiplicative weights.
These weights aim to represent the original network's weights with minimal loss in performance.
Our results demonstrate that, when paired with an on-chip pulse generator, our self-correcting neural network performs comparably to those trained with analog-aware algorithms.
arXiv Detail & Related papers (2023-09-30T10:47:25Z)
- Biologically Plausible Learning on Neuromorphic Hardware Architectures [27.138481022472]
Neuromorphic computing is an emerging paradigm that confronts this imbalance by performing computations directly in analog memories.
This work is the first to compare the impact of different learning algorithms on Compute-In-Memory-based hardware and vice versa.
arXiv Detail & Related papers (2022-12-29T15:10:59Z)
- HQNAS: Auto CNN deployment framework for joint quantization and architecture search [30.45926484863791]
We propose a novel neural network design framework called Hardware-aware Quantized Neural Architecture Search (HQNAS).
It takes only 4 GPU hours to discover an outstanding NN policy on CIFAR10.
It also takes only 10% of the GPU time to generate a comparable model on ImageNet.
arXiv Detail & Related papers (2022-10-16T08:32:18Z)
- FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using around 40% of the available hardware resources in total.
It reduces classification time by three orders of magnitude, with a small 4.5% impact on accuracy, compared to its full-precision software counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on a latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Reconfigurable co-processor architecture with limited numerical precision to accelerate deep convolutional neural networks [0.38848561367220275]
Convolutional Neural Networks (CNNs) are widely used in deep learning applications, e.g., visual systems and robotics.
Here, we present a model-independent reconfigurable co-processing architecture to accelerate CNNs.
In contrast to existing solutions, we introduce limited-precision 32-bit Q-format fixed-point quantization for arithmetic representations and operations (a brief Q-format sketch is given after this list).
arXiv Detail & Related papers (2021-08-21T09:50:54Z)
- Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications.
We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
arXiv Detail & Related papers (2021-02-22T19:00:05Z)
- Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search [112.05977301976613]
We propose to combine Network Architecture Search methods with quantization to enjoy the merits of both sides.
We first propose the joint training of architecture and quantization with a shared step size to acquire a large number of quantized models.
Then a bit-inheritance scheme is introduced to transfer the quantized models to the lower bit, which further reduces the time cost and improves the quantization accuracy.
arXiv Detail & Related papers (2020-10-09T03:52:16Z)
- Learned Hardware/Software Co-Design of Neural Accelerators [20.929918108940093]
Deep learning software stacks and hardware accelerators are diverse and vast.
Prior work considers software optimizations separately from hardware architectures, effectively reducing the search space.
This paper casts the problem as hardware/software co-design, with the goal of automatically identifying desirable points in the joint design space.
arXiv Detail & Related papers (2020-10-05T15:12:52Z)
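The reconfigurable co-processor entry above mentions 32-bit Q-format fixed-point arithmetic. As a minimal sketch of how Q-format quantization and a multiply-accumulate operation work, the example below assumes a Q16.16 layout (16 integer bits, 16 fractional bits); the exact split used in that paper is not stated here, so this is illustrative only.

```python
# Minimal sketch of Q-format fixed-point arithmetic (illustrative only).
# A Q16.16 layout is assumed: values are stored as 32-bit integers scaled
# by 2**16, so 16 bits carry the integer part and 16 bits the fraction.
FRAC_BITS = 16
SCALE = 1 << FRAC_BITS

def to_q(x: float) -> int:
    """Encode a float as a 32-bit Q16.16 fixed-point integer."""
    q = int(round(x * SCALE))
    # Saturate to the signed 32-bit range instead of wrapping around.
    return max(-(1 << 31), min((1 << 31) - 1, q))

def from_q(q: int) -> float:
    """Decode a Q16.16 integer back to a float."""
    return q / SCALE

def q_mul(a: int, b: int) -> int:
    """Fixed-point multiply: the raw product has 2*FRAC_BITS fractional
    bits, so shift right to return to Q16.16."""
    return (a * b) >> FRAC_BITS

# Example: a tiny dot product, as a convolution MAC unit would compute it.
w = [to_q(0.25), to_q(-1.5)]
x = [to_q(2.0), to_q(0.5)]
acc = 0
for wi, xi in zip(w, x):
    acc += q_mul(wi, xi)
print(from_q(acc))  # 0.25*2.0 + (-1.5)*0.5 = -0.25
```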