FrostNet: Towards Quantization-Aware Network Architecture Search
- URL: http://arxiv.org/abs/2006.09679v4
- Date: Mon, 30 Nov 2020 10:09:33 GMT
- Title: FrostNet: Towards Quantization-Aware Network Architecture Search
- Authors: Taehoon Kim, YoungJoon Yoo, Jihoon Yang
- Abstract summary: We present a new network architecture search (NAS) procedure to find a network that guarantees both full-precision (FLOAT32) and quantized (INT8) performance.
Our FrostNets achieve higher recognition accuracy than existing CNNs with comparable latency when quantized.
- Score: 8.713741951284886
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: INT8 quantization has become one of the standard techniques for deploying
convolutional neural networks (CNNs) on edge devices to reduce memory and
computational resource usage. By analyzing the quantized performance of existing
mobile-target network architectures, we highlight the importance of the network
architecture itself for optimal INT8 quantization. In this paper, we present a new
network architecture search (NAS) procedure to find a network that guarantees both
full-precision (FLOAT32) and quantized (INT8) performance. We first propose a
critical but straightforward optimization method that enables quantization-aware
training (QAT): floating-point statistic assisting (StatAssist) and stochastic
gradient boosting (GradBoost). By integrating gradient-based NAS with StatAssist
and GradBoost, we discovered a quantization-efficient network building block, the
Frost bottleneck. We then used the Frost bottleneck as the building block for
hardware-aware NAS to obtain quantization-efficient networks, FrostNets, which show
improved quantization performance compared with other mobile-target networks while
maintaining competitive FLOAT32 performance. Our FrostNets achieve higher
recognition accuracy than existing CNNs with comparable latency when quantized,
owing to their higher latency reduction rate (65% on average).
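The abstract does not detail how StatAssist and GradBoost are implemented, so the sketch below only illustrates the generic quantization-aware training (QAT) setup the paper builds on: weights and activations are fake-quantized to INT8 during training, with a straight-through estimator for gradients. The module names and quantization scheme are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of INT8 quantization-aware training (QAT) with fake
# quantization; StatAssist/GradBoost specifics are not described here.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant(x, num_bits=8):
    """Simulate INT8 quantization in the forward pass; the straight-through
    estimator lets gradients flow as if quantization were the identity."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(x.min() / scale)
    x_q = (torch.clamp(torch.round(x / scale - zero_point), qmin, qmax)
           + zero_point) * scale
    return x + (x_q - x).detach()  # straight-through estimator

class QATConvBlock(nn.Module):
    """Conv-BN-ReLU block whose weights and activations are fake-quantized."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)

    def forward(self, x):
        w_q = fake_quant(self.conv.weight)           # quantize weights
        y = F.conv2d(x, w_q, padding=self.conv.padding)
        return fake_quant(F.relu(self.bn(y)))        # quantize activations

out = QATConvBlock(3, 16)(torch.randn(2, 3, 32, 32))
print(out.shape)  # torch.Size([2, 16, 32, 32])
```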
Related papers
- Scaling Up Quantization-Aware Neural Architecture Search for Efficient Deep Learning on the Edge [3.1878884714257008]
We present an approach to enable QA-NAS (INT8 and FB-MP) on large-scale tasks by leveraging the block-wise formulation introduced by block-wise NAS.
We demonstrate strong results for the semantic segmentation task on the Cityscapes dataset, finding FB-MP models 33% smaller and INT8 models 17.6% faster than DeepLabV3 (INT8) without compromising task performance.
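The summary only names the block-wise formulation; as a hedged illustration of that general idea, the sketch below scores each block of a toy supernet independently so the quantization-aware evaluation cost does not grow combinatorially with the full network. The search space and `evaluate_int8` scorer are hypothetical placeholders.

```python
# Schematic sketch of block-wise quantization-aware NAS: each block is
# searched in isolation, then the best per-block candidates are stitched
# together, instead of scoring every full-network combination.
SEARCH_SPACE = {                       # per-block candidate operations
    "block1": ["mbconv3_e3", "mbconv5_e6"],
    "block2": ["mbconv3_e6", "mbconv5_e3"],
    "block3": ["mbconv3_e3", "mbconv5_e6"],
}

def evaluate_int8(block_name: str, candidate: str) -> float:
    """Placeholder: would calibrate/fine-tune the candidate block with QAT
    and return a quantized (INT8) proxy score, e.g. a block-wise loss."""
    return (hash((block_name, candidate)) % 1000) / 1000.0

best_blocks = {
    name: max(candidates, key=lambda c: evaluate_int8(name, c))
    for name, candidates in SEARCH_SPACE.items()
}
print("selected per-block operations:", best_blocks)
```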
arXiv Detail & Related papers (2024-01-22T20:32:31Z)
- SpaceEvo: Hardware-Friendly Search Space Design for Efficient INT8 Inference [15.94147346105013]
We propose SpaceEvo, an automatic method for designing a dedicated, quantization-friendly search space for each target hardware.
We show that SpaceEvo consistently outperforms existing manually-designed search spaces with up to 2.5x faster speed while achieving the same accuracy.
arXiv Detail & Related papers (2023-03-15T01:41:21Z)
- FxP-QNet: A Post-Training Quantizer for the Design of Mixed Low-Precision DNNs with Dynamic Fixed-Point Representation [2.4149105714758545]
We propose a novel framework referred to as the Fixed-Point Quantizer of deep neural Networks (FxP-QNet).
FxP-QNet adapts the quantization level for each data-structure of each layer based on the trade-off between the network accuracy and the low-precision requirements.
Results show that FxP-QNet-quantized AlexNet, VGG-16, and ResNet-18 reduce the overall memory requirements of their full-precision counterparts by 7.16x, 10.36x, and 6.44x with less than 0.95%, 0.95%, and 1.99% accuracy drop, respectively.
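As a rough illustration of dynamic fixed-point quantization (not FxP-QNet's actual bit-allocation procedure), the sketch below quantizes a tensor at a fixed word length and picks the fractional length that minimizes mean-squared error; all names are illustrative.

```python
# Dynamic fixed-point sketch: choose the per-tensor fractional length that
# best trades range against resolution for a given word length.
import numpy as np

def to_fixed_point(x: np.ndarray, word_len: int, frac_len: int) -> np.ndarray:
    """Quantize to signed fixed-point: `word_len` bits, `frac_len` fractional."""
    scale = 2.0 ** frac_len
    qmax = 2 ** (word_len - 1) - 1
    qmin = -2 ** (word_len - 1)
    return np.clip(np.round(x * scale), qmin, qmax) / scale

def best_frac_len(x: np.ndarray, word_len: int = 8) -> int:
    """Pick the fractional length with the lowest mean-squared error."""
    errors = {f: float(np.mean((x - to_fixed_point(x, word_len, f)) ** 2))
              for f in range(word_len)}
    return min(errors, key=errors.get)

weights = np.random.randn(256, 128).astype(np.float32) * 0.1
f = best_frac_len(weights, word_len=8)
print(f"chosen fractional length {f} (MSE-optimal) for this layer")
```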
arXiv Detail & Related papers (2022-03-22T23:01:43Z)
- FLASH: Fast Neural Architecture Search with Hardware Optimization [7.263481020106725]
Neural architecture search (NAS) is a promising technique for designing efficient, high-performance deep neural networks (DNNs).
This paper proposes FLASH, a very fast NAS methodology that co-optimizes the DNN accuracy and performance on a real hardware platform.
arXiv Detail & Related papers (2021-08-01T23:46:48Z)
- Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications.
We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
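The paper studies the interplay of pruning and quantization during training; the sketch below only illustrates that combination in its simplest form, applying a magnitude-based pruning mask and fake-quantizing the surviving weights in the same step. The sparsity schedule and quantizer here are assumptions, not the paper's recipe.

```python
# Minimal quantization-aware pruning step: magnitude pruning followed by
# symmetric fake quantization of the remaining weights.
import torch

def prune_and_fake_quant(w: torch.Tensor, sparsity: float = 0.5,
                         num_bits: int = 8) -> torch.Tensor:
    # 1) magnitude pruning: zero out the smallest-magnitude weights
    k = max(int(sparsity * w.numel()), 1)
    threshold = w.abs().flatten().kthvalue(k).values
    w_pruned = w * (w.abs() > threshold).float()
    # 2) symmetric fake quantization of the surviving weights
    qmax = 2 ** (num_bits - 1) - 1
    scale = w_pruned.abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w_pruned / scale), -qmax - 1, qmax) * scale
    # straight-through estimator keeps gradients for the unpruned weights
    return w_pruned + (w_q - w_pruned).detach()

w = torch.randn(64, 32)
w_hat = prune_and_fake_quant(w, sparsity=0.5)
print(float((w_hat == 0).float().mean()))   # roughly 0.5 sparsity
```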
arXiv Detail & Related papers (2021-02-22T19:00:05Z)
- Subtensor Quantization for Mobilenets [5.735035463793008]
Quantization for deep neural networks (DNNs) has enabled developers to deploy models with less memory and more efficient low-power inference.
In this paper, we analyzed several root causes of quantization loss and proposed alternatives that do not rely on per-channel or training-aware approaches.
We evaluate the image classification task on the ImageNet dataset, and the top-1 accuracy of our post-training quantized 8-bit inference is within 0.7% of the floating-point version.
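For context, the sketch below shows plain per-tensor (not per-channel, not training-aware) post-training INT8 quantization with a small calibration pass, i.e. the kind of baseline pipeline such training-free approaches build on. The calibration rule and shapes are illustrative only.

```python
# Per-tensor post-training INT8 quantization with a calibration pass.
import numpy as np

def calibrate_scale(activations: np.ndarray, num_bits: int = 8):
    """Derive an asymmetric per-tensor scale/zero-point from calibration data."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = float(activations.min()), float(activations.max())
    scale = max(hi - lo, 1e-8) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

def quantize(x, scale, zero_point):
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

calib = np.random.rand(1024).astype(np.float32) * 6.0   # e.g. ReLU6 outputs
s, zp = calibrate_scale(calib)
x = np.random.rand(16).astype(np.float32) * 6.0
print(np.abs(x - dequantize(quantize(x, s, zp), s, zp)).max())
```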
arXiv Detail & Related papers (2020-11-04T15:41:47Z)
- Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search [112.05977301976613]
We propose to combine network architecture search (NAS) with quantization to enjoy the merits of both sides.
We first propose the joint training of architecture and quantization with a shared step size to acquire a large number of quantized models.
Then a bit-inheritance scheme is introduced to transfer the quantized models to lower bit-widths, which further reduces the time cost and improves the quantization accuracy.
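The summary describes a shared step size and a bit-inheritance scheme; the sketch below illustrates one plausible form of such inheritance, handing a higher-bit quantizer's clipping range down to a lower bit-width as a warm start. The exact rule used in the paper may differ.

```python
# Illustrative bit-inheritance: keep the clipping range learned at a higher
# bit-width and re-derive the step size for the lower bit-width.
from dataclasses import dataclass

@dataclass
class UniformQuantizer:
    num_bits: int
    step_size: float          # learned (shared) step size

    @property
    def clip_range(self) -> float:
        return self.step_size * (2 ** (self.num_bits - 1) - 1)

    def inherit(self, lower_bits: int) -> "UniformQuantizer":
        """Preserve the clipping range; spread it over fewer levels."""
        new_step = self.clip_range / (2 ** (lower_bits - 1) - 1)
        return UniformQuantizer(num_bits=lower_bits, step_size=new_step)

q4 = UniformQuantizer(num_bits=4, step_size=0.05)   # trained at 4 bits
q3 = q4.inherit(3)                                  # warm start for 3 bits
print(q4.clip_range, q3.clip_range, q3.step_size)
```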
arXiv Detail & Related papers (2020-10-09T03:52:16Z)
- LC-NAS: Latency Constrained Neural Architecture Search for Point Cloud Networks [73.78551758828294]
LC-NAS is able to find state-of-the-art architectures for point cloud classification with minimal computational cost.
We show how our searched architectures achieve any desired latency with a reasonably low drop in accuracy.
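One common way to make a search hit "any desired latency" is to fold the latency target into the candidate-scoring objective; the sketch below shows such a penalized score. The penalty form is a generic choice for illustration, not necessarily LC-NAS's formulation.

```python
# Latency-constrained candidate scoring: accuracy is traded off against how
# far a candidate overshoots the latency target.
def constrained_score(accuracy: float, latency_ms: float,
                      target_ms: float, penalty: float = 0.1) -> float:
    """Higher is better; latency above the target is penalized linearly."""
    overshoot = max(0.0, latency_ms - target_ms)
    return accuracy - penalty * overshoot

candidates = [
    {"name": "arch_a", "acc": 0.91, "lat": 14.0},
    {"name": "arch_b", "acc": 0.93, "lat": 25.0},   # accurate but too slow
    {"name": "arch_c", "acc": 0.90, "lat": 9.0},
]
best = max(candidates,
           key=lambda c: constrained_score(c["acc"], c["lat"], target_ms=15.0))
print(best["name"])   # arch_a: best accuracy within the latency budget
```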
arXiv Detail & Related papers (2020-08-24T10:30:21Z)
- FBNetV3: Joint Architecture-Recipe Search using Predictor Pretraining [65.39532971991778]
We present an accuracy predictor that scores architectures and training recipes jointly, guiding both sample selection and ranking.
We run fast evolutionary searches in just CPU minutes to generate architecture-recipe pairs for a variety of resource constraints.
FBNetV3 is a family of state-of-the-art compact neural networks that outperform both automatically and manually designed competitors.
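The sketch below illustrates the general pattern of predictor-guided evolutionary search over joint (architecture, training-recipe) candidates: the predictor ranks candidates cheaply and mutation produces new pairs. The encoding and predictor are placeholders, not FBNetV3's actual design.

```python
# Predictor-guided evolutionary search over (architecture, recipe) pairs.
import random

def predict_accuracy(arch: list, recipe: dict) -> float:
    """Placeholder for a pretrained accuracy predictor."""
    return sum(arch) / (10 * len(arch)) + 0.1 * recipe["lr"]

def mutate(arch: list, recipe: dict):
    arch = arch[:]
    arch[random.randrange(len(arch))] = random.randint(1, 9)
    recipe = dict(recipe, lr=min(1.0, recipe["lr"] * random.choice([0.5, 2.0])))
    return arch, recipe

population = [([random.randint(1, 9) for _ in range(6)], {"lr": 0.1})
              for _ in range(16)]
for _ in range(20):                       # CPU-cheap generations
    children = [mutate(a, r) for a, r in population]
    population = sorted(population + children,
                        key=lambda p: predict_accuracy(*p),
                        reverse=True)[:16]
best_arch, best_recipe = population[0]
print(best_arch, best_recipe)
```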
arXiv Detail & Related papers (2020-06-03T05:20:21Z)
- Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantized neural networks (QNNs) are very attractive to industry because of their extremely low computation and storage overhead, but their performance is still worse than that of networks with full-precision parameters.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features in original full-precision networks to high-dimensional quantization features.
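As a hedged reading of "projecting features to high-dimensional quantization features", the sketch below widens the feature dimension before low-bit quantization so less information is lost, then squeezes back. The layer shapes, widening factor, and bit-width are illustrative assumptions, not the paper's design.

```python
# Widen -> quantize -> squeeze illustration with simple 1x1 projections.
import torch
import torch.nn as nn

def fake_quant(x: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    qmax = 2 ** num_bits - 1
    lo = x.min()
    scale = (x.max() - lo).clamp(min=1e-8) / qmax
    x_q = torch.clamp(torch.round((x - lo) / scale), 0, qmax) * scale + lo
    return x + (x_q - x).detach()   # straight-through estimator

class WidenQuantSqueeze(nn.Module):
    def __init__(self, channels: int, widen: int = 4):
        super().__init__()
        self.widen = nn.Conv2d(channels, channels * widen, 1)    # project up
        self.squeeze = nn.Conv2d(channels * widen, channels, 1)  # project back

    def forward(self, x):
        wide = fake_quant(self.widen(x))    # quantize in the wide space
        return self.squeeze(wide)

y = WidenQuantSqueeze(16)(torch.randn(1, 16, 8, 8))
print(y.shape)   # torch.Size([1, 16, 8, 8])
```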
arXiv Detail & Related papers (2020-02-03T04:11:13Z)
- Towards Unified INT8 Training for Convolutional Neural Network [83.15673050981624]
We build a unified 8-bit (INT8) training framework for common convolutional neural networks.
First, we empirically identify four distinctive characteristics of gradients, which provide insightful clues for gradient quantization.
We then propose two universal techniques, including Direction Sensitive Gradient Clipping, which reduces the direction deviation of gradients.
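The sketch below illustrates the direction-deviation measurement the summary refers to: the cosine distance between a gradient tensor and its quantized version, with the clipping range chosen to keep that deviation small. The concrete clipping rule in the paper may differ.

```python
# Direction-sensitive gradient clipping illustration: pick the clipping range
# whose INT8-quantized gradient deviates least (in direction) from the original.
import torch
import torch.nn.functional as F

def quantize_int8(g: torch.Tensor, clip: float) -> torch.Tensor:
    scale = clip / 127.0
    return torch.clamp(torch.round(g / scale), -127, 127) * scale

def direction_deviation(g: torch.Tensor, g_q: torch.Tensor) -> float:
    """Cosine distance between the original and quantized gradient."""
    return float(1.0 - F.cosine_similarity(g.flatten(), g_q.flatten(), dim=0))

g = torch.randn(1024) * 0.01
candidates = [g.abs().max().item() * r for r in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5)]
best_clip = min(candidates,
                key=lambda c: direction_deviation(g, quantize_int8(g, c)))
print(best_clip, direction_deviation(g, quantize_int8(g, best_clip)))
```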
arXiv Detail & Related papers (2019-12-29T08:37:53Z)