BatchQuant: Quantized-for-all Architecture Search with Robust Quantizer
- URL: http://arxiv.org/abs/2105.08952v1
- Date: Wed, 19 May 2021 06:56:43 GMT
- Title: BatchQuant: Quantized-for-all Architecture Search with Robust Quantizer
- Authors: Haoping Bai, Meng Cao, Ping Huang, Jiulong Shan
- Abstract summary: BatchQuant is a robust quantizer formulation that allows fast and stable training of a compact, single-shot, mixed-precision, weight-sharing supernet.
We demonstrate the effectiveness of our method on ImageNet and achieve SOTA Top-1 accuracy under a low complexity constraint.
- Score: 10.483508279350195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the applications of deep learning models on edge devices increase at an
accelerating pace, fast adaptation to various scenarios with varying resource
constraints has become a crucial aspect of model deployment. As a result, model
optimization strategies with adaptive configuration are becoming increasingly
popular. While single-shot quantized neural architecture search enjoys
flexibility in both model architecture and quantization policy, the combined
search space comes with many challenges, including instability when training
the weight-sharing supernet and difficulty in navigating the exponentially
growing search space. Existing methods tend to either limit the architecture
search space to a small set of options or limit the quantization policy search
space to fixed precision policies. To this end, we propose BatchQuant, a robust
quantizer formulation that allows fast and stable training of a compact,
single-shot, mixed-precision, weight-sharing supernet. We employ BatchQuant to
train a compact supernet (offering over $10^{76}$ quantized subnets) within
substantially fewer GPU hours than previous methods. Our approach,
Quantized-for-all (QFA), is the first to seamlessly extend one-shot
weight-sharing NAS supernet to support subnets with arbitrary ultra-low
bitwidth mixed-precision quantization policies without retraining. QFA opens up
new possibilities in joint hardware-aware neural architecture search and
quantization. We demonstrate the effectiveness of our method on ImageNet and
achieve SOTA Top-1 accuracy under a low complexity constraint ($<20$ MFLOPs).
The code and models will be made publicly available at
https://github.com/bhpfelix/QFA.
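The abstract does not spell out the quantizer formulation, so the sketch below is only one plausible reading: an activation quantizer whose clipping range is estimated from batch statistics (in the spirit of BatchNorm running statistics) and trained with a straight-through estimator. The class name BatchStatQuantizer, the momentum parameter, and the min/max range estimate are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn

class BatchStatQuantizer(nn.Module):
    """Illustrative activation quantizer whose clipping range comes from batch
    statistics (a hedged sketch; the actual BatchQuant formulation may differ)."""

    def __init__(self, n_bits: int = 4, momentum: float = 0.1):
        super().__init__()
        self.n_bits = n_bits
        self.momentum = momentum
        # Running range estimates, updated like BatchNorm's running statistics.
        self.register_buffer("running_min", torch.zeros(1))
        self.register_buffer("running_max", torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            lo, hi = x.min().detach(), x.max().detach()
            self.running_min.mul_(1 - self.momentum).add_(self.momentum * lo)
            self.running_max.mul_(1 - self.momentum).add_(self.momentum * hi)
        else:
            lo, hi = self.running_min, self.running_max

        # Uniform affine quantization with a straight-through estimator (STE).
        n_levels = 2 ** self.n_bits - 1
        scale = (hi - lo).clamp(min=1e-8) / n_levels
        x_q = torch.clamp(torch.round((x - lo) / scale), 0, n_levels) * scale + lo
        return x + (x_q - x).detach()  # gradients flow through x unchanged
```

In a mixed-precision, weight-sharing supernet, n_bits would be re-sampled for every subnet, which is where a batch-statistic range estimate helps keep training stable as the quantization grid changes from step to step.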
Related papers
- Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z) - DNA Family: Boosting Weight-Sharing NAS with Block-Wise Supervisions [121.05720140641189]
We develop a family of models with the distilling neural architecture (DNA) techniques.
Our proposed DNA models can rate all architecture candidates, as opposed to previous works that can only access a sub-search space using heuristic algorithms.
Our models achieve state-of-the-art top-1 accuracy of 78.9% and 83.6% on ImageNet for a mobile convolutional network and a small vision transformer, respectively.
arXiv Detail & Related papers (2024-03-02T22:16:47Z) - SimQ-NAS: Simultaneous Quantization Policy and Neural Architecture Search [6.121126813817338]
Recent one-shot Neural Architecture Search algorithms rely on training a hardware-agnostic super-network tailored to a specific task and then extracting efficient sub-networks for different hardware platforms.
We show that by using multi-objective search algorithms paired with lightly trained predictors, we can efficiently search for both the sub-network architecture and the corresponding quantization policy.
arXiv Detail & Related papers (2023-12-19T22:08:49Z) - EQ-Net: Elastic Quantization Neural Networks [15.289359357583079]
Elastic Quantization Neural Networks (EQ-Net) aim to train a robust weight-sharing quantization supernet.
We propose an elastic quantization space (including elastic bit-width, granularity, and symmetry) to adapt to various mainstream quantization schemes.
We incorporate genetic algorithms and the proposed Conditional Quantization-Aware Accuracy Predictor (CQAP) as an estimator to quickly search for mixed-precision quantized neural networks within the supernet.
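The summary names a genetic-algorithm search driven by an accuracy predictor but gives no details; the toy loop below shows the general shape of such a search over per-layer bitwidth assignments. The predict_accuracy stand-in, the bit choices, and all hyperparameters are hypothetical placeholders for the CQAP and for EQ-Net's real search space, not the paper's implementation.

```python
import random

def evolutionary_search(n_layers=20, bit_choices=(2, 3, 4, 8),
                        population=50, generations=20, mutate_prob=0.1,
                        predict_accuracy=lambda cfg: -sum(abs(b - 4) for b in cfg)):
    """Toy evolutionary search over mixed-precision policies, scored by a
    stand-in accuracy predictor (illustrative only)."""
    pop = [[random.choice(bit_choices) for _ in range(n_layers)]
           for _ in range(population)]
    for _ in range(generations):
        parents = sorted(pop, key=predict_accuracy, reverse=True)[: population // 2]
        children = []
        while len(children) < population - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_layers)              # single-point crossover
            child = [random.choice(bit_choices) if random.random() < mutate_prob else g
                     for g in a[:cut] + b[cut:]]             # random mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=predict_accuracy)
```

In practice the predictor would be trained on accuracy samples drawn from the quantization supernet, and hardware constraints would enter as additional objectives or filters on the candidate policies.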
arXiv Detail & Related papers (2023-08-15T08:57:03Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
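As a rough illustration of the Dense-and-Sparse idea (not the paper's implementation), the snippet below splits a weight matrix into a small sparse tensor of large-magnitude outliers and a quantized dense remainder. SqueezeLLM selects sensitive values using second-order information and quantizes to non-uniform grids, so the magnitude threshold and uniform grid here are simplifying assumptions.

```python
import torch

def dense_and_sparse_split(w: torch.Tensor, outlier_frac: float = 0.005, n_bits: int = 3):
    """Toy Dense-and-Sparse decomposition: keep the largest-magnitude weights in a
    sparse tensor and uniformly quantize the dense remainder to n_bits."""
    k = max(1, int(outlier_frac * w.numel()))
    threshold = w.abs().flatten().topk(k).values.min()
    outlier_mask = w.abs() >= threshold

    # Sparse part: outliers kept at full precision in a sparse layout.
    sparse_part = torch.where(outlier_mask, w, torch.zeros_like(w)).to_sparse()

    # Dense part: remaining weights, quantized to a uniform n_bits grid.
    dense = torch.where(outlier_mask, torch.zeros_like(w), w)
    n_levels = 2 ** n_bits - 1
    lo, hi = dense.min(), dense.max()
    scale = (hi - lo).clamp(min=1e-8) / n_levels
    dense_q = torch.clamp(torch.round((dense - lo) / scale), 0, n_levels) * scale + lo
    dense_q = torch.where(outlier_mask, torch.zeros_like(w), dense_q)

    # Approximate reconstruction: dense_q + sparse_part.to_dense()
    return dense_q, sparse_part
```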
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - QuantNAS for super resolution: searching for efficient quantization-friendly architectures against quantization noise [19.897685398009912]
We propose QuantNAS, a novel quantization-aware NAS procedure.
We use entropy regularization, quantization noise, and an Adaptive Deviation for Quantization (ADQ) module to enhance the search procedure.
The proposed procedure is 30% faster than direct weight quantization and is more stable.
arXiv Detail & Related papers (2022-08-31T13:12:16Z) - Generalizing Few-Shot NAS with Gradient Matching [165.5690495295074]
One-Shot methods train one supernet to approximate the performance of every architecture in the search space via weight-sharing.
Few-Shot NAS reduces the level of weight-sharing by splitting the One-Shot supernet into multiple separated sub-supernets.
The proposed gradient-matching approach significantly outperforms its Few-Shot counterparts while surpassing previous comparable methods in terms of the accuracy of derived architectures.
arXiv Detail & Related papers (2022-03-29T03:06:16Z) - Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons.
We experimentally validate our method on various benchmark datasets and network architectures.
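The summary describes DropBits only at a high level (randomly dropping bits rather than neurons). One toy interpretation, shown below, is a quantizer that occasionally quantizes at a randomly reduced bitwidth during training; the function name, drop probability, and uniform grid are assumptions for illustration, not the method as published.

```python
import torch

def quantize_with_bit_drop(x: torch.Tensor, n_bits: int = 8,
                           drop_prob: float = 0.2, training: bool = True) -> torch.Tensor:
    """Uniform quantizer that, with probability drop_prob during training, 'drops'
    low-order bits by quantizing on a randomly coarsened grid (toy analogue of
    bit-drop regularization)."""
    if training and n_bits > 1 and torch.rand(()) < drop_prob:
        n_bits = int(torch.randint(1, n_bits, ()).item())  # coarser grid this pass
    n_levels = 2 ** n_bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp(min=1e-8) / n_levels
    return torch.clamp(torch.round((x - lo) / scale), 0, n_levels) * scale + lo
```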
arXiv Detail & Related papers (2021-09-05T15:15:07Z) - Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search [112.05977301976613]
We propose to combine Neural Architecture Search methods with quantization to enjoy the merits of both sides.
We first propose the joint training of architecture and quantization with a shared step size to acquire a large number of quantized models.
Then a bit-inheritance scheme is introduced to transfer the quantized models to the lower bit, which further reduces the time cost and improves the quantization accuracy.
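The bit-inheritance scheme is only named here; as a rough illustration, assuming the quantizer's clipping range is preserved when moving to a lower bitwidth, a lower-bit step size can be derived from a trained higher-bit one as follows (the published rule may differ):

```python
def inherit_step_size(step_high: float, bits_high: int, bits_low: int) -> float:
    """Derive a lower-bit step size from a trained higher-bit one by keeping the
    clipping range fixed: range = step * (2**bits - 1). Illustrative reading of
    'bit inheritance', not the paper's exact rule."""
    assert bits_low < bits_high
    clip_range = step_high * (2 ** bits_high - 1)
    return clip_range / (2 ** bits_low - 1)

# Example: an 8-bit step of 0.01 maps to a 4-bit step of 0.01 * 255 / 15 = 0.17.
```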
arXiv Detail & Related papers (2020-10-09T03:52:16Z) - Hardware-Centric AutoML for Mixed-Precision Quantization [34.39845532939529]
Conventional quantization algorithms ignore the different hardware architectures and quantize all the layers in a uniform way.
In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ) framework, which leverages reinforcement learning to automatically determine the quantization policy.
Our framework effectively reduced latency by 1.4-1.95x and energy consumption by 1.9x with negligible loss of accuracy compared with fixed-bitwidth (8-bit) quantization.
arXiv Detail & Related papers (2020-08-11T17:30:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.