Generative Design of Hardware-aware DNNs
- URL: http://arxiv.org/abs/2006.03968v2
- Date: Sun, 12 Jul 2020 23:30:23 GMT
- Title: Generative Design of Hardware-aware DNNs
- Authors: Sheng-Chun Kao, Arun Ramamurthy, Tushar Krishna
- Abstract summary: We propose a new approach to autonomous quantization and HW-aware tuning.
A generative model, AQGAN, takes a target accuracy as the condition and generates a suite of quantization configurations.
We evaluate our model on five of the widely-used efficient models on the ImageNet dataset.
- Score: 6.144349819246314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To efficiently run DNNs on the edge/cloud, many new DNN inference
accelerators are being designed and deployed frequently. To enhance the
resource efficiency of DNNs, model quantization is a widely-used approach.
However, different accelerators/HW have different resources, which calls for a
specialized quantization strategy for each HW. Moreover, using the same
quantization for every layer may be sub-optimal, which enlarges the design space
of possible quantization choices and makes manual tuning infeasible. Recent work
automatically determines the quantization for each layer using optimization
methods such as reinforcement learning. However, these approaches require
re-training the RL agent for every new HW platform. We propose a new approach to
autonomous quantization and HW-aware tuning: a generative model,
AQGAN, which takes a target accuracy as the condition and generates a suite of
quantization configurations. With the conditional generative model, the user
can autonomously generate different configurations for different targets at
inference time. Moreover, we propose a simplified HW-tuning flow, which uses
the generative model to generate proposals and then performs a simple selection
based on the HW resource budget; this process is fast and interactive. We evaluate
our model on five of the widely-used efficient models on the ImageNet dataset.
We compare with existing uniform quantization and state-of-the-art autonomous
quantization methods. Our generative model achieves competitive accuracy with
roughly two orders of magnitude lower search cost for each design point, and the
quantization configurations it generates lead to less than 3.5% error across all
experiments.
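A minimal sketch of this generate-then-select flow, assuming a toy stand-in for the trained conditional generator and a toy memory-footprint cost model (the names `generate_proposals`, `estimate_cost`, `hw_aware_select`, the layer sizes, and the bit-width choices below are illustrative, not from the paper):

```python
# Sketch of the generate-then-select flow (illustrative only).
# The real AQGAN is a trained conditional GAN; a random stand-in generator
# shows how proposals would be produced and filtered against a HW budget.
import numpy as np

LAYER_SIZES = [27_000, 1_180_000, 2_360_000, 590_000]  # hypothetical per-layer parameter counts
BIT_CHOICES = [2, 4, 6, 8]

def generate_proposals(target_acc, num_proposals=32, rng=None):
    """Stand-in for the conditional generator: given a target accuracy,
    emit candidate per-layer bit-width configurations."""
    rng = rng or np.random.default_rng(0)
    # Higher accuracy targets bias the sampler toward wider bit-widths.
    probs = np.array([1 - target_acc, 1 - target_acc, target_acc, target_acc])
    probs = probs / probs.sum()
    return [rng.choice(BIT_CHOICES, size=len(LAYER_SIZES), p=probs)
            for _ in range(num_proposals)]

def estimate_cost(config):
    """Toy HW cost model: total weight-memory footprint in bits."""
    return sum(int(b) * n for b, n in zip(config, LAYER_SIZES))

def hw_aware_select(target_acc, hw_budget_bits):
    """Simplified HW-tuning flow: generate proposals, keep those within the
    budget, and return the widest configuration that still fits."""
    feasible = [c for c in generate_proposals(target_acc)
                if estimate_cost(c) <= hw_budget_bits]
    if not feasible:
        return None  # budget too tight for this accuracy target
    return max(feasible, key=estimate_cost)

config = hw_aware_select(target_acc=0.72, hw_budget_bits=25_000_000)
print("selected per-layer bit-widths:", config)
```

In the paper's flow, the proposals would come from the trained AQGAN conditioned on the target accuracy; the selection step remains a simple budget filter, which is what keeps HW tuning fast and interactive.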
Related papers
- AdaQAT: Adaptive Bit-Width Quantization-Aware Training [0.873811641236639]
Large-scale deep neural networks (DNNs) have achieved remarkable success in many application scenarios.
Model quantization is a common approach to deal with deployment constraints, but searching for optimized bit-widths can be challenging.
We present Adaptive Bit-Width Quantization-Aware Training (AdaQAT), a learning-based method that automatically optimizes bit-widths during training for more efficient inference.
arXiv Detail & Related papers (2024-04-22T09:23:56Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
- A Model for Every User and Budget: Label-Free and Personalized Mixed-Precision Quantization [23.818922559567994]
We show that ASR models can be personalized during quantization while relying on just a small set of unlabelled samples from the target domain.
MyQASR generates tailored quantization schemes for diverse users under any memory requirement with no fine-tuning.
Results for large-scale ASR models show how myQASR improves performance for specific genders, languages, and speakers.
arXiv Detail & Related papers (2023-07-24T10:03:28Z)
- AutoQNN: An End-to-End Framework for Automatically Quantizing Neural Networks [6.495218751128902]
We propose an end-to-end framework named AutoQNN for automatically quantizing different layers using different schemes and bitwidths without any human labor.
QPL is the first method to learn mixed-precision policies by reparameterizing the bitwidths of quantizing schemes.
QAG is designed to convert arbitrary architectures into corresponding quantized ones without manual intervention.
arXiv Detail & Related papers (2023-04-07T11:14:21Z)
- A Framework for Demonstrating Practical Quantum Advantage: Racing Quantum against Classical Generative Models [62.997667081978825]
We build on a previously proposed framework for evaluating the generalization performance of generative models.
We establish the first comparative race towards practical quantum advantage (PQA) between classical and quantum generative models.
Our results suggest that QCBMs are more efficient in the data-limited regime than the other state-of-the-art classical generative models.
arXiv Detail & Related papers (2023-03-27T22:48:28Z)
- Vertical Layering of Quantized Neural Networks for Heterogeneous Inference [57.42762335081385]
We study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one.
We can theoretically obtain a network of any precision for on-demand service while only needing to train and maintain one model.
arXiv Detail & Related papers (2022-12-10T15:57:38Z)
- Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons (an illustrative sketch follows this list).
We experimentally validate our method on various benchmark datasets and network architectures.
arXiv Detail & Related papers (2021-09-05T15:15:07Z)
- One Model for All Quantization: A Quantized Network Supporting Hot-Swap Bit-Width Adjustment [36.75157407486302]
We propose a method to train a model for all quantization that supports diverse bit-widths.
We use wavelet decomposition and reconstruction to increase the diversity of weights.
Our method can achieve accuracy comparable to dedicated models trained at the same precision.
arXiv Detail & Related papers (2021-05-04T08:10:50Z)
- Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search [112.05977301976613]
We propose to combine Network Architecture Search methods with quantization to enjoy the merits of both sides.
We first propose the joint training of architecture and quantization with a shared step size to acquire a large number of quantized models.
Then a bit-inheritance scheme is introduced to transfer the quantized models to the lower bit, which further reduces the time cost and improves the quantization accuracy.
arXiv Detail & Related papers (2020-10-09T03:52:16Z)
- Conditional Generative Modeling via Learning the Latent Space [54.620761775441046]
We propose a novel framework for conditional generation in multimodal spaces.
It uses latent variables to model generalizable learning patterns.
At inference, the latent variables are optimized to find optimal solutions corresponding to multiple output modes.
arXiv Detail & Related papers (2020-10-07T03:11:34Z)
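The bit-drop idea summarized in the Cluster-Promoting Quantization entry above (randomly dropping bits rather than neurons) can be illustrated with a small sketch. The uniform fake-quantization, the per-pass bit-width sampling, and the drop probability below are assumptions made for illustration, not the exact formulation of the DropBits paper:

```python
# Illustrative sketch of a bit-drop regularizer in the spirit of DropBits:
# weights are fake-quantized, and on some training passes the effective
# bit-width is randomly reduced (analogous to dropout applied to bits).
import numpy as np

def fake_quantize(w, bits):
    """Uniform symmetric fake quantization to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(w).max()), 1e-8) / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def bit_drop(w, base_bits=8, drop_prob=0.3, rng=None, training=True):
    """With probability drop_prob, quantize this pass with fewer bits."""
    rng = rng or np.random.default_rng(0)
    if training and rng.random() < drop_prob:
        bits = int(rng.integers(2, base_bits))  # drop to a random lower bit-width
    else:
        bits = base_bits
    return fake_quantize(w, bits)

w = np.random.default_rng(1).standard_normal((64, 64))
w_q = bit_drop(w, base_bits=8, drop_prob=0.3, training=True)
```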