Vertical Layering of Quantized Neural Networks for Heterogeneous
Inference
- URL: http://arxiv.org/abs/2212.05326v1
- Date: Sat, 10 Dec 2022 15:57:38 GMT
- Title: Vertical Layering of Quantized Neural Networks for Heterogeneous
Inference
- Authors: Hai Wu, Ruifei He, Haoru Tan, Xiaojuan Qi and Kaibin Huang
- Abstract summary: We study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one.
We can, in theory, obtain a network of any precision for on-demand service while training and maintaining only one model.
- Score: 57.42762335081385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although considerable progress has been made in neural network
quantization for efficient inference, existing methods do not scale to
heterogeneous devices, as one dedicated model must be trained, transmitted,
and stored for each specific hardware setting, incurring substantial costs in
model training and maintenance. In this paper, we study a new vertical-layered
representation of neural network weights for encapsulating all quantized models
into a single one. With this representation, we can, in theory, obtain a network
of any precision for on-demand service while needing to train and maintain only
one model. To this end, we propose a simple once quantization-aware
training (QAT) scheme for obtaining high-performance vertical-layered models.
Our design incorporates a cascade downsampling mechanism which allows us to
obtain multiple quantized networks from one full precision source model by
progressively mapping the higher precision weights to their adjacent lower
precision counterparts. Then, with networks of different bit-widths from one
source model, multi-objective optimization is employed to train the shared
source model weights such that they can be updated simultaneously, considering
the performance of all networks. By doing this, the shared weights will be
optimized to balance the performance of different quantized models, thus making
the weights transferable across different bit-widths. Experiments show that the
proposed vertical-layered representation and the developed once QAT scheme are
effective in embodying multiple quantized networks in a single model, allow
one-time training, and deliver performance comparable to that of quantized
models tailored to any specific bit-width. Code will be available.
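As an illustration of the mechanism described above, the following minimal NumPy sketch derives each lower-precision network from its adjacent higher-precision neighbour (cascade downsampling) and sums the per-bit-width losses into one shared objective; the symmetric uniform quantizer, the 8/4/2-bit cascade, and the toy regression loss are assumptions for illustration, not the paper's exact formulation.

    import numpy as np

    def uniform_quantize(w, bits):
        # Symmetric uniform quantization of a weight tensor to the given bit-width.
        qmax = 2 ** (bits - 1) - 1
        scale = np.max(np.abs(w)) / qmax
        return np.clip(np.round(w / scale), -qmax, qmax) * scale

    def cascade_downsample(w_fp32, bit_widths=(8, 4, 2)):
        # Progressively map each higher-precision model to its adjacent
        # lower-precision counterpart (8-bit from FP32, 4-bit from 8-bit, ...).
        models, source = {}, w_fp32
        for bits in bit_widths:
            source = uniform_quantize(source, bits)
            models[bits] = source
        return models

    # Toy multi-objective step: one loss accumulated over all bit-widths, so an
    # update of the shared full-precision weights considers every quantized network.
    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 64))
    x, y = rng.normal(size=(16, 64)), rng.normal(size=(16, 64))
    total_loss = sum(np.mean((x @ q.T - y) ** 2) for q in cascade_downsample(w).values())
    print(f"combined loss over 8/4/2-bit networks: {total_loss:.3f}")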
Related papers
- Post-Training Quantization for Re-parameterization via Coarse & Fine Weight Splitting [13.270381125055275]
We propose a coarse & fine weight splitting (CFWS) method to reduce the quantization error of weights.
We develop an improved KL metric to determine optimal quantization scales for activations.
For example, the quantized RepVGG-A1 model exhibits a mere 0.3% accuracy loss.
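A minimal sketch of the general coarse-plus-fine splitting idea follows; the bit allocation, the residual-based split, and the plain uniform quantizer are illustrative assumptions rather than the CFWS algorithm or its KL-based scale search.

    import numpy as np

    def quantize(w, bits):
        # Plain symmetric uniform quantizer used for both parts of the split.
        qmax = 2 ** (bits - 1) - 1
        scale = np.max(np.abs(w)) / qmax
        return np.clip(np.round(w / scale), -qmax, qmax) * scale

    def coarse_fine_split(w, coarse_bits=4, fine_bits=4):
        # Represent w as a coarse low-bit part plus a finely quantized residual,
        # which reduces the overall error relative to a single quantization pass.
        coarse = quantize(w, coarse_bits)
        fine = quantize(w - coarse, fine_bits)
        return coarse, fine

    rng = np.random.default_rng(0)
    w = rng.normal(size=1000)
    coarse, fine = coarse_fine_split(w)
    print("single 4-bit error :", np.mean((w - quantize(w, 4)) ** 2))
    print("coarse + fine error:", np.mean((w - (coarse + fine)) ** 2))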
arXiv Detail & Related papers (2023-12-17T02:31:20Z)
- Probabilistic Weight Fixing: Large-scale training of neural network weight uncertainties for quantization [7.2282857478457805]
Weight-sharing quantization has emerged as a technique to reduce energy expenditure during inference in large neural networks.
This paper proposes a probabilistic framework based on Bayesian neural networks (BNNs) and a variational relaxation to identify which weights can be moved to which cluster centre.
Our method outperforms the state-of-the-art quantization method by 1.6% top-1 accuracy on ImageNet using DeiT-Tiny.
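A minimal deterministic sketch of weight-sharing quantization, assigning each weight to its nearest shared cluster centre; the quantile-based codebook below is a hypothetical stand-in, and the paper's Bayesian/variational treatment of which weights can safely be moved is not reproduced.

    import numpy as np

    def cluster_share_weights(w, centres):
        # Replace every weight by the nearest shared cluster centre, so a layer
        # only stores per-weight indices plus a small codebook of centres.
        idx = np.argmin(np.abs(w[:, None] - centres[None, :]), axis=1)
        return centres[idx], idx

    rng = np.random.default_rng(0)
    w = rng.normal(size=10_000)
    centres = np.quantile(w, np.linspace(0.05, 0.95, 8))  # 8 centres: 3-bit indices
    w_shared, idx = cluster_share_weights(w, centres)
    print("reconstruction MSE:", np.mean((w - w_shared) ** 2))
    print("distinct values   :", len(np.unique(w_shared)))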
arXiv Detail & Related papers (2023-09-24T08:04:28Z)
- EQ-Net: Elastic Quantization Neural Networks [15.289359357583079]
Elastic Quantization Neural Networks (EQ-Net) aims to train a robust weight-sharing quantization supernet.
We propose an elastic quantization space (covering elastic bit-width, granularity, and symmetry) to adapt to various mainstream quantization formats.
We incorporate genetic algorithms and the proposed Conditional Quantization-Aware Accuracy Predictor (CQAP) as an estimator to quickly search for mixed-precision quantized neural networks within the supernet.
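A minimal sketch of what an elastic quantization space can look like: one quantizer whose bit-width, granularity (per-tensor vs. per-channel), and symmetry are switched at call time. The specific options are assumptions for illustration; the genetic search and the CQAP estimator are not shown.

    import numpy as np

    def elastic_quantize(w, bits=8, per_channel=False, symmetric=True):
        # w: (out_channels, in_features). Granularity decides whether one scale
        # is shared by the whole tensor or computed per output channel.
        axis = 1 if per_channel else None
        if symmetric:
            qmax = 2 ** (bits - 1) - 1
            scale = np.max(np.abs(w), axis=axis, keepdims=True) / qmax
            return np.clip(np.round(w / scale), -qmax, qmax) * scale
        lo = np.min(w, axis=axis, keepdims=True)
        hi = np.max(w, axis=axis, keepdims=True)
        scale = (hi - lo) / (2 ** bits - 1)
        return np.clip(np.round((w - lo) / scale), 0, 2 ** bits - 1) * scale + lo

    rng = np.random.default_rng(0)
    w = rng.normal(size=(32, 128))
    for cfg in [(8, False, True), (4, True, True), (4, True, False)]:
        err = np.mean((w - elastic_quantize(w, *cfg)) ** 2)
        print(f"bits={cfg[0]} per_channel={cfg[1]} symmetric={cfg[2]} -> MSE {err:.2e}")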
arXiv Detail & Related papers (2023-08-15T08:57:03Z)
- BiTAT: Neural Network Binarization with Task-dependent Aggregated Transformation [116.26521375592759]
Quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation.
Extreme quantization (1-bit weight/1-bit activations) of compactly-designed backbone architectures results in severe performance degeneration.
This paper proposes a novel Quantization-Aware Training (QAT) method that can effectively alleviate performance degeneration.
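For reference, a minimal sketch of the standard scaled 1-bit weight binarization baseline that suffers the degeneration mentioned above; BiTAT's task-dependent aggregated transformation itself is not reproduced here.

    import numpy as np

    def binarize_weights(w):
        # Keep only sign(w) plus a single per-tensor scale alpha = mean(|w|),
        # the scale that minimizes the L2 error of the 1-bit approximation.
        alpha = np.mean(np.abs(w))
        return alpha * np.sign(w)

    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 64))
    w_bin = binarize_weights(w)
    print("values used      :", np.unique(w_bin))        # just {-alpha, +alpha}
    print("quantization MSE :", np.mean((w - w_bin) ** 2))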
arXiv Detail & Related papers (2022-07-04T13:25:49Z)
- Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons.
We experimentally validate our method on various benchmark datasets and network architectures.
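A minimal sketch of the bit-drop idea in the sense described above: during training, a weight is occasionally quantized with one bit fewer than its target precision. The drop probability, the per-weight granularity, and the uniform quantizer are illustrative assumptions, not the exact DropBits formulation.

    import numpy as np

    def quantize(w, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = np.max(np.abs(w)) / qmax
        return np.clip(np.round(w / scale), -qmax, qmax) * scale

    def drop_bits(w, target_bits=4, drop_prob=0.2, rng=None):
        # With probability drop_prob a weight is quantized one bit lower than
        # the target precision: dropout applied to bits instead of neurons.
        if rng is None:
            rng = np.random.default_rng()
        low, high = quantize(w, target_bits - 1), quantize(w, target_bits)
        mask = rng.random(w.shape) < drop_prob
        return np.where(mask, low, high)

    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 4))
    print(drop_bits(w, rng=rng).round(3))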
arXiv Detail & Related papers (2021-09-05T15:15:07Z)
- One Model for All Quantization: A Quantized Network Supporting Hot-Swap Bit-Width Adjustment [36.75157407486302]
We propose a method to train a model for all quantization that supports diverse bit-widths.
We use wavelet decomposition and reconstruction to increase the diversity of weights.
Our method can achieve accuracy comparable to dedicated models trained at the same precision.
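A minimal sketch of the wavelet building block named above, a one-level Haar decomposition with exact reconstruction of a weight vector; how the sub-bands are used to diversify weights across bit-widths is not shown here.

    import numpy as np

    def haar_decompose(w):
        # One-level Haar transform: low-pass (averages) and high-pass (details).
        s = np.sqrt(2.0)
        return (w[0::2] + w[1::2]) / s, (w[0::2] - w[1::2]) / s

    def haar_reconstruct(low, high):
        # Inverse transform: rebuild the even/odd samples and interleave them.
        s = np.sqrt(2.0)
        w = np.empty(2 * low.size)
        w[0::2] = (low + high) / s
        w[1::2] = (low - high) / s
        return w

    rng = np.random.default_rng(0)
    w = rng.normal(size=128)                     # even length assumed
    low, high = haar_decompose(w)
    print("perfect reconstruction:", np.allclose(w, haar_reconstruct(low, high)))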
arXiv Detail & Related papers (2021-05-04T08:10:50Z)
- Searching for Low-Bit Weights in Quantized Neural Networks [129.8319019563356]
Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators.
We propose to regard the discrete weights in an arbitrary quantized neural network as searchable variables, and we utilize a differentiable method to search for them accurately.
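A minimal sketch of one common relaxation for such a search: each quantized weight is a probability-weighted mixture of candidate discrete levels, so the choice becomes differentiable with respect to the logits. The candidate grid, temperature, and forward pass are illustrative assumptions, not the paper's exact procedure.

    import numpy as np

    def soft_discrete_weights(logits, levels, temperature=1.0):
        # logits: (num_weights, num_levels). A softmax over candidate discrete
        # levels gives a differentiable surrogate; with a low temperature each
        # weight effectively collapses onto a single searchable level.
        z = logits / temperature
        probs = np.exp(z - z.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        soft = probs @ levels                       # differentiable mixture
        hard = levels[np.argmax(logits, axis=1)]    # discrete weights at inference
        return soft, hard

    levels = np.array([-0.75, -0.25, 0.25, 0.75])   # 2-bit candidate grid
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(6, levels.size))
    soft, hard = soft_discrete_weights(logits, levels, temperature=0.3)
    print("soft:", soft.round(3))
    print("hard:", hard)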
arXiv Detail & Related papers (2020-09-18T09:13:26Z)
- FracBits: Mixed Precision Quantization via Fractional Bit-Widths [29.72454879490227]
Mixed-precision quantization is favorable on customized hardware that supports arithmetic operations at multiple bit-widths.
We propose a novel learning-based algorithm to derive mixed precision models end-to-end under target computation constraints.
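A minimal sketch of the fractional bit-width idea: a non-integer bit-width b is realized as a linear interpolation between the quantizations at floor(b) and ceil(b), which keeps b differentiable so it can be learned under a computation constraint. The uniform quantizer and the toy sweep are assumptions for illustration.

    import numpy as np

    def quantize(w, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = np.max(np.abs(w)) / qmax
        return np.clip(np.round(w / scale), -qmax, qmax) * scale

    def fractional_quantize(w, bits):
        # Blend the two neighbouring integer bit-widths so the (learnable)
        # fractional bit-width enters the forward pass differentiably.
        lo, hi = int(np.floor(bits)), int(np.ceil(bits))
        if lo == hi:
            return quantize(w, lo)
        frac = bits - lo
        return (1.0 - frac) * quantize(w, lo) + frac * quantize(w, hi)

    rng = np.random.default_rng(0)
    w = rng.normal(size=4096)
    for b in (3.0, 3.25, 3.5, 3.75, 4.0):
        print(f"bits={b:.2f}  MSE={np.mean((w - fractional_quantize(w, b)) ** 2):.2e}")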
arXiv Detail & Related papers (2020-07-04T06:09:09Z)
- Gradient $\ell_1$ Regularization for Quantization Robustness [70.39776106458858]
We derive a simple regularization scheme that improves robustness against post-training quantization.
By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on-demand to different bit-widths.
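A minimal numerical sketch of why the gradient's $\ell_1$ norm is the relevant quantity: to first order, the loss change caused by a quantization perturbation of the weights is at most the $\ell_1$ norm of the gradient times the largest per-weight perturbation, so penalizing that norm flattens the loss against on-demand bit-width changes. The toy least-squares loss and its analytic gradient below are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 20))
    y = rng.normal(size=50)
    w = rng.normal(size=20)

    def loss(w):
        return 0.5 * np.mean((A @ w - y) ** 2)

    def grad(w):
        return A.T @ (A @ w - y) / len(y)        # analytic gradient of the toy loss

    # First-order robustness bound: |loss(w + d) - loss(w)| <~ ||grad||_1 * max|d|.
    delta = rng.uniform(-0.01, 0.01, size=20)    # stand-in for quantization noise
    actual = abs(loss(w + delta) - loss(w))
    bound = np.sum(np.abs(grad(w))) * np.max(np.abs(delta))
    print(f"actual change {actual:.4e}  vs  l1-gradient bound {bound:.4e}")
    # A regularized objective a trainer could minimize (lambda picked arbitrarily):
    print("loss + 0.05 * ||grad||_1 =", loss(w) + 0.05 * np.sum(np.abs(grad(w))))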
arXiv Detail & Related papers (2020-02-18T12:31:34Z)
- Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
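A minimal sketch of layer-wise fusion with neuron alignment, using a Hungarian assignment as a simple stand-in for the optimal-transport coupling; the dot-product similarity and the single-layer setting are illustrative assumptions.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def fuse_layer(w_a, w_b):
        # w_a, w_b: (out_neurons, in_features) weights of the same layer from two
        # separately trained models. Neurons are defined only up to permutation,
        # so match w_b's rows to w_a's before averaging. (In a full network the
        # next layer's input columns must be permuted accordingly.)
        cost = -w_a @ w_b.T                       # negative similarity as cost
        _, col = linear_sum_assignment(cost)      # Hungarian matching
        return 0.5 * (w_a + w_b[col])

    rng = np.random.default_rng(0)
    w_a = rng.normal(size=(16, 32))
    w_b = w_a[rng.permutation(16)] + 0.05 * rng.normal(size=(16, 32))  # shuffled twin
    print("naive average error  :", np.linalg.norm(w_a - 0.5 * (w_a + w_b)))
    print("aligned fusion error :", np.linalg.norm(w_a - fuse_layer(w_a, w_b)))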
arXiv Detail & Related papers (2019-10-12T22:07:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.