BMPQ: Bit-Gradient Sensitivity Driven Mixed-Precision Quantization of
DNNs from Scratch
- URL: http://arxiv.org/abs/2112.13843v1
- Date: Fri, 24 Dec 2021 03:16:58 GMT
- Title: BMPQ: Bit-Gradient Sensitivity Driven Mixed-Precision Quantization of
DNNs from Scratch
- Authors: Souvik Kundu, Shikai Wang, Qirui Sun, Peter A. Beerel, Massoud Pedram
- Abstract summary: This paper presents BMPQ, a training method that uses bit gradients to analyze layer sensitivities and yield mixed-precision quantized models.
It requires a single training iteration but does not need a pre-trained baseline.
Compared to the baseline FP-32 models, BMPQ can yield models that have 15.4x fewer parameter bits with a negligible drop in accuracy.
- Score: 11.32458063021286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large DNNs with mixed-precision quantization can achieve ultra-high
compression while retaining high classification performance. However, because
of the challenges in finding an accurate metric that can guide the optimization
process, these methods either sacrifice significant performance compared to the
32-bit floating-point (FP-32) baseline or rely on a compute-expensive,
iterative training policy that requires the availability of a pre-trained
baseline. To address this issue, this paper presents BMPQ, a training method
that uses bit gradients to analyze layer sensitivities and yield
mixed-precision quantized models. BMPQ requires a single training iteration but
does not need a pre-trained baseline. It uses an integer linear program (ILP)
to dynamically adjust the precision of layers during training, subject to a
fixed hardware budget. To evaluate the efficacy of BMPQ, we conduct extensive
experiments with VGG16 and ResNet18 on CIFAR-10, CIFAR-100, and Tiny-ImageNet
datasets. Compared to the baseline FP-32 models, BMPQ can yield models that
have 15.4x fewer parameter bits with a negligible drop in accuracy. Compared to
the SOTA "during-training" mixed-precision training scheme, our models are
2.1x, 2.2x, and 2.9x smaller on CIFAR-10, CIFAR-100, and Tiny-ImageNet,
respectively, while improving accuracy by up to 14.54%.
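The two ingredients described above, layer sensitivities taken from bit gradients and an ILP that assigns per-layer bit-widths under a hardware budget, can be illustrated with a minimal sketch. The code below is not the authors' implementation: the STE-based fake quantizer, the use of the magnitude of the loss gradient with respect to each layer's continuous bit-width as the sensitivity proxy, the ILP objective, and the toy two-layer network with its candidate bit-widths and budget are all illustrative assumptions. It uses PyTorch and the PuLP ILP solver.

```python
# Minimal sketch, NOT the authors' code: bit-gradient sensitivities + ILP bit allocation.
# Assumptions: PyTorch and PuLP installed; toy 2-layer model; |dL/d(bits)| as sensitivity proxy.
import torch
import pulp

def fake_quantize(w, bits):
    """Symmetric fake quantization whose scale depends differentiably on `bits`.
    Rounding is bypassed with a straight-through estimator so the loss gradient
    can flow back to both the weights and the continuous bit-width."""
    qmax = 2.0 ** (bits - 1.0) - 1.0
    scale = w.detach().abs().max() / qmax
    ws = w / scale
    ws = ws + (torch.round(ws) - ws).detach()        # STE: identity gradient through round()
    return ws * scale

# Toy two-layer network and batch, used only to show where the bit gradients come from.
torch.manual_seed(0)
W1 = torch.randn(16, 8, requires_grad=True)
W2 = torch.randn(4, 16, requires_grad=True)
xb, yb = torch.randn(32, 8), torch.randint(0, 4, (32,))
bits = [torch.tensor(4.0, requires_grad=True) for _ in range(2)]  # one continuous bit-width per layer

h = torch.relu(xb @ fake_quantize(W1, bits[0]).t())
loss = torch.nn.functional.cross_entropy(h @ fake_quantize(W2, bits[1]).t(), yb)
loss.backward()

sens = [b.grad.abs().item() for b in bits]           # sensitivity proxy: |dL/d(bits)| per layer
layer_params = [W1.numel(), W2.numel()]

# ILP: pick one bit-width per layer, maximizing sensitivity-weighted bits under a bit budget.
candidates = [2, 4, 8]                                # example bit-width choices
budget = sum(p * 8 for p in layer_params) // 2        # e.g. half the size of an 8-bit model

prob = pulp.LpProblem("bit_allocation", pulp.LpMaximize)
x = {(l, b): pulp.LpVariable(f"x_{l}_{b}", cat="Binary")
     for l in range(2) for b in candidates}
prob += pulp.lpSum(sens[l] * b * x[l, b] for l in range(2) for b in candidates)
for l in range(2):
    prob += pulp.lpSum(x[l, b] for b in candidates) == 1          # exactly one bit-width per layer
prob += pulp.lpSum(layer_params[l] * b * x[l, b]
                   for l in range(2) for b in candidates) <= budget
prob.solve(pulp.PULP_CBC_CMD(msg=False))

print({l: next(b for b in candidates if x[l, b].varValue > 0.5) for l in range(2)})
```

In a real training loop one would accumulate the bit gradients over many mini-batches and re-solve the small ILP only occasionally (for example once per epoch), so the allocation overhead stays negligible compared to training itself.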
Related papers
- Joint Pruning and Channel-wise Mixed-Precision Quantization for Efficient Deep Neural Networks [10.229120811024162]
Deep neural networks (DNNs) pose significant challenges to their deployment on edge devices.
Common approaches to address this issue are pruning and mixed-precision quantization.
We propose a novel methodology to apply them jointly via a lightweight gradient-based search.
arXiv Detail & Related papers (2024-07-01T08:07:02Z)
- FP8-LM: Training FP8 Large Language Models [47.17804713425323]
In this paper, we propose a new FP8 automatic mixed-precision framework for training large language models.
Experiment results show that, during the training of GPT-175B model on H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework.
arXiv Detail & Related papers (2023-10-27T17:59:51Z)
- Efficient and Robust Quantization-aware Training via Adaptive Coreset Selection [38.23587031169402]
Quantization-aware training (QAT) is a representative model compression method to reduce redundancy in weights and activations.
Most existing QAT methods require end-to-end training on the entire dataset.
We propose two metrics based on analysis of loss and gradient of quantized weights to quantify the importance of each sample during training.
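A rough, hypothetical illustration of loss- and gradient-based sample importance follows; the paper's actual two metrics are not reproduced here, and the toy linear model, the gradient-norm shortcut, and the fixed coreset size are assumptions.

```python
# Hypothetical sketch of loss/gradient-based sample importance; not the paper's exact metrics.
import torch

model = torch.nn.Linear(8, 4)                        # stand-in for a quantized network
xb, yb = torch.randn(64, 8), torch.randint(0, 4, (64,))

with torch.no_grad():                                # only scores are needed, no parameter updates
    logits = model(xb)
    per_sample_loss = torch.nn.functional.cross_entropy(logits, yb, reduction="none")
    # Per-sample gradient-norm proxy for a linear layer: since dL/dW = outer(dL/dlogits, x),
    # its Frobenius norm is ||dL/dlogits|| * ||x||, avoiding a backward pass per sample.
    dlogits = torch.softmax(logits, dim=1)
    dlogits[torch.arange(64), yb] -= 1.0             # dL/dlogits for cross-entropy
    grad_norm = dlogits.norm(dim=1) * xb.norm(dim=1)

importance = per_sample_loss + grad_norm             # illustrative equal weighting of the two scores
coreset_idx = importance.topk(k=32).indices          # keep the most "important" half of the batch
```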
arXiv Detail & Related papers (2023-06-12T16:20:36Z)
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference; a toy worst-case overflow check is sketched below.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
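The guarantee itself comes from how the network is trained, which is not reproduced here; the sketch below only shows, under assumed signed 8-bit inputs and a chosen accumulator width, how a worst-case bound over one output channel's quantized weights can certify that its dot product cannot overflow.

```python
# Illustrative worst-case overflow check only; not the paper's training-time constraint.
import torch

def accumulator_is_safe(w_int: torch.Tensor, input_bits: int = 8, acc_bits: int = 32) -> bool:
    """True if the dot product of integer weights `w_int` with any signed `input_bits`
    activation vector fits in a signed `acc_bits` accumulator (all inputs at max magnitude)."""
    max_abs_input = 2 ** (input_bits - 1)            # e.g. 128 for signed int8
    worst_case = int(w_int.abs().sum().item()) * max_abs_input
    return worst_case <= 2 ** (acc_bits - 1) - 1

w_int = torch.randint(-8, 8, (4096,))                # e.g. one output channel of 4-bit weights
print(accumulator_is_safe(w_int, acc_bits=32))       # True: fits comfortably in int32
print(accumulator_is_safe(w_int, acc_bits=16))       # False: an int16 accumulator could overflow
```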
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
- The case for 4-bit precision: k-bit Inference Scaling Laws [75.4335600212427]
Quantization methods reduce the number of bits required to represent each parameter in a model.
The final model size depends on both the number of parameters of the original model and the rate of compression.
We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision.
arXiv Detail & Related papers (2022-12-19T18:48:33Z)
- Activation Density based Mixed-Precision Quantization for Energy Efficient Neural Networks [2.666640112616559]
We propose an in-training quantization method for neural network models.
Our method calculates a bit-width for each layer while training a mixed-precision model with competitive accuracy; a toy density-to-bit-width heuristic is sketched below.
We run experiments on benchmark datasets such as CIFAR-10, CIFAR-100, and Tiny-ImageNet with VGG19/ResNet18 architectures.
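Reading the entry loosely, a layer's activation density is the fraction of non-zero (post-ReLU) activations, and sparser layers can be given fewer bits. The mapping below is a made-up thresholding rule, not the paper's actual policy.

```python
# Hypothetical density-to-bit-width heuristic; the paper's actual rule may differ.
import torch

def activation_density(acts: torch.Tensor) -> float:
    """Fraction of non-zero activations, e.g. measured after a ReLU during training."""
    return (acts != 0).float().mean().item()

def density_to_bits(density: float) -> int:
    """Made-up mapping: layers with fewer active units get fewer bits."""
    if density < 0.25:
        return 2
    if density < 0.5:
        return 4
    return 8

acts = torch.relu(torch.randn(256, 128))             # toy post-ReLU activations of one layer
print(density_to_bits(activation_density(acts)))     # density is close to 0.5 here, so 4 or 8 bits
```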
arXiv Detail & Related papers (2021-01-12T09:01:44Z)
- Revisiting BFloat16 Training [30.99618783594963]
State-of-the-art generic low-precision training algorithms use a mix of 16-bit and 32-bit precision.
Deep learning accelerators are forced to support both 16-bit and 32-bit floating-point units.
arXiv Detail & Related papers (2020-10-13T05:38:07Z)
- Search What You Want: Barrier Penalty NAS for Mixed Precision Quantization [51.26579110596767]
We propose a novel Barrier Penalty based NAS (BP-NAS) for mixed precision quantization.
BP-NAS sets a new state of the art on both classification (CIFAR-10, ImageNet) and detection (COCO).
arXiv Detail & Related papers (2020-07-20T12:00:48Z)
- Multi-Precision Policy Enforced Training (MuPPET): A precision-switching strategy for quantised fixed-point training of CNNs [13.83645579871775]
Large-scale convolutional neural networks (CNNs) suffer from very long training times, spanning from hours to weeks.
This work pushes the boundary of quantised training by employing a multilevel approach that utilises multiple precisions.
MuPPET achieves the same accuracy as standard full-precision training with training-time speedup of up to 1.84x and an average speedup of 1.58x across the networks.
arXiv Detail & Related papers (2020-06-16T10:14:36Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator (a minimal sketch follows below).
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
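A minimal sketch of quantization-aware training with a straight-through estimator, as mentioned above; the uniform symmetric 8-bit quantizer and the toy linear layer are assumptions, and this is not the Quant-Noise method itself.

```python
# Minimal STE-based QAT step; illustrative only, not the Quant-Noise implementation.
import torch

def ste_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Uniform symmetric fake quantization; round/clamp are bypassed in the backward pass."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (w_q - w).detach()                    # forward: quantized, backward: identity (STE)

layer = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
xb, yb = torch.randn(32, 16), torch.randint(0, 4, (32,))

logits = torch.nn.functional.linear(xb, ste_quantize(layer.weight), layer.bias)
loss = torch.nn.functional.cross_entropy(logits, yb)
opt.zero_grad(); loss.backward(); opt.step()         # full-precision weights receive the STE gradients
```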
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
- ScopeFlow: Dynamic Scene Scoping for Optical Flow [94.42139459221784]
We propose to modify the common training protocols of optical flow.
The improvement is based on observing the bias in sampling challenging data.
We find that both regularization and augmentation should decrease during the training protocol.
arXiv Detail & Related papers (2020-02-25T09:58:49Z)