Training Transformers with 4-bit Integers
- URL: http://arxiv.org/abs/2306.11987v2
- Date: Thu, 22 Jun 2023 20:09:02 GMT
- Title: Training Transformers with 4-bit Integers
- Authors: Haocheng Xi, Changhao Li, Jianfei Chen, and Jun Zhu
- Abstract summary: Quantizing activations, weights, and gradients to 4 bits is a promising way to accelerate neural network training.
Existing 4-bit training methods require custom numerical formats that are not supported by contemporary hardware.
In this work, we propose a training method for transformers in which all matrix multiplications are implemented with INT4 arithmetic.
- Score: 21.861232105539933
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Quantizing the activation, weight, and gradient to 4-bit is promising to
accelerate neural network training. However, existing 4-bit training methods
require custom numerical formats which are not supported by contemporary
hardware. In this work, we propose a training method for transformers with all
matrix multiplications implemented with INT4 arithmetic. Training with such an
ultra-low precision is challenging. To achieve this, we carefully analyze
the specific structures of activation and gradients in transformers to propose
dedicated quantizers for them. For forward propagation, we identify the
challenge of outliers and propose a Hadamard quantizer to suppress the
outliers. For backpropagation, we leverage the structural sparsity of gradients
by proposing bit splitting and leverage score sampling techniques to quantize
gradients accurately. Our algorithm achieves competitive accuracy on a wide
range of tasks including natural language understanding, machine translation,
and image classification. Unlike previous 4-bit training methods, our algorithm
can be implemented on the current generation of GPUs. Our prototypical linear
operator implementation is up to 2.2 times faster than the FP16 counterparts
and speeds up the training by up to 35.1%.
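The abstract names two quantizer ideas: a Hadamard quantizer that suppresses activation outliers in the forward pass, and leverage score sampling for quantizing gradients in the backward pass. The following is a minimal NumPy sketch of both ideas, assuming a blockwise orthonormal Hadamard rotation before symmetric INT4 rounding and squared row norms as a surrogate for leverage scores; block size, scaling, and the sampling proxy are illustrative assumptions, and the paper's actual method additionally involves bit splitting and fused INT4 kernels, which are not shown.

```python
# Illustrative sketch only: block size, scaling, and the row-norm surrogate for
# leverage scores are assumptions, not the paper's exact algorithm.
import numpy as np
from scipy.linalg import hadamard


def hadamard_int4_quantize(x, block=64):
    """Rotate each length-`block` slice of x with an orthonormal Hadamard matrix,
    then apply symmetric 4-bit quantization (codes in [-8, 7]) per block."""
    n, d = x.shape
    assert d % block == 0, "feature dimension must be a multiple of the block size"
    H = hadamard(block).astype(np.float32) / np.sqrt(block)  # orthonormal: H @ H.T = I
    xr = x.reshape(n, d // block, block) @ H                 # rotation spreads outliers across the block
    scale = np.abs(xr).max(axis=-1, keepdims=True) / 7.0 + 1e-8
    codes = np.clip(np.round(xr / scale), -8, 7).astype(np.int8)
    return codes, scale, H


def hadamard_int4_dequantize(codes, scale, H):
    """Undo the INT4 quantization and the Hadamard rotation."""
    n, blocks, block = codes.shape
    xr = codes.astype(np.float32) * scale
    return (xr @ H.T).reshape(n, blocks * block)


def row_norm_sample(g, k, rng):
    """Sample k rows with probability proportional to their squared norm (a common
    surrogate for leverage scores). Scaling both sampled operands by `weights`
    keeps the estimated matrix product A[idx].T @ B[idx] unbiased."""
    p = (g ** 2).sum(axis=1)
    p = p / p.sum()
    idx = rng.choice(len(g), size=k, replace=True, p=p)
    weights = 1.0 / np.sqrt(k * p[idx])  # splits the 1/(k * p_i) factor over both operands
    return idx, weights


# Usage: the Hadamard rotation keeps the per-block quantization step small even
# when one entry is a large outlier.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 128)).astype(np.float32)
x[0, 3] = 50.0  # inject an outlier
codes, scale, H = hadamard_int4_quantize(x)
err = np.abs(hadamard_int4_dequantize(codes, scale, H) - x).max()
print(f"max reconstruction error: {err:.3f}")

idx, w = row_norm_sample(rng.standard_normal((1024, 64)).astype(np.float32), k=128, rng=rng)
```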
Related papers
- Gradient-Free Neural Network Training on the Edge [12.472204825917629]
Training neural networks is computationally heavy and energy-intensive.
This work presents a novel technique for training neural networks without needing gradients.
We show that it is possible to train models without gradient-based optimization techniques by identifying each neuron's erroneous contribution to the expected classification.
arXiv Detail & Related papers (2024-10-13T05:38:39Z)
- Accelerating Transformer Pre-training with 2:4 Sparsity [19.64391647966267]
NVIDIA Ampere GPUs can execute a fine-grained 2:4 sparse matrix multiplication twice as fast as its dense equivalent.
We propose three techniques to preserve accuracy: modifying the sparse-refined straight-through estimator, determining a feasible decay factor in the warm-up stage, and enhancing the model's quality.
Our algorithm achieves convergence similar to dense training on several transformer pre-training tasks, while actual acceleration is observed across different transformer block shapes (a minimal sketch of the 2:4 pattern appears after this list).
arXiv Detail & Related papers (2024-04-02T11:12:42Z)
- Hadamard Domain Training with Integers for Class Incremental Quantized Learning [1.4416751609100908]
Continual learning can be cost-prohibitive for resource-constrained edge platforms.
We propose a technique that uses Hadamard-domain transforms to enable low-precision training with only integer matrix multiplications.
We achieve less than 0.5% and 3% accuracy degradation while quantizing all matrix multiplication inputs down to 4 bits with 8-bit accumulators.
arXiv Detail & Related papers (2023-10-05T16:52:59Z)
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
- Accuracy Booster: Enabling 4-bit Fixed-point Arithmetic for DNN Training [31.515532976570643]
We show that single-level scaling is sufficient to maintain training accuracy while maximizing arithmetic density.
We propose Accuracy Booster, a mixed-mantissa HBFP technique that uses 4-bit mantissas for over 99% of all arithmetic operations in training.
arXiv Detail & Related papers (2022-11-19T16:17:11Z)
- Quantized Training of Gradient Boosting Decision Trees [84.97123593657584]
We propose to quantize all the high-precision gradients in the GBDT training algorithm in a very simple yet effective way.
With low-precision gradients, most arithmetic operations in GBDT training can be replaced by integer operations of 8, 16, or 32 bits.
We observe up to a 2× speedup from our simple quantization strategy compared with SOTA GBDT systems on extensive datasets.
arXiv Detail & Related papers (2022-07-20T06:27:06Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
- FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training [81.85361544720885]
We propose FracTrain, which integrates progressive fractional quantization to gradually increase the precision of activations, weights, and gradients.
FracTrain reduces the computational cost and hardware-quantified energy/latency of DNN training while achieving comparable or better accuracy (-0.12% to +1.87%).
arXiv Detail & Related papers (2020-12-24T05:24:10Z)
- Multi-Precision Policy Enforced Training (MuPPET): A precision-switching strategy for quantised fixed-point training of CNNs [13.83645579871775]
Large-scale convolutional neural networks (CNNs) suffer from very long training times, spanning from hours to weeks.
This work pushes the boundary of quantised training by employing a multilevel approach that utilises multiple precisions.
MuPPET achieves the same accuracy as standard full-precision training with a training-time speedup of up to 1.84× and an average speedup of 1.58× across the networks.
arXiv Detail & Related papers (2020-06-16T10:14:36Z)
- Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality Regularization and Singular Value Sparsification [53.50708351813565]
We propose SVD training, the first method to explicitly achieve low-rank DNNs during training without applying SVD on every step.
We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve a greater reduction in computation load at the same accuracy.
arXiv Detail & Related papers (2020-04-20T02:40:43Z)
- Towards Unified INT8 Training for Convolutional Neural Network [83.15673050981624]
We build a unified 8-bit (INT8) training framework for common convolutional neural networks.
First, we empirically identify four distinctive characteristics of gradients, which provide insightful clues for gradient quantization.
We propose two universal techniques, including Direction Sensitive Gradient Clipping, which reduces the directional deviation of gradients.
arXiv Detail & Related papers (2019-12-29T08:37:53Z)
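As referenced in the "Accelerating Transformer Pre-training with 2:4 Sparsity" entry above, the sketch below illustrates what fine-grained 2:4 sparsity means: within every contiguous group of four weights, only the two largest-magnitude entries are kept, which is the pattern NVIDIA Ampere sparse tensor cores accelerate. This is a generic illustration under my own assumptions, not that paper's accuracy-preserving techniques.

```python
# Minimal 2:4 fine-grained sparsity sketch: keep the 2 largest-magnitude weights
# in every group of 4 along the last axis (illustration only).
import numpy as np


def prune_2_of_4(w):
    """Zero out the two smallest-magnitude entries in each group of four weights."""
    rows, cols = w.shape
    assert cols % 4 == 0, "last dimension must be a multiple of 4"
    groups = w.reshape(rows, cols // 4, 4)
    order = np.argsort(np.abs(groups), axis=-1)              # ascending by magnitude
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, order[..., 2:], True, axis=-1)   # keep the top 2 per group
    return (groups * mask).reshape(rows, cols), mask.reshape(rows, cols)


w = np.random.default_rng(0).standard_normal((2, 8)).astype(np.float32)
w_sparse, mask = prune_2_of_4(w)
print(mask.reshape(2, 2, 4).sum(axis=-1))  # each group of 4 keeps exactly 2 entries
```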
This list is automatically generated from the titles and abstracts of the papers in this site.