Integer Fine-tuning of Transformer-based Models
- URL: http://arxiv.org/abs/2209.09815v1
- Date: Tue, 20 Sep 2022 16:02:28 GMT
- Title: Integer Fine-tuning of Transformer-based Models
- Authors: Mohammadreza Tayaranian, Alireza Ghaffari, Marzieh S. Tahaei, Mehdi
Rezagholizadeh, Masoud Asgharian, Vahid Partovi Nia
- Abstract summary: We study the effect of various integer bit-widths to find the minimum required bit-width for integer fine-tuning of transformer-based models.
We show that 16-bit integer models match the floating-point baseline performance.
Further reducing the bit-width to 8 results in an average score drop of 1.7 points.
- Score: 13.383066080742699
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based models are used to achieve state-of-the-art
performance on various deep learning tasks. Since transformer-based models have
large numbers of parameters, fine-tuning them on downstream tasks is
computationally intensive and energy-hungry. Automatic mixed-precision
FP32/FP16 fine-tuning of such models has previously been used to lower the
compute resource requirements. However, with recent advances in low-bit integer
back-propagation, it is possible to further reduce the computation and memory
footprint. In this work, we explore a novel integer training method that uses
integer arithmetic for both forward propagation and gradient computation of the
linear, convolutional, layer-norm, and embedding layers in transformer-based
models. Furthermore, we study the effect of various integer bit-widths to find
the minimum required bit-width for integer fine-tuning of transformer-based
models. We fine-tune BERT and ViT models on popular downstream tasks using
integer layers. We show that 16-bit integer models match the floating-point
baseline performance. Reducing the bit-width to 10, we observe an average score
drop of 0.5 points. Finally, further reducing the bit-width to 8 results in an
average score drop of 1.7 points.
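For intuition, the following is a minimal PyTorch sketch of integer forward propagation and integer gradient computation for a single linear layer, in the spirit of the method described above. The symmetric per-tensor absmax quantizer, the 16-bit setting, and the names `quantize` and `IntLinear` are illustrative assumptions rather than the paper's actual implementation, which also covers convolutional, layer-norm, and embedding layers; the integer values are held in float tensors here to keep the example self-contained.

```python
import torch


def quantize(x: torch.Tensor, bits: int):
    """Symmetric per-tensor absmax quantization to signed `bits`-bit integers.

    Returns the integer values (held in a float tensor for simplicity) and
    the float scale needed to map them back to real values.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax, qmax)
    return q, scale


class IntLinear(torch.autograd.Function):
    """Linear layer whose forward and backward matmuls use integer operands."""

    @staticmethod
    def forward(ctx, x, weight, bits):
        qx, sx = quantize(x, bits)
        qw, sw = quantize(weight, bits)
        ctx.save_for_backward(qx, qw, sx, sw)
        ctx.bits = bits
        # Integer-by-integer matmul; the float scales are applied afterwards.
        return (qx @ qw.t()) * (sx * sw)

    @staticmethod
    def backward(ctx, grad_out):
        qx, qw, sx, sw = ctx.saved_tensors
        qg, sg = quantize(grad_out, ctx.bits)  # quantize the incoming gradient too
        grad_x = (qg @ qw) * (sg * sw)         # dL/dx from integer operands
        grad_w = (qg.t() @ qx) * (sg * sx)     # dL/dW from integer operands
        return grad_x, grad_w, None


if __name__ == "__main__":
    # Toy fine-tuning loop: one 16-bit-integer linear layer on random data.
    torch.manual_seed(0)
    weight = torch.randn(8, 16, requires_grad=True)
    opt = torch.optim.SGD([weight], lr=1e-2)
    for _ in range(10):
        x, target = torch.randn(4, 16), torch.randn(4, 8)
        loss = torch.nn.functional.mse_loss(IntLinear.apply(x, weight, 16), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Lowering the `bits` argument (e.g. to 10 or 8) mimics the bit-width sweep studied in the paper, although realizing actual compute savings would require true integer kernels rather than the simulated arithmetic above.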
Related papers
- Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs [39.410068572891475]
Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead.
Recent works have investigated adopting 8-bit floating-point formats (FP8) in the context of PTQ for model inference.
We present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model.
arXiv Detail & Related papers (2023-11-21T05:27:16Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for approximating matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- The case for 4-bit precision: k-bit Inference Scaling Laws [75.4335600212427]
Quantization methods reduce the number of bits required to represent each parameter in a model.
The final model size depends on both the number of parameters of the original model and the rate of compression.
We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision.
arXiv Detail & Related papers (2022-12-19T18:48:33Z)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
arXiv Detail & Related papers (2022-08-15T17:08:50Z)
- I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference [3.067607520161916]
Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision applications.
These models have considerable storage and computational overheads, making their deployment and efficient inference on edge devices challenging.
We propose I-ViT, an integer-only quantization scheme for ViTs, to enable ViTs to perform the entire computational graph of inference with integer arithmetic and bit-shifting.
arXiv Detail & Related papers (2022-07-04T13:37:38Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states (a sketch of the block-wise quantization idea appears after this list).
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
- I-BERT: Integer-only BERT Quantization [78.43819756382103]
We propose I-BERT, a novel quantization scheme for Transformer-based models.
I-BERT performs an end-to-end integer-only BERT inference without any floating point calculation.
We show that, in both cases, I-BERT achieves similar (and slightly higher) accuracy compared to the full-precision baseline.
arXiv Detail & Related papers (2021-01-05T02:42:58Z)
- Towards Fully 8-bit Integer Inference for the Transformer Model [39.22272841663168]
We show that, after a principled modification of the Transformer architecture, dubbed Integer Transformer, an (almost) fully 8-bit integer inference algorithm can be derived.
Our experiments on the WMT16 En->Ro, WMT14 En->De, and En->Fr translation tasks, as well as the WikiText-103 language modelling task, show that the fully 8-bit Transformer system achieves performance comparable to the floating-point baseline while requiring a nearly 4x smaller memory footprint.
arXiv Detail & Related papers (2020-09-17T03:09:10Z) - Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We replace the conventional ReLU with a Bounded ReLU after finding that the accuracy decline is due to activation quantization.
Our integer networks achieve performance equivalent to the corresponding floating-point networks, but with only 1/4 of the memory cost, and run 2x faster on modern GPUs.
arXiv Detail & Related papers (2020-06-21T08:23:03Z)
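As a companion to the 8-bit optimizer entry above, here is a toy sketch of block-wise quantization applied to an optimizer-state tensor. It uses plain per-block linear absmax scaling to int8 and an assumed block size of 2048; the paper's actual 8-bit data type is a non-linear (dynamic) quantization map, so this illustrates only the block-wise scaling idea, not the authors' method.

```python
import torch

BLOCK = 2048  # assumed block size; one scale value is stored per block


def blockwise_quantize(state: torch.Tensor, block: int = BLOCK):
    """Quantize a tensor to int8 with one absmax scale per block of elements."""
    flat = state.flatten()
    pad = (-flat.numel()) % block
    flat = torch.nn.functional.pad(flat, (0, pad))  # pad to a whole number of blocks
    blocks = flat.view(-1, block)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127
    q = torch.clamp(torch.round(blocks / scales), -127, 127).to(torch.int8)
    return q, scales, state.shape, pad


def blockwise_dequantize(q, scales, shape, pad):
    """Recover an approximation of the original tensor from int8 blocks."""
    flat = (q.float() * scales).flatten()
    if pad:
        flat = flat[:-pad]
    return flat.view(shape)


if __name__ == "__main__":
    exp_avg_sq = torch.rand(5000)               # e.g. Adam's squared-gradient state
    packed = blockwise_quantize(exp_avg_sq)
    restored = blockwise_dequantize(*packed)
    print((exp_avg_sq - restored).abs().max())  # small per-block rounding error
```

In an actual 8-bit optimizer, the state would be stored in this packed form between steps and dequantized on the fly for each parameter update.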