I-BERT: Integer-only BERT Quantization
- URL: http://arxiv.org/abs/2101.01321v2
- Date: Thu, 11 Feb 2021 09:11:11 GMT
- Title: I-BERT: Integer-only BERT Quantization
- Authors: Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer
- Abstract summary: We propose I-BERT, a novel quantization scheme for Transformer based models.
I-BERT performs an end-to-end integer-only BERT inference without any floating point calculation.
For both RoBERTa-Base and RoBERTa-Large, I-BERT achieves accuracy similar to (and slightly higher than) that of the full-precision baseline.
- Score: 78.43819756382103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer based models, like BERT and RoBERTa, have achieved
state-of-the-art results in many Natural Language Processing tasks. However,
their memory footprint, inference latency, and power consumption are
prohibitive for efficient inference at the edge, and even at the data center.
While quantization can be a viable solution for this, previous work on
quantizing Transformer based models use floating-point arithmetic during
inference, which cannot efficiently utilize integer-only logical units such as
the recent Turing Tensor Cores, or traditional integer-only ARM processors. In
this work, we propose I-BERT, a novel quantization scheme for Transformer based
models that quantizes the entire inference with integer-only arithmetic. Based
on lightweight integer-only approximation methods for nonlinear operations,
e.g., GELU, Softmax, and Layer Normalization, I-BERT performs an end-to-end
integer-only BERT inference without any floating point calculation. We evaluate
our approach on GLUE downstream tasks using RoBERTa-Base/Large. We show that
for both cases, I-BERT achieves similar (and slightly higher) accuracy as
compared to the full-precision baseline. Furthermore, our preliminary
implementation of I-BERT shows a speedup of 2.4 - 4.0x for INT8 inference on a
T4 GPU system as compared to FP32 inference. The framework has been developed
in PyTorch and has been open-sourced.
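The core idea is to replace each nonlinear operation with a polynomial that can be evaluated entirely on the quantized integer values, while the (statically known) scale factor is tracked as metadata. The sketch below illustrates this style of integer-only GELU approximation using a second-order polynomial fit of erf; the constants follow the fit described in the paper, but rounding and overflow handling are simplified and may differ from the released implementation.
```python
import numpy as np

def i_poly(q, scale, a, b, c):
    # Evaluate a*(x + b)^2 + c with integer-only arithmetic on q, where x = scale * q.
    # b and c are folded into integer constants offline; the new scale is metadata only.
    q_b = int(np.floor(b / scale))
    q_c = int(np.floor(c / (a * scale ** 2)))
    q_out = (q + q_b) ** 2 + q_c
    return q_out, a * scale ** 2

def i_erf(q, scale, a=-0.2888, b=-1.769):
    # Second-order polynomial approximation of erf(x), evaluated on integers.
    q_sign = np.sign(q)
    q_clip = np.minimum(np.abs(q), int(-b / scale))  # erf saturates, so clip |x| at -b
    q_l, scale_l = i_poly(q_clip, scale, a, b, 1.0)
    return q_sign * q_l, scale_l

def i_gelu(q, scale):
    # GELU(x) = x * 0.5 * (1 + erf(x / sqrt(2))), with all runtime ops on integers.
    q_erf, scale_erf = i_erf(q, scale / np.sqrt(2))
    q_one = int(np.floor(1.0 / scale_erf))
    return q * (q_erf + q_one), scale * scale_erf / 2

# Example: dequantizing the result approximates the true GELU values.
x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
scale = 0.01
q = np.round(x / scale).astype(np.int64)
q_out, s_out = i_gelu(q, scale)
print(q_out * s_out)  # approximately GELU(x)
```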
Related papers
- STAT: Shrinking Transformers After Training [72.0726371426711]
We present STAT, a simple algorithm to prune transformer models without any fine-tuning.
STAT eliminates both attention heads and neurons from the network, while preserving accuracy by calculating a correction to the weights of the next layer.
Our entire algorithm takes minutes to compress BERT, and less than three hours to compress models with 7B parameters using a single GPU.
arXiv Detail & Related papers (2024-05-29T22:59:11Z) - Integer Fine-tuning of Transformer-based Models [13.383066080742699]
We study the effect of various integer bit-widths to find the minimum required bit-width for integer fine-tuning of transformer-based models.
We show that 16-bit integer models match the floating-point baseline performance.
Reducing the bit-width further to 8 results in an average score drop of 1.7 points.
arXiv Detail & Related papers (2022-09-20T16:02:28Z) - I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference [3.067607520161916]
Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision applications.
These models have considerable storage and computational overheads, making their deployment and efficient inference on edge devices challenging.
We propose I-ViT, an integer-only quantization scheme that enables ViTs to run the entire inference computational graph with integer arithmetic and bit-shifting (a generic sketch of such shift-based requantization appears after this list).
arXiv Detail & Related papers (2022-07-04T13:37:38Z) - MKQ-BERT: Quantized BERT with 4-bits Weights and Activations [13.687982804234293]
We propose MKQ-BERT, which further improves the compression level by using 4 bits for quantization.
Ours is the first work to successfully deploy a 4-bit BERT and achieve an end-to-end inference speedup.
arXiv Detail & Related papers (2022-03-25T07:27:18Z) - Integer-arithmetic-only Certified Robustness for Quantized Neural Networks [14.737638416823772]
A line of work on tackling adversarial examples is certified robustness via randomized smoothing.
Such a mechanism usually uses floating-point arithmetic for calculations in inference.
We show our approach can obtain comparable accuracy and a 4x-5x speedup over floating-point-arithmetic certified robust methods.
arXiv Detail & Related papers (2021-08-21T01:15:19Z) - Towards Fully 8-bit Integer Inference for the Transformer Model [39.22272841663168]
We show that after a principled modification of the Transformer architecture, dubbed Integer Transformer, an (almost) fully 8-bit integer inference algorithm could be derived.
Our experiments on the WMT16 En->Ro, WMT14 En->De, and En->Fr translation tasks, as well as the WikiText-103 language modelling task, show that the fully 8-bit Transformer system achieves performance comparable to the floating-point baseline while requiring a nearly 4x smaller memory footprint.
arXiv Detail & Related papers (2020-09-17T03:09:10Z) - ConvBERT: Improving BERT with Span-based Dynamic Convolution [144.25748617961082]
BERT heavily relies on the global self-attention block and thus suffers from a large memory footprint and high computation cost.
We propose a novel span-based dynamic convolution to replace these self-attention heads to directly model local dependencies.
The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning.
arXiv Detail & Related papers (2020-08-06T07:43:19Z) - AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves performance comparable to, or even better than, the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z) - Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We find that the accuracy decline of integer-only networks is due to activation quantization, and replace the conventional ReLU with a Bounded ReLU.
Our integer networks achieve performance equivalent to the corresponding FPN networks, but have only 1/4 of the memory cost and run 2x faster on a modern GPU.
arXiv Detail & Related papers (2020-06-21T08:23:03Z) - The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" from the network's computation for simple instances.
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
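Several of the integer-only works above (I-ViT, the fully 8-bit Transformer, the integer-arithmetic-only CNNs) rely on the same basic primitive: rescaling an INT32 accumulator back to INT8 with an integer multiply followed by a bit-shift, so that no floating-point scale is needed at run time. The sketch below is a generic illustration of that idea rather than code from any of the papers; the multiplier/shift pair is assumed to be precomputed offline as a dyadic approximation of the real-valued scale.
```python
def dyadic_scale(real_scale: float, shift: int = 31) -> tuple[int, int]:
    # Offline step: find an integer multiplier M and shift s with real_scale ~= M / 2**s.
    # Floating point is allowed here because it runs once per layer, before deployment.
    return round(real_scale * (1 << shift)), shift

def requantize(acc: int, multiplier: int, shift: int) -> int:
    # Run-time path: integer multiply, rounding right bit-shift, clamp to the INT8 range.
    rounded = (acc * multiplier + (1 << (shift - 1))) >> shift
    return max(-128, min(127, rounded))

# Example: an INT32 accumulator of 9137 with an effective scale of 0.0123
M, s = dyadic_scale(0.0123)
print(requantize(9137, M, s))  # -> 112, i.e. round(9137 * 0.0123)
```
In a real kernel this rescaling is fused into the epilogue of the INT8 matrix multiply, which is how integer-only pipelines exploit hardware such as Turing Tensor Cores or integer ARM units.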