Understanding and Overcoming the Challenges of Efficient Transformer Quantization
- URL: http://arxiv.org/abs/2109.12948v1
- Date: Mon, 27 Sep 2021 10:57:18 GMT
- Title: Understanding and Overcoming the Challenges of Efficient Transformer Quantization
- Authors: Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort
- Abstract summary: Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks.
However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices.
We show that transformers have unique quantization challenges -- namely, high dynamic activation ranges that are difficult to represent with a low bit fixed-point format.
- Score: 17.05322956052278
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based architectures have become the de-facto standard models for
a wide range of Natural Language Processing tasks. However, their memory
footprint and high latency are prohibitive for efficient deployment and
inference on resource-limited devices. In this work, we explore quantization
for transformers. We show that transformers have unique quantization challenges
-- namely, high dynamic activation ranges that are difficult to represent with
a low bit fixed-point format. We establish that these activations contain
structured outliers in the residual connections that encourage specific
attention patterns, such as attending to the special separator token. To combat
these challenges, we present three solutions based on post-training
quantization and quantization-aware training, each with a different set of
compromises for accuracy, model size, and ease of use. In particular, we
introduce a novel quantization scheme -- per-embedding-group quantization. We
demonstrate the effectiveness of our methods on the GLUE benchmark using BERT,
establishing state-of-the-art results for post-training quantization. Finally,
we show that transformer weights and embeddings can be quantized to ultra-low
bit-widths, leading to significant memory savings with minimal accuracy loss.
Our source code is available at https://github.com/qualcomm-ai-research/transformer-quantization.
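The challenge and remedy described in the abstract can be illustrated with a small sketch. The NumPy snippet below is a minimal illustration, not the authors' released code: a few outlier embedding dimensions inflate the single per-tensor quantization range, while splitting the embedding dimensions into groups that each get their own scale and zero-point (a hypothetical `quantize_per_embedding_group` helper) keeps the reconstruction error low. The function names, group count, and toy data are illustrative assumptions.

```python
import numpy as np

def range_to_params(x_min, x_max, num_bits=8):
    """Derive a scale and zero-point from an observed activation range."""
    scale = (x_max - x_min) / (2 ** num_bits - 1)
    zero_point = np.round(-x_min / scale)
    return scale, zero_point

def fake_quantize(x, scale, zero_point, num_bits=8):
    """Uniform asymmetric quantization followed by dequantization."""
    q = np.clip(np.round(x / scale + zero_point), 0, 2 ** num_bits - 1)
    return (q - zero_point) * scale

def quantize_per_tensor(x, num_bits=8):
    """One scale/zero-point for the whole tensor."""
    scale, zp = range_to_params(x.min(), x.max(), num_bits)
    return fake_quantize(x, scale, zp, num_bits)

def quantize_per_embedding_group(x, num_groups, num_bits=8):
    """Give each group of embedding dimensions its own scale and zero-point."""
    out = np.empty_like(x)
    for idx in np.array_split(np.arange(x.shape[-1]), num_groups):
        scale, zp = range_to_params(x[..., idx].min(), x[..., idx].max(), num_bits)
        out[..., idx] = fake_quantize(x[..., idx], scale, zp, num_bits)
    return out

# Toy activations: most embedding dimensions are well-behaved, while a few
# carry large outliers, mimicking the structured outliers described above.
rng = np.random.default_rng(0)
acts = rng.normal(size=(32, 768))
acts[:, :8] *= 50.0  # a handful of outlier dimensions inflate the per-tensor range

err_tensor = np.abs(acts - quantize_per_tensor(acts)).mean()
err_group = np.abs(acts - quantize_per_embedding_group(acts, num_groups=16)).mean()
print(f"per-tensor 8-bit MAE:          {err_tensor:.4f}")
print(f"per-embedding-group 8-bit MAE: {err_group:.4f}")
```

On this toy data the grouped scheme yields a substantially lower error, because only the group that contains the outlier dimensions pays for the wide quantization range.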
Related papers
- Resource-aware Mixed-precision Quantization for Enhancing Deployability of Transformers for Time-series Forecasting on Embedded FPGAs [19.835810073852244]
This study addresses the deployment challenges of integer-only quantized Transformers on resource-constrained embedded FPGAs.
We introduce a selectable resource type for storing intermediate results across model layers, thereby breaking the deployment bottleneck.
We also develop a resource-aware mixed-precision quantization approach that enables researchers to explore hardware-level quantization strategies.
arXiv Detail & Related papers (2024-10-04T10:12:24Z)
- AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevailing fundamental backbone networks in the computer vision community.
We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm (AdaLog) quantizer.
arXiv Detail & Related papers (2024-07-17T18:38:48Z)
- ADFQ-ViT: Activation-Distribution-Friendly Post-Training Quantization for Vision Transformers [7.155242379236052]
Quantization of Vision Transformers (ViTs) has emerged as a promising solution to mitigate these challenges.
Existing methods still suffer from significant accuracy loss at low bit-widths.
ADFQ-ViT provides significant improvements over various baselines in image classification, object detection, and instance segmentation tasks at 4-bit precision.
arXiv Detail & Related papers (2024-07-03T02:41:59Z)
- RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale Reparameterization [8.827794405944637]
Post-training quantization (PTQ) is a promising solution for compressing large transformer models.
Existing PTQ methods typically exhibit non-trivial performance loss.
We propose RepQuant, a novel PTQ framework with a quantization-inference decoupling paradigm.
arXiv Detail & Related papers (2024-02-08T12:35:41Z)
- Quantization Variation: A New Perspective on Training Transformers with Low-Bit Precision [45.69716658698776]
In this paper, we attribute the difficulty of low-bit quantization-aware training for transformers to their unique variation behaviors.
We propose a variation-aware quantization scheme for both vision and language transformers.
Our solution substantially improves the 2-bit Swin-T and binary BERT-base, achieving accuracy improvements of 3.35% and 1.4%, respectively.
arXiv Detail & Related papers (2023-07-01T13:01:39Z)
- PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z)
- NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers [53.85087932591237]
NoisyQuant is a quantizer-agnostic enhancement for the post-training activation quantization performance of vision transformers.
Building on this theoretical insight, NoisyQuant is the first to succeed in actively altering the heavy-tailed activation distribution.
NoisyQuant largely improves the post-training quantization performance of vision transformers with minimal computation overhead.
arXiv Detail & Related papers (2022-11-29T10:02:09Z)
- Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons.
We experimentally validate our method on various benchmark datasets and network architectures.
arXiv Detail & Related papers (2021-09-05T15:15:07Z)
- Post-Training Quantization for Vision Transformer [85.57953732941101]
We present an effective post-training quantization algorithm for reducing the memory storage and computational costs of vision transformers.
We obtain an 81.29% top-1 accuracy using the DeiT-B model on the ImageNet dataset with about 8-bit quantization.
arXiv Detail & Related papers (2021-06-27T06:27:22Z)
- Gradient $\ell_1$ Regularization for Quantization Robustness [70.39776106458858]
We derive a simple regularization scheme that improves robustness against post-training quantization.
By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on-demand to different bit-widths.
arXiv Detail & Related papers (2020-02-18T12:31:34Z)
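The gradient $\ell_1$ regularization entry above also lends itself to a short sketch. The PyTorch snippet below is a generic illustration of one reading of that title: adding a penalty on the $\ell_1$ norm of the task-loss gradients with respect to the weights so that the network becomes less sensitive to the small weight perturbations introduced by quantization. The model, data, and `lambda_reg` value are assumed for illustration; this is not the paper's implementation.

```python
import torch
import torch.nn as nn

# Hypothetical toy model and data; all names and hyperparameters are illustrative.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
lambda_reg = 0.05  # strength of the gradient L1 penalty

x = torch.randn(8, 16)
y = torch.randint(0, 4, (8,))

optimizer.zero_grad()
loss = criterion(model(x), y)

# Gradient L1 penalty: take gradients of the task loss w.r.t. the weights with
# create_graph=True so the penalty itself remains differentiable.
grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
penalty = sum(g.abs().sum() for g in grads)

(loss + lambda_reg * penalty).backward()
optimizer.step()
```

Because the penalty bounds the first-order change in the loss under bounded weight perturbations, a single set of weights trained this way can tolerate quantization to different bit-widths without retraining, matching the "quantize on-demand" claim in the summary.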