On the Distribution, Sparsity, and Inference-time Quantization of
Attention Values in Transformers
- URL: http://arxiv.org/abs/2106.01335v1
- Date: Wed, 2 Jun 2021 17:45:47 GMT
- Title: On the Distribution, Sparsity, and Inference-time Quantization of
Attention Values in Transformers
- Authors: Tianchu Ji, Shraddhan Jain, Michael Ferdman, Peter Milder, H. Andrew
Schwartz, Niranjan Balasubramanian
- Abstract summary: We study the full range of typical attention values necessary for NLP tasks.
We find nearly 80% of attention values can be pruned to zeros with minimal ($< 1.0\%$) relative loss in accuracy.
We use this pruning technique in conjunction with quantizing the attention values to only a 3-bit format, without retraining, resulting in only a 0.8% accuracy reduction on question answering with fine-tuned RoBERTa.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How much information do NLP tasks really need from a transformer's attention
mechanism at application-time (inference)? From recent work, we know that there
is sparsity in transformers and that the floating-point values within their
computations can be discretized to fewer values with minimal loss to task accuracies.
However, this requires retraining or even creating entirely new models, both of
which can be expensive and carbon-emitting. Focused on optimizations that do
not require training, we systematically study the full range of typical
attention values necessary. This informs the design of an inference-time
quantization technique using both pruning and log-scaled mapping which produces
only a few (e.g. $2^3$) unique values. Over the tasks of question answering and
sentiment analysis, we find nearly 80% of attention values can be pruned to
zeros with minimal ($< 1.0\%$) relative loss in accuracy. We use this pruning
technique in conjunction with quantizing the attention values to only a 3-bit
format, without retraining, resulting in only a 0.8% accuracy reduction on
question answering with fine-tuned RoBERTa.
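Below is a minimal NumPy sketch of the inference-time recipe the abstract describes: prune the smallest attention probabilities to zero, then map the survivors onto a log-scaled codebook with only $2^3$ unique values. The percentile-based pruning threshold, the specific power-of-two codebook, and the optional renormalization are illustrative assumptions, not the authors' exact calibration.
```python
import numpy as np

def prune_and_log_quantize(attn, prune_frac=0.8, bits=3):
    """Inference-time pruning + log-scaled quantization of attention values.

    attn:       softmax attention probabilities in [0, 1], any shape.
    prune_frac: fraction of the smallest attention values zeroed out (~0.8 in the paper).
    bits:       the codebook holds 2**bits unique values, one of which is zero.

    Sketch assumptions: a single percentile threshold over all values and a fixed
    power-of-two codebook {0, 2^0, 2^-1, ..., 2^-(2**bits - 2)}; the paper's exact
    thresholding and binning may differ.
    """
    # 1) Pruning: zero out the smallest `prune_frac` of attention values.
    thresh = np.quantile(attn, prune_frac)
    pruned = np.where(attn >= thresh, attn, 0.0)

    # 2) Log-scaled mapping: round log2(p) of each surviving value to the nearest
    #    integer exponent and clip it to the codebook range; zeros stay zero.
    n_nonzero = 2 ** bits - 1                  # nonzero codebook levels (7 for 3 bits)
    min_exp = -(n_nonzero - 1)                 # smallest exponent, e.g. -6 for 3 bits
    safe = np.where(pruned > 0, pruned, 1.0)   # avoid log2(0); masked out below
    exp = np.clip(np.rint(np.log2(safe)), min_exp, 0)
    return np.where(pruned > 0, 2.0 ** exp, 0.0)

# Toy usage: a single attention row.
row = np.array([0.55, 0.20, 0.10, 0.05, 0.04, 0.03, 0.02, 0.01])
q = prune_and_log_quantize(row)
print(q)            # only power-of-two levels and zeros remain
print(q / q.sum())  # optional renormalization so the row sums to 1 again
```
With these settings roughly 80% of the values become zero and the rest collapse onto at most seven power-of-two levels, i.e. 3-bit codes; whether to renormalize the quantized rows afterwards is a design choice the abstract does not specify.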
Related papers
- Sparse Binary Transformers for Multivariate Time Series Modeling [1.3965477771846404]
We show that lightweight Compressed Neural Networks can achieve accuracy comparable to dense floating-point Transformers.
Our model achieves favorable results across three time series learning tasks: classification, anomaly detection, and single-step forecasting.
We measure the computational savings of our approach over a range of metrics including parameter count, bit size, and floating point operation (FLOPs) count.
arXiv Detail & Related papers (2023-08-09T00:23:04Z)
- Guaranteed Approximation Bounds for Mixed-Precision Neural Operators [83.64404557466528]
We build on the intuition that neural operator learning inherently induces an approximation error.
We show that our approach reduces GPU memory usage by up to 50% and improves throughput by 58% with little or no reduction in accuracy.
arXiv Detail & Related papers (2023-07-27T17:42:06Z)
- Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training [110.79400526706081]
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage limit their generalization.
Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference.
This paper proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT.
arXiv Detail & Related papers (2022-11-19T21:15:47Z)
- Accelerating Attention through Gradient-Based Learned Runtime Pruning [9.109136535767478]
Self-attention is a key enabler of state-of-the-art accuracy for transformer-based Natural Language Processing models.
This paper formulates the search for pruning thresholds through a soft differentiable regularizer integrated into the training loss function.
We devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with a bit-level early-termination microarchitectural mechanism.
arXiv Detail & Related papers (2022-04-07T05:31:13Z)
- MARViN -- Multiple Arithmetic Resolutions Vacillating in Neural Networks [0.0]
We introduce MARViN, a new quantized training strategy using information theory-based intra-epoch precision switching.
We achieve an average speedup of 1.86x over a float32 baseline while limiting the mean accuracy degradation on AlexNet/ResNet to only -0.075%.
arXiv Detail & Related papers (2021-07-28T16:57:05Z)
- Post-Training Quantization for Vision Transformer [85.57953732941101]
We present an effective post-training quantization algorithm for reducing the memory storage and computational costs of vision transformers.
We can obtain 81.29% top-1 accuracy using the DeiT-B model on the ImageNet dataset with about 8-bit quantization.
arXiv Detail & Related papers (2021-06-27T06:27:22Z)
- How Low Can We Go: Trading Memory for Error in Low-Precision Training [52.94003953419242]
Low-precision arithmetic trains deep learning models using less energy, less memory and less time.
We pay a price for the savings: lower precision may yield larger round-off error and hence larger prediction error.
We borrow ideas from meta-learning to learn the tradeoff between memory and error.
arXiv Detail & Related papers (2021-06-17T17:38:07Z)
- DoT: An efficient Double Transformer for NLP tasks with tables [3.0079490585515343]
DoT is a double transformer model that decomposes the problem into two sub-tasks.
We show that, for a small drop in accuracy, DoT improves training and inference time by at least 50%.
arXiv Detail & Related papers (2021-06-01T13:33:53Z)
- n-hot: Efficient bit-level sparsity for powers-of-two neural network quantization [0.0]
Powers-of-two (PoT) quantization reduces the number of bit operations of deep neural networks on resource-constrained hardware.
However, PoT quantization triggers a severe accuracy drop because of its limited representation ability.
We propose an efficient PoT quantization scheme that balances accuracy and costs in a memory-efficient way (a generic power-of-two rounding sketch is shown after this list).
arXiv Detail & Related papers (2021-03-22T10:13:12Z)
- DAQ: Distribution-Aware Quantization for Deep Image Super-Resolution Networks [49.191062785007006]
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational costs.
Existing works either suffer a severe performance drop at ultra-low precision (bit-widths of 4 or lower) or require a heavy fine-tuning process to recover performance.
We propose a novel distribution-aware quantization scheme (DAQ) which facilitates accurate training-free quantization in ultra-low precision.
arXiv Detail & Related papers (2020-12-21T10:19:42Z)
- BitPruning: Learning Bitlengths for Aggressive and Accurate Quantization [57.14179747713731]
We introduce a training method for minimizing inference bitlength at any granularity while maintaining accuracy.
On ImageNet, the method produces average per-layer bitlengths of 4.13, 3.76, and 4.36 bits.
arXiv Detail & Related papers (2020-02-08T04:58:33Z)
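As referenced from the n-hot entry above, the sketch below shows generic powers-of-two (PoT) quantization of a weight tensor, i.e. rounding each value to a signed power of two under a per-tensor scale. It illustrates why the representation is cheap (multiplications become bit shifts) and why its resolution is limited; this is a textbook PoT baseline with assumed parameter names, not the n-hot scheme from that paper.
```python
import numpy as np

def pot_quantize(w, bits=4):
    """Round weights to signed powers of two: w ~ sign(w) * s * 2^exp.

    Generic PoT baseline (not the n-hot scheme): `bits` covers one sign bit plus
    the exponent code, so exponents span {0, -1, ..., -(2**(bits-1) - 2)} and one
    code is reserved for zero.
    """
    s = np.max(np.abs(w))                      # per-tensor scale (an assumption)
    if s == 0:
        return np.zeros_like(w)
    n_exp = 2 ** (bits - 1) - 1                # number of nonzero magnitude levels
    mag = np.abs(w) / s
    # Round log2 of the normalized magnitude to the nearest integer exponent.
    exp = np.clip(np.rint(np.log2(np.maximum(mag, 2.0 ** -n_exp))), -(n_exp - 1), 0)
    q = np.sign(w) * s * 2.0 ** exp
    # Magnitudes below half of the smallest level are flushed to zero.
    return np.where(mag < 2.0 ** (-n_exp), 0.0, q)

w = np.array([0.9, -0.3, 0.04, -0.001])
print(pot_quantize(w, bits=4))   # -> [ 0.9  -0.225  0.05625  0. ]
```
Rounding in log space keeps the relative error roughly uniform across magnitudes, which is the same reason the main paper's log-scaled mapping suits attention values that span several orders of magnitude.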
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.