On the Distribution, Sparsity, and Inference-time Quantization of
Attention Values in Transformers
- URL: http://arxiv.org/abs/2106.01335v1
- Date: Wed, 2 Jun 2021 17:45:47 GMT
- Title: On the Distribution, Sparsity, and Inference-time Quantization of
Attention Values in Transformers
- Authors: Tianchu Ji, Shraddhan Jain, Michael Ferdman, Peter Milder, H. Andrew
Schwartz, Niranjan Balasubramanian
- Abstract summary: We study the full range of typical attention values necessary for NLP tasks.
We find nearly 80% of attention values can be pruned to zeros with minimal ($< 1.0\%$) relative loss in accuracy.
We use this pruning technique in conjunction with quantizing the attention values to only a 3-bit format, without retraining, resulting in only a 0.8% accuracy reduction on question answering with fine-tuned RoBERTa.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How much information do NLP tasks really need from a transformer's attention
mechanism at application-time (inference)? From recent work, we know that there
is sparsity in transformers and that the floating-point values within their
computations can be discretized to fewer values with minimal loss to task accuracies.
However, this requires retraining or even creating entirely new models, both of
which can be expensive and carbon-emitting. Focused on optimizations that do
not require training, we systematically study the full range of typical
attention values necessary. This informs the design of an inference-time
quantization technique using both pruning and log-scaled mapping which produces
only a few (e.g. $2^3$) unique values. Over the tasks of question answering and
sentiment analysis, we find nearly 80% of attention values can be pruned to
zeros with minimal ($< 1.0\%$) relative loss in accuracy. We use this pruning
technique in conjunction with quantizing the attention values to only a 3-bit
format, without retraining, resulting in only a 0.8% accuracy reduction on
question answering with fine-tuned RoBERTa.
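Below is a minimal NumPy sketch of the inference-time recipe the abstract describes: prune the smallest attention probabilities to zero, then map the survivors onto a log-scaled codebook with only $2^3$ unique values. The percentile-based pruning threshold, the specific power-of-two codebook, and the optional renormalization are illustrative assumptions, not the authors' exact calibration.
```python
import numpy as np

def prune_and_log_quantize(attn, prune_frac=0.8, bits=3):
    """Inference-time pruning + log-scaled quantization of attention values.

    attn:       softmax attention probabilities in [0, 1], any shape.
    prune_frac: fraction of the smallest attention values zeroed out (~0.8 in the paper).
    bits:       the codebook holds 2**bits unique values, one of which is zero.

    Sketch assumptions: a single percentile threshold over all values and a fixed
    power-of-two codebook {0, 2^0, 2^-1, ..., 2^-(2**bits - 2)}; the paper's exact
    thresholding and binning may differ.
    """
    # 1) Pruning: zero out the smallest `prune_frac` of attention values.
    thresh = np.quantile(attn, prune_frac)
    pruned = np.where(attn >= thresh, attn, 0.0)

    # 2) Log-scaled mapping: round log2(p) of each surviving value to the nearest
    #    integer exponent and clip it to the codebook range; zeros stay zero.
    n_nonzero = 2 ** bits - 1                  # nonzero codebook levels (7 for 3 bits)
    min_exp = -(n_nonzero - 1)                 # smallest exponent, e.g. -6 for 3 bits
    safe = np.where(pruned > 0, pruned, 1.0)   # avoid log2(0); masked out below
    exp = np.clip(np.rint(np.log2(safe)), min_exp, 0)
    return np.where(pruned > 0, 2.0 ** exp, 0.0)

# Toy usage: a single attention row.
row = np.array([0.55, 0.20, 0.10, 0.05, 0.04, 0.03, 0.02, 0.01])
q = prune_and_log_quantize(row)
print(q)            # only power-of-two levels and zeros remain
print(q / q.sum())  # optional renormalization so the row sums to 1 again
```
With these settings roughly 80% of the values become zero and the rest collapse onto at most seven power-of-two levels, i.e. 3-bit codes; whether to renormalize the quantized rows afterwards is a design choice the abstract does not specify.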
Related papers
- Sparse Binary Transformers for Multivariate Time Series Modeling [1.3965477771846404]
We show that lightweight Compressed Neural Networks can achieve accuracy comparable to dense floating-point Transformers.
Our model achieves favorable results across three time series learning tasks: classification, anomaly detection, and single-step forecasting.
We measure the computational savings of our approach over a range of metrics including parameter count, bit size, and floating point operation (FLOPs) count.
arXiv Detail & Related papers (2023-08-09T00:23:04Z)
- Guaranteed Approximation Bounds for Mixed-Precision Neural Operators [83.64404557466528]
We build on the intuition that neural operator learning inherently induces an approximation error.
We show that our approach reduces GPU memory usage by up to 50% and improves throughput by 58% with little or no reduction in accuracy.
arXiv Detail & Related papers (2023-07-27T17:42:06Z)
- Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training [110.79400526706081]
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage limit their generalization.
Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference.
This paper proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT.
arXiv Detail & Related papers (2022-11-19T21:15:47Z)
- Accelerating Attention through Gradient-Based Learned Runtime Pruning [9.109136535767478]
Self-attention is a key enabler of state-of-the-art accuracy for transformer-based Natural Language Processing models.
This paper formulates the search for pruning thresholds through a soft differentiable regularizer integrated into the training loss function.
We devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with a bit-level early-termination microarchitectural mechanism.
arXiv Detail & Related papers (2022-04-07T05:31:13Z)
- MARViN -- Multiple Arithmetic Resolutions Vacillating in Neural Networks [0.0]
We introduce MARViN, a new quantized training strategy using information theory-based intra-epoch precision switching.
We achieve an average speedup of 1.86x over a float32 baseline while limiting the mean accuracy degradation on AlexNet/ResNet to only -0.075%.
arXiv Detail & Related papers (2021-07-28T16:57:05Z)
- Post-Training Quantization for Vision Transformer [85.57953732941101]
We present an effective post-training quantization algorithm for reducing the memory storage and computational costs of vision transformers.
We can obtain 81.29% top-1 accuracy using the DeiT-B model on the ImageNet dataset with about 8-bit quantization.
arXiv Detail & Related papers (2021-06-27T06:27:22Z)
- How Low Can We Go: Trading Memory for Error in Low-Precision Training [52.94003953419242]
Low-precision arithmetic trains deep learning models using less energy, less memory and less time.
We pay a price for the savings: lower precision may yield larger round-off error and hence larger prediction error.
We borrow ideas from meta-learning to learn the tradeoff between memory and error.
arXiv Detail & Related papers (2021-06-17T17:38:07Z)
- DoT: An efficient Double Transformer for NLP tasks with tables [3.0079490585515343]
DoT is a double transformer model that decomposes the problem into two sub-tasks.
We show that, for a small drop in accuracy, DoT improves training and inference time by at least 50%.
arXiv Detail & Related papers (2021-06-01T13:33:53Z)
- n-hot: Efficient bit-level sparsity for powers-of-two neural network quantization [0.0]
Powers-of-two (PoT) quantization reduces the number of bit operations of deep neural networks on resource-constrained hardware.
However, PoT quantization triggers a severe accuracy drop because of its limited representation ability.
We propose an efficient PoT quantization scheme that balances accuracy and costs in a memory-efficient way (a generic power-of-two rounding sketch is shown after this list).
arXiv Detail & Related papers (2021-03-22T10:13:12Z)
- DAQ: Distribution-Aware Quantization for Deep Image Super-Resolution Networks [49.191062785007006]
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational costs.
Existing works either suffer a severe performance drop at ultra-low precision (bit-widths of 4 or lower) or require a heavy fine-tuning process to recover performance.
We propose a novel distribution-aware quantization scheme (DAQ) which facilitates accurate training-free quantization in ultra-low precision.
arXiv Detail & Related papers (2020-12-21T10:19:42Z)
- BitPruning: Learning Bitlengths for Aggressive and Accurate Quantization [57.14179747713731]
We introduce a training method for minimizing inference bitlength at any granularity while maintaining accuracy.
On ImageNet, the method produces average per-layer bitlengths of 4.13, 3.76, and 4.36 bits.
arXiv Detail & Related papers (2020-02-08T04:58:33Z)
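As referenced from the n-hot entry above, the sketch below shows generic powers-of-two (PoT) quantization of a weight tensor, i.e. rounding each value to a signed power of two under a per-tensor scale. It illustrates why the representation is cheap (multiplications become bit shifts) and why its resolution is limited; this is a textbook PoT baseline with assumed parameter names, not the n-hot scheme from that paper.
```python
import numpy as np

def pot_quantize(w, bits=4):
    """Round weights to signed powers of two: w ~ sign(w) * s * 2^exp.

    Generic PoT baseline (not the n-hot scheme): `bits` covers one sign bit plus
    the exponent code, so exponents span {0, -1, ..., -(2**(bits-1) - 2)} and one
    code is reserved for zero.
    """
    s = np.max(np.abs(w))                      # per-tensor scale (an assumption)
    if s == 0:
        return np.zeros_like(w)
    n_exp = 2 ** (bits - 1) - 1                # number of nonzero magnitude levels
    mag = np.abs(w) / s
    # Round log2 of the normalized magnitude to the nearest integer exponent.
    exp = np.clip(np.rint(np.log2(np.maximum(mag, 2.0 ** -n_exp))), -(n_exp - 1), 0)
    q = np.sign(w) * s * 2.0 ** exp
    # Magnitudes below half of the smallest level are flushed to zero.
    return np.where(mag < 2.0 ** (-n_exp), 0.0, q)

w = np.array([0.9, -0.3, 0.04, -0.001])
print(pot_quantize(w, bits=4))   # -> [ 0.9  -0.225  0.05625  0. ]
```
Rounding in log space keeps the relative error roughly uniform across magnitudes, which is the same reason the main paper's log-scaled mapping suits attention values that span several orders of magnitude.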
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.