Accelerating Attention through Gradient-Based Learned Runtime Pruning
- URL: http://arxiv.org/abs/2204.03227v2
- Date: Fri, 8 Apr 2022 15:41:04 GMT
- Title: Accelerating Attention through Gradient-Based Learned Runtime Pruning
- Authors: Zheng Li and Soroush Ghodrati and Amir Yazdanbakhsh and Hadi
Esmaeilzadeh and Mingu Kang
- Abstract summary: Self-attention is a key enabler of state-of-the-art accuracy for transformer-based Natural Language Processing models.
This paper formulates its search through a soft differentiable regularizer integrated into the loss function of the training.
We devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with a bit-level early-termination microarchitectural mechanism.
- Score: 9.109136535767478
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-attention is a key enabler of state-of-the-art accuracy for various
transformer-based Natural Language Processing models. This attention mechanism
calculates a correlation score for each word with respect to the other words in
a sentence. Commonly, only a small subset of words highly correlates with the
word under attention, which is only determined at runtime. As such, a
significant amount of computation is inconsequential due to low attention
scores and can potentially be pruned. The main challenge is finding the
threshold for the scores below which subsequent computation will be
inconsequential. Although such a threshold is discrete, this paper formulates
its search through a soft differentiable regularizer integrated into the loss
function of the training. This formulation piggybacks on the back-propagation
training to analytically co-optimize the threshold and the weights
simultaneously, striking a formally optimal balance between accuracy and
computation pruning. To best utilize this mathematical innovation, we devise a
bit-serial architecture, dubbed LeOPArd, for transformer language models with a
bit-level early-termination microarchitectural mechanism. We evaluate our
design across 43 back-end tasks for MemN2N, BERT, ALBERT, GPT-2, and Vision
transformer models. Post-layout results show that, on average, LeOPArd yields
1.9x and 3.9x speedup and energy reduction, respectively, while keeping the
average accuracy virtually intact (<0.2% degradation).
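The two mechanisms described in the abstract can be made concrete with a short sketch. The block below is a minimal, hypothetical PyTorch rendering, not the paper's actual formulation: a learnable threshold gates the attention scores through a temperature-controlled sigmoid so that back-propagation co-optimizes the threshold with the weights, and the mean gate value acts as the pruning regularizer added to the loss. The second function sketches bit-level early termination at inference, where the query-key dot product is accumulated one key bit plane at a time (most significant bit first) and abandoned once an optimistic bound on the final score falls below the learned threshold. The class and function names, the sigmoid relaxation, the temperature, and lambda_reg are assumptions introduced here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftThresholdAttention(nn.Module):
    """Sketch of self-attention with a learnable pruning threshold.

    The hard rule "drop scores below a threshold" is relaxed into a sigmoid
    gate so that the threshold can be trained with back-propagation alongside
    the weights (hypothetical formulation for illustration only).
    """

    def __init__(self, dim, temperature=10.0):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.threshold = nn.Parameter(torch.tensor(0.0))  # learned jointly with the weights
        self.temperature = temperature                     # sharpness of the soft gate

    def forward(self, x):
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        # Soft keep/drop decision: ~1 above the threshold, ~0 below, smooth in between.
        gate = torch.sigmoid(self.temperature * (scores - self.threshold))
        attn = F.softmax(scores, dim=-1) * gate
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-9)
        # The mean gate value estimates the fraction of score/value work kept, so
        # loss = task_loss + lambda_reg * keep_fraction rewards pruning.
        return attn @ v, gate.mean()


def bit_serial_score_with_early_exit(q, k, threshold, total_bits=8):
    """Sketch of an MSB-first bit-serial q.k computation with early termination.

    Assumes key values in [0, 1] quantized to `total_bits` bits. After each bit
    plane, an optimistic upper bound on the final score is compared against the
    learned threshold; once even the bound falls short, the remaining bit planes
    (and the corresponding value accumulation) are skipped.
    """
    k_int = torch.round(k * (2 ** total_bits - 1)).to(torch.int64)
    partial = torch.zeros(())
    q_pos_sum = q.clamp_min(0).sum()                 # the most the remaining bits could add
    for b in range(total_bits - 1, -1, -1):          # most significant bit plane first
        bit = ((k_int >> b) & 1).float()
        partial = partial + (q * bit).sum() * (2 ** b)
        if partial + q_pos_sum * (2 ** b - 1) < threshold * (2 ** total_bits - 1):
            return None                              # pruned: the score cannot reach the threshold
    return partial / (2 ** total_bits - 1)           # dequantized dot product
```

At inference time the soft gate collapses to a hard comparison against the learned threshold, which is what allows a bit-serial pipeline of this kind to stop early instead of finishing the full-precision score.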
Related papers
- Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 of the parameters.
arXiv Detail & Related papers (2024-11-05T12:26:25Z)
- Sparse Binary Transformers for Multivariate Time Series Modeling [1.3965477771846404]
We show that lightweight Compressed Neural Networks can achieve accuracy comparable to dense floating-point Transformers.
Our model achieves favorable results across three time series learning tasks: classification, anomaly detection, and single-step forecasting.
We measure the computational savings of our approach over a range of metrics including parameter count, bit size, and floating point operation (FLOPs) count.
arXiv Detail & Related papers (2023-08-09T00:23:04Z)
- Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation [89.88214896713846]
The STA score considers two critical factors: temporal redundancy and semantic importance.
We apply the STA module to off-the-shelf video Transformers and Video Swin Transformers.
On Kinetics-400 and Something-Something V2, this achieves roughly 30% overall computation reduction with a negligible 0.2% accuracy drop.
arXiv Detail & Related papers (2023-08-08T19:38:15Z)
- Guaranteed Approximation Bounds for Mixed-Precision Neural Operators [83.64404557466528]
We build on the intuition that neural operator learning inherently induces an approximation error.
We show that our approach reduces GPU memory usage by up to 50% and improves throughput by 58% with little or no reduction in accuracy.
arXiv Detail & Related papers (2023-07-27T17:42:06Z)
- End-to-end symbolic regression with transformers [20.172752966322214]
Symbolic regression is a difficult task which usually involves a two-step procedure: predicting the skeleton of the expression, then fitting its numerical constants.
We show that an end-to-end Transformer that predicts the full expression, constants included, approaches the accuracy of state-of-the-art two-step approaches while being much faster.
arXiv Detail & Related papers (2022-04-22T06:55:43Z)
- Scatterbrain: Unifying Sparse and Low-rank Attention Approximation [25.375024028636663]
We propose Scatterbrain, a novel way to unify sparse (via locality sensitive hashing) and low-rank (via kernel feature map) attention for accurate approximation.
We empirically show that Scatterbrain can achieve 2.1x lower error than baselines when serving as a drop-in replacement in BigGAN image generation and pre-trained T2T-ViT.
We demonstrate Scatterbrain for end-to-end training with up to 4 points better perplexity and 5 points better average accuracy than sparse or low-rank efficient transformers on language modeling and long-range-arena tasks.
arXiv Detail & Related papers (2021-10-28T17:52:17Z)
- On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers [13.401707395755746]
We study the full range of typical attention values necessary for NLP tasks.
We find that nearly 80% of attention values can be pruned to zeros with a minimal (<1.0%) relative loss in accuracy.
We use this pruning technique in conjunction with quantizing the attention values to only a 3-bit format, without retraining, resulting in only a 0.8% accuracy reduction on question answering with fine-tuned RoBERTa (a simplified sketch of this pruning-and-quantization step appears after this list).
arXiv Detail & Related papers (2021-06-02T17:45:47Z)
- DoT: An efficient Double Transformer for NLP tasks with tables [3.0079490585515343]
DoT is a double transformer model that decomposes the problem into two sub-tasks.
We show that, for a small drop in accuracy, DoT improves training and inference time by at least 50%.
arXiv Detail & Related papers (2021-06-01T13:33:53Z)
- Iterative Refinement in the Continuous Space for Non-Autoregressive Neural Machine Translation [68.25872110275542]
We propose an efficient inference procedure for non-autoregressive machine translation.
It iteratively refines translation purely in the continuous space.
We evaluate our approach on WMT'14 En-De, WMT'16 Ro-En and IWSLT'16 De-En.
arXiv Detail & Related papers (2020-09-15T15:30:14Z)
- Predictive Coding Approximates Backprop along Arbitrary Computation Graphs [68.8204255655161]
We develop a strategy to translate core machine learning architectures into their predictive coding equivalents.
Our models perform equivalently to backprop on challenging machine learning benchmarks.
Our method raises the potential that standard machine learning algorithms could in principle be directly implemented in neural circuitry.
arXiv Detail & Related papers (2020-06-07T15:35:47Z)
- Pseudo-Convolutional Policy Gradient for Sequence-to-Sequence Lip-Reading [96.48553941812366]
Lip-reading aims to infer the speech content from the lip movement sequence.
Traditional learning process of seq2seq models suffers from two problems.
We propose a novel pseudo-convolutional policy gradient (PCPG) based method to address these two problems.
arXiv Detail & Related papers (2020-03-09T09:12:26Z)
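As a companion to the "On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers" entry above, the sketch below illustrates the kind of post-softmax pruning and 3-bit quantization it summarizes. This is a simplified rendering under assumed details (a fixed probability cutoff and a uniform quantizer), not the authors' code.

```python
import torch

def prune_and_quantize_attention(attn, threshold=0.01, bits=3):
    """Zero out small post-softmax attention values, then quantize the rest.

    `attn` holds attention probabilities in [0, 1]. Values below `threshold`
    are pruned to zero; surviving values are rounded onto a uniform grid with
    2**bits levels. The cutoff and the quantizer are illustrative choices only.
    """
    pruned = torch.where(attn >= threshold, attn, torch.zeros_like(attn))
    levels = 2 ** bits - 1                        # 3 bits -> 8 levels (0 ... 7)
    return torch.round(pruned * levels) / levels  # dequantized, 3-bit-resolution values
```

Applied to a trained model without retraining, this is the kind of drop-in post-processing whose accuracy cost that paper measures.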