Scatterbrain: Unifying Sparse and Low-rank Attention Approximation
- URL: http://arxiv.org/abs/2110.15343v1
- Date: Thu, 28 Oct 2021 17:52:17 GMT
- Title: Scatterbrain: Unifying Sparse and Low-rank Attention Approximation
- Authors: Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, Christopher Ré
- Abstract summary: We propose Scatterbrain, a novel way to unify sparse (via locality sensitive hashing) and low-rank (via kernel feature map) attention for accurate approximation.
We empirically show that Scatterbrain can achieve 2.1x lower error than baselines when serving as a drop-in replacement in BigGAN image generation and pre-trained T2T-ViT.
We demonstrate Scatterbrain for end-to-end training with up to 4 points better perplexity and 5 points better average accuracy than sparse or low-rank efficient transformers on language modeling and long-range-arena tasks.
- Score: 25.375024028636663
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in efficient Transformers have exploited either the sparsity
or low-rank properties of attention matrices to reduce the computational and
memory bottlenecks of modeling long sequences. However, it is still challenging
to balance the trade-off between model quality and efficiency to perform a
one-size-fits-all approximation for different tasks. To better understand this
trade-off, we observe that sparse and low-rank approximations excel in
different regimes, determined by the softmax temperature in attention, and
sparse + low-rank can outperform each individually. Inspired by the classical
robust-PCA algorithm for sparse and low-rank decomposition, we propose
Scatterbrain, a novel way to unify sparse (via locality sensitive hashing) and
low-rank (via kernel feature map) attention for accurate and efficient
approximation. The estimation is unbiased with provably low error. We
empirically show that Scatterbrain can achieve 2.1x lower error than baselines
when serving as a drop-in replacement in BigGAN image generation and
pre-trained T2T-ViT. On a pre-trained T2T Vision transformer, even without
fine-tuning, Scatterbrain can reduce 98% of attention memory at the cost of
only 1% drop in accuracy. We demonstrate Scatterbrain for end-to-end training
with up to 4 points better perplexity and 5 points better average accuracy than
sparse or low-rank efficient transformers on language modeling and
long-range-arena tasks.
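
To make the construction described in the abstract concrete, below is a minimal NumPy sketch of the sparse + low-rank idea; it is not the authors' implementation. The low-rank term uses Performer-style positive random features to approximate exp(q·k/√d), and the sparse term replaces that approximation with exact values on query-key pairs that collide under a crude random-hyperplane hash standing in for LSH. The function names, feature count, and hashing scheme are illustrative assumptions.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Exact softmax attention, for reference."""
    d = Q.shape[-1]
    S = np.exp(Q @ K.T / np.sqrt(d))            # unnormalized attention scores
    return (S @ V) / S.sum(axis=1, keepdims=True)

def random_feature_map(X, W):
    """Positive (Performer-style) random features phi(x), so that
    E[phi(q) . phi(k)] ~= exp(q . k)."""
    # exp(q.k) = E_w[exp(w.q - |q|^2/2) * exp(w.k - |k|^2/2)] for w ~ N(0, I)
    m = W.shape[0]
    return np.exp(X @ W.T - (X ** 2).sum(-1, keepdims=True) / 2) / np.sqrt(m)

def scatterbrain_like_attention(Q, K, V, n_features=64, n_hashes=4, seed=0):
    """Low-rank (random-feature) estimate plus a sparse correction on entries
    whose keys fall in the same random hash bucket as the query."""
    rng = np.random.default_rng(seed)
    d = Q.shape[-1]
    Qs, Ks = Q / d ** 0.25, K / d ** 0.25        # fold in the 1/sqrt(d) scaling
    W = rng.normal(size=(n_features, d))
    phi_q, phi_k = random_feature_map(Qs, W), random_feature_map(Ks, W)

    # Low-rank part: phi_q @ (phi_k^T V) costs O(n * m * d) instead of O(n^2 * d).
    num = phi_q @ (phi_k.T @ V)
    den = phi_q @ phi_k.sum(axis=0)

    # Sparse part: a crude LSH stand-in using random signed projections.
    planes = rng.normal(size=(d, n_hashes))
    bq = (Qs @ planes > 0).astype(int) @ (2 ** np.arange(n_hashes))
    bk = (Ks @ planes > 0).astype(int) @ (2 ** np.arange(n_hashes))
    for i in range(len(Q)):
        idx = np.nonzero(bk == bq[i])[0]         # keys colliding with query i
        exact = np.exp(Qs[i] @ Ks[idx].T)        # exact scores on those entries
        approx = phi_q[i] @ phi_k[idx].T         # low-rank estimate on those entries
        num[i] += (exact - approx) @ V[idx]      # swap the estimate for exact values
        den[i] += (exact - approx).sum()
    return num / den[:, None]

# Quick check against exact attention on random data.
rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(3, 128, 32))
print(np.abs(softmax_attention(Q, K, V) - scatterbrain_like_attention(Q, K, V)).mean())
```

The low-rank term avoids materializing the full attention matrix and the sparse correction touches only colliding pairs, which is the kind of memory saving the abstract quotes.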
Related papers
- Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 of the parameters.
arXiv Detail & Related papers (2024-11-05T12:26:25Z)
- Memory-Efficient Vision Transformers: An Activation-Aware Mixed-Rank Compression Strategy [5.699098817569033]
This paper introduces an activation-aware model compression methodology that uses selective low-rank weight tensor approximations of different layers to reduce the parameter count of ViTs.
The presented method significantly reduces the parameter count of DeiT-B by 60% with less than 1% accuracy drop on the ImageNet dataset.
In addition to this, the presented compression technique can compress large DeiT/ViT models to have about the same model size as smaller DeiT/ViT variants while yielding up to 1.8% accuracy gain.
arXiv Detail & Related papers (2024-02-08T19:01:14Z)
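
A minimal sketch of the generic low-rank weight compression the summary above builds on: plain truncated SVD, with the rank chosen from a target parameter budget. The paper's method is additionally activation-aware and selects ranks per layer, which this sketch does not attempt; all names and numbers are illustrative.

```python
import numpy as np

def low_rank_compress(W, keep_ratio=0.4):
    """Replace W (d_out x d_in) with a rank-r factorization A @ B whose
    parameter count is roughly keep_ratio of the original."""
    d_out, d_in = W.shape
    # A rank-r factorization stores r * (d_out + d_in) parameters.
    r = max(1, int(keep_ratio * d_out * d_in / (d_out + d_in)))
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]          # d_out x r
    B = Vt[:r]                    # r x d_in
    return A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(768, 3072))               # e.g. a ViT MLP projection
A, B = low_rank_compress(W, keep_ratio=0.4)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
saved = 1 - (A.size + B.size) / W.size
# A random Gaussian matrix has a flat spectrum, so the error here is large;
# trained weight matrices typically have decaying spectra and compress better.
print(f"params saved: {saved:.0%}, relative error: {rel_err:.3f}")
```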
- Sparse Binary Transformers for Multivariate Time Series Modeling [1.3965477771846404]
We show that lightweight Compressed Neural Networks can achieve accuracy comparable to dense floating-point Transformers.
Our model achieves favorable results across three time series learning tasks: classification, anomaly detection, and single-step forecasting.
We measure the computational savings of our approach over a range of metrics including parameter count, bit size, and floating point operation (FLOPs) count.
arXiv Detail & Related papers (2023-08-09T00:23:04Z)
- Guaranteed Approximation Bounds for Mixed-Precision Neural Operators [83.64404557466528]
We build on the intuition that neural operator learning inherently induces an approximation error.
We show that our approach reduces GPU memory usage by up to 50% and improves throughput by 58% with little or no reduction in accuracy.
arXiv Detail & Related papers (2023-07-27T17:42:06Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
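
The summary above concerns unbiased, reduced-variance estimators for matrix products. Below is a sketch of the classical column-row sampling (CRS) estimator that this line of work starts from, not the winner-take-all variant the paper proposes; the sizes and sample counts are arbitrary.

```python
import numpy as np

def crs_matmul(A, B, n_samples, rng):
    """Unbiased column-row sampling estimate of A @ B: sample s of the k
    outer products A[:, i] B[i, :] with probability p_i and rescale each
    by 1 / (s * p_i). Plain CRS, not the WTA-CRS variant."""
    k = A.shape[1]
    # Variance-minimizing probabilities for vanilla CRS.
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = p / p.sum()
    idx = rng.choice(k, size=n_samples, p=p)
    scale = 1.0 / (n_samples * p[idx])            # makes the estimator unbiased
    return (A[:, idx] * scale) @ B[idx, :]

rng = np.random.default_rng(0)
A = rng.normal(size=(64, 1024))
B = rng.normal(size=(1024, 32))
exact = A @ B
approx = crs_matmul(A, B, n_samples=256, rng=rng)
print(np.linalg.norm(exact - approx) / np.linalg.norm(exact))   # relative error
```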
- The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [59.87030906486969]
This paper studies the curious phenomenon that, in machine learning models with Transformer architectures, the activation maps are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers.
arXiv Detail & Related papers (2022-10-12T15:25:19Z)
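
A tiny sketch of how the quantity discussed above can be measured: the fraction of exact zeros after the ReLU in a Transformer-style MLP block. With random weights, as below, roughly half the activations are zero; the paper's point is that trained Transformers are far sparser, which is what FLOP-saving kernels can exploit. The layer sizes are illustrative.

```python
import numpy as np

def relu_mlp_activation_sparsity(x, W1, b1):
    """Fraction of zeros in the hidden activations of a ReLU MLP block."""
    h = np.maximum(x @ W1 + b1, 0.0)       # ReLU hidden activations
    return (h == 0.0).mean()

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 256, 1024, 512
x = rng.normal(size=(n_tokens, d_model))
W1 = rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model)
b1 = np.zeros(d_ff)
# ~50% with random weights; trained models are reported to be much sparser.
print(f"zero activations: {relu_mlp_activation_sparsity(x, W1, b1):.0%}")
```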
- Efficient Decoder-free Object Detection with Transformers [75.00499377197475]
Vision transformers (ViTs) are changing the landscape of object detection approaches.
We propose a decoder-free fully transformer-based (DFFT) object detector.
DFFT_SMALL achieves high efficiency in both training and inference stages.
arXiv Detail & Related papers (2022-06-14T13:22:19Z)
- Accelerating Attention through Gradient-Based Learned Runtime Pruning [9.109136535767478]
Self-attention is a key enabler of state-of-the-art accuracy for transformer-based Natural Language Processing models.
This paper formulates the pruning search through a soft differentiable regularizer integrated into the training loss function.
We devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with a bit-level early-termination microarchitectural mechanism.
arXiv Detail & Related papers (2022-04-07T05:31:13Z)
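
A hedged sketch of the kind of differentiable runtime pruning the summary above describes: attention scores below a learnable threshold are softly gated to zero with a sigmoid, and a sparsity term is added to the loss so the threshold can be trained by gradient descent. This is an illustrative stand-in, not LeOPArd's actual formulation, and the gradient step itself is left to whatever autograd framework the model is trained in.

```python
import numpy as np

def soft_pruned_attention(scores, threshold, temperature=10.0):
    """Softly zero attention scores below `threshold` with a sigmoid gate,
    keeping the operation differentiable with respect to the threshold."""
    gate = 1.0 / (1.0 + np.exp(-temperature * (scores - threshold)))
    probs = np.exp(scores) * gate
    return probs / probs.sum(axis=-1, keepdims=True), gate

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 16))            # one head: 4 queries x 16 keys
threshold = 0.5                              # would be a learned parameter
probs, gate = soft_pruned_attention(scores, threshold)

# Loss = task loss + lambda * (fraction of attention kept), which pushes the
# threshold up so that more scores can be pruned at runtime.
task_loss = 0.0                              # placeholder for the real objective
loss = task_loss + 0.01 * gate.mean()
print(f"kept fraction ~ {gate.mean():.2f}, regularized loss = {loss:.4f}")
```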
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
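
The summary above is about converting softmax attention into a linear-complexity recurrent form. The sketch below shows generic causal linear attention computed two equivalent ways: the cumulative-sum form applied in parallel over a sequence, and the RNN form applied token by token at generation time. It illustrates the recurrence, not the paper's specific conversion and finetuning procedure; the feature map and sizes are assumptions.

```python
import numpy as np

def phi(x):
    """Positive feature map (elu(x) + 1), a common choice in linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention_parallel(Q, K, V):
    """Training-time form: causal linear attention via cumulative sums."""
    q, k = phi(Q), phi(K)
    kv = np.cumsum(k[:, :, None] * V[:, None, :], axis=0)   # running sum of phi(k) v^T
    z = np.cumsum(k, axis=0)                                 # running sum of phi(k)
    num = np.einsum("tm,tmd->td", q, kv)
    den = np.einsum("tm,tm->t", q, z)
    return num / den[:, None]

def causal_linear_attention_recurrent(Q, K, V):
    """Generation-time form: the same computation as an RNN with state (S, z),
    so each new token costs O(d^2) instead of attending to the whole prefix."""
    q, k = phi(Q), phi(K)
    m, d = k.shape[1], V.shape[1]
    S, z, out = np.zeros((m, d)), np.zeros(m), []
    for t in range(len(Q)):
        S += np.outer(k[t], V[t])
        z += k[t]
        out.append((q[t] @ S) / (q[t] @ z))
    return np.stack(out)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 10, 8))
print(np.allclose(causal_linear_attention_parallel(Q, K, V),
                  causal_linear_attention_recurrent(Q, K, V)))   # True
```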
- AxFormer: Accuracy-driven Approximation of Transformers for Faster, Smaller and more Accurate NLP Models [4.247712017691596]
AxFormer is a framework that applies accuracy-driven approximations to create optimized transformer models for a given downstream task.
Our experiments show that AxFormer models are up to 4.5% more accurate, while also being up to 2.5X faster and up to 3.2X smaller than conventional fine-tuned models.
arXiv Detail & Related papers (2020-10-07T23:29:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.