Treeformer: Dense Gradient Trees for Efficient Attention Computation
- URL: http://arxiv.org/abs/2208.09015v1
- Date: Thu, 18 Aug 2022 18:31:40 GMT
- Title: Treeformer: Dense Gradient Trees for Efficient Attention Computation
- Authors: Lovish Madaan, Srinadh Bhojanapalli, Himanshu Jain, Prateek Jain
- Abstract summary: We show how to speed up attention computation by enforcing different attention structures such as sparsity, low-rank, approximating attention using kernels.
Based on such hierarchical navigation, we design Treeformer which can use one of two efficient attention layers -- TF-Attention and TC-Attention.
We demonstrate that our Treeformer architecture can be almost as accurate as baseline Transformer while using 30x lesser FLOPs in the attention layer.
- Score: 24.045251327736814
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Standard inference and training with transformer based architectures scale
quadratically with input sequence length. This is prohibitively expensive for a
variety of applications, especially web-page translation, query answering, etc.
Consequently, several approaches have been developed recently to speed up
attention computation by enforcing different attention structures such as
sparsity, low-rank structure, or kernel approximations of attention. In this work, we
view attention computation as that of nearest neighbor retrieval, and use
decision tree based hierarchical navigation to reduce the retrieval cost per
query token from linear in sequence length to nearly logarithmic. Based on such
hierarchical navigation, we design Treeformer, which can use one of two
efficient attention layers -- TF-Attention and TC-Attention. TF-Attention
computes the attention in a fine-grained style, while TC-Attention is a coarse
attention layer which also ensures that the gradients are "dense". To optimize
such challenging discrete layers, we propose a two-level bootstrapped training
method. Using extensive experiments on standard NLP benchmarks, especially for
long sequences, we demonstrate that our Treeformer architecture can be almost
as accurate as the baseline Transformer while using 30x fewer FLOPs in the
attention layer. Compared to Linformer, the accuracy can be as much as 12%
higher while using similar FLOPs in the attention layer.
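To make the retrieval view concrete, here is a minimal NumPy sketch of tree-based attention under simplifying assumptions: a plain k-d-style median-split tree over the keys stands in for the paper's learned decision trees, and each query attends only to the keys in its leaf bucket (roughly the flavor of TF-Attention, without TC-Attention or the two-level bootstrapped training). The names build_tree, route, tree_attention, and leaf_size are illustrative, not the authors' API.

```python
import numpy as np

def build_tree(keys, idx, leaf_size=16):
    """Recursively split key indices with axis-aligned median cuts (k-d style)."""
    if len(idx) <= leaf_size:
        return {"leaf": idx}
    axis = int(np.argmax(keys[idx].var(axis=0)))      # most spread-out dimension
    thresh = float(np.median(keys[idx, axis]))
    left = idx[keys[idx, axis] <= thresh]
    right = idx[keys[idx, axis] > thresh]
    if len(left) == 0 or len(right) == 0:             # degenerate split: stop here
        return {"leaf": idx}
    return {"axis": axis, "thresh": thresh,
            "left": build_tree(keys, left, leaf_size),
            "right": build_tree(keys, right, leaf_size)}

def route(tree, q):
    """Walk a query down the tree: ~log(n) comparisons instead of n key scores."""
    while "leaf" not in tree:
        tree = tree["left"] if q[tree["axis"]] <= tree["thresh"] else tree["right"]
    return tree["leaf"]

def tree_attention(Q, K, V, leaf_size=16):
    """Each query attends only to the keys stored in its leaf bucket."""
    tree = build_tree(K, np.arange(len(K)), leaf_size)
    out = np.zeros_like(Q)
    for i, q in enumerate(Q):
        bucket = route(tree, q)
        scores = K[bucket] @ q / np.sqrt(K.shape[1])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[i] = w @ V[bucket]
    return out

# Toy usage: 512 tokens, 64-dimensional heads.
rng = np.random.default_rng(0)
Q = rng.standard_normal((512, 64))
K = rng.standard_normal((512, 64))
V = rng.standard_normal((512, 64))
print(tree_attention(Q, K, V).shape)   # (512, 64)
```

The per-query cost is the tree depth plus one small softmax over the leaf bucket, which is the source of the near-logarithmic retrieval cost the abstract describes.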
Related papers
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters [10.403248386029407]
Self-attention is a significant computational bottleneck due to its quadratic complexity in the sequence length.
In this work, we derive the scalar energy function whose gradient computes the self-attention block.
Our formulation reveals that the reduction across the sequence axis can be efficiently computed in parallel through a tree reduction.
arXiv Detail & Related papers (2024-08-07T21:16:55Z)
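The Tree Attention entry above observes that the reduction across the sequence axis can be computed in parallel as a tree reduction. The sketch below illustrates the underlying reason in NumPy for a single query: the per-chunk triple (running max, softmax normalizer, unnormalized output) combines associatively, so chunks can be merged pairwise like an allreduce. This is a single-process illustration rather than the paper's GPU-cluster decoding; chunk_stats, merge, and tree_reduce_attention are made-up names.

```python
import numpy as np

def chunk_stats(q, K, V):
    """Per-chunk partial reduction: running max m, normalizer s, unnormalized output o."""
    scores = K @ q / np.sqrt(len(q))
    m = scores.max()
    w = np.exp(scores - m)
    return m, w.sum(), w @ V

def merge(a, b):
    """Associative combine of two partial reductions (same rule as online softmax)."""
    (ma, sa, oa), (mb, sb, ob) = a, b
    m = max(ma, mb)
    return m, sa * np.exp(ma - m) + sb * np.exp(mb - m), oa * np.exp(ma - m) + ob * np.exp(mb - m)

def tree_reduce_attention(q, K, V, n_chunks=8):
    """Split the sequence axis into chunks and combine them pairwise, tree style."""
    parts = [chunk_stats(q, Kc, Vc)
             for Kc, Vc in zip(np.array_split(K, n_chunks), np.array_split(V, n_chunks))]
    while len(parts) > 1:                     # one level of the reduction tree per pass
        parts = [merge(parts[i], parts[i + 1]) if i + 1 < len(parts) else parts[i]
                 for i in range(0, len(parts), 2)]
    m, s, o = parts[0]
    return o / s

# Check against the full softmax attention for one query.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((1024, 64))
V = rng.standard_normal((1024, 64))
scores = K @ q / np.sqrt(len(q))
weights = np.exp(scores - scores.max())
ref = (weights / weights.sum()) @ V
print(np.allclose(tree_reduce_attention(q, K, V), ref))   # True
```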
- Dynamic Perceiver for Efficient Visual Recognition [87.08210214417309]
We propose Dynamic Perceiver (Dyn-Perceiver) to decouple feature extraction from the early classification task.
A feature branch serves to extract image features, while a classification branch processes a latent code assigned for classification tasks.
Early exits are placed exclusively within the classification branch, thus eliminating the need for linear separability in low-level features.
arXiv Detail & Related papers (2023-06-20T03:00:22Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
In this work, we incorporate transformers into a hierarchical framework for shape classification and for part and scene segmentation.
We also compute efficient and dynamic global cross-attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art mean accuracy on shape classification and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- DCT-Former: Efficient Self-Attention with Discrete Cosine Transform [4.622165486890318]
An intrinsic limitation of Transformer architectures arises from the computation of the dot-product attention.
Our idea takes inspiration from the world of lossy data compression (such as the JPEG algorithm) to derive an approximation of the attention module.
An extensive set of experiments shows that our method takes up less memory for the same performance, while also drastically reducing inference time.
arXiv Detail & Related papers (2022-03-02T15:25:27Z)
- Learning strides in convolutional neural networks [34.20666933112202]
This work introduces DiffStride, the first downsampling layer with learnable strides.
Experiments on audio and image classification show the generality and effectiveness of our solution.
arXiv Detail & Related papers (2022-02-03T16:03:36Z)
- Augmenting Convolutional networks with attention-based aggregation [55.97184767391253]
We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning.
We pair this learned aggregation layer with a simple patch-based convolutional network parametrized by two parameters (width and depth).
It yields surprisingly competitive trade-offs between accuracy and complexity, in particular in terms of memory consumption.
arXiv Detail & Related papers (2021-12-27T14:05:41Z)
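The attention-based aggregation described in the entry above can be pictured as a single learned query attending over the spatial positions of a convolutional feature map, in place of global average pooling. The NumPy sketch below shows only that pooling step; the query and projection matrices are random stand-ins for what would be learned parameters, and learned_aggregation, Wk, and Wv are illustrative names rather than the paper's API.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def learned_aggregation(feature_map, query, Wk, Wv):
    """Pool an HxWxC conv feature map with one learned query attending over all positions."""
    H, W, C = feature_map.shape
    tokens = feature_map.reshape(H * W, C)                          # spatial positions as tokens
    attn = softmax((tokens @ Wk) @ query / np.sqrt(len(query)))     # one weight per position
    return attn @ (tokens @ Wv)                                     # weighted sum -> global descriptor

rng = np.random.default_rng(0)
C, D = 256, 128
fmap = rng.standard_normal((14, 14, C))          # e.g. the last conv feature map
q = rng.standard_normal(D)
Wk = rng.standard_normal((C, D))
Wv = rng.standard_normal((C, D))
print(learned_aggregation(fmap, q, Wk, Wv).shape)   # (128,)
```

The per-position attention weights also serve as the global attention map mentioned in the entry, which is what enables non-local reasoning on top of an otherwise local convolutional backbone.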
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
- Self Normalizing Flows [65.73510214694987]
We propose a flexible framework for training normalizing flows by replacing expensive terms in the gradient by learned approximate inverses at each layer.
This reduces the computational complexity of each layer's exact update from $\mathcal{O}(D^3)$ to $\mathcal{O}(D^2)$.
We show experimentally that such models are remarkably stable and optimize to similar data likelihood values as their exact gradient counterparts.
arXiv Detail & Related papers (2020-11-14T09:51:51Z)
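For the Self Normalizing Flows entry above, the expensive gradient term for a linear flow layer z = Wx is the derivative of log|det W|, which equals W^{-T} and normally requires a matrix inversion. The NumPy fragment below is a rough illustration, not the paper's training procedure: the transpose of an approximate inverse R, kept close to W^{-1} by a reconstruction penalty, is used in place of W^{-T}. Here R is simply fabricated as a perturbed true inverse for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
W = np.eye(D) + 0.1 * rng.standard_normal((D, D))             # flow layer weights
R = np.linalg.inv(W) + 0.01 * rng.standard_normal((D, D))     # stand-in for a learned approximate inverse
x = rng.standard_normal(D)

# Exact gradient of log|det W| with respect to W is W^{-T}: requires a ~O(D^3) inversion.
exact_grad = np.linalg.inv(W).T

# Approximate substitute: reuse the transpose of the learned inverse, so no inversion is
# needed during training. R itself would be kept close to W^{-1} by a reconstruction
# penalty such as ||R (W x) - x||^2.
approx_grad = R.T
recon_penalty = np.sum((R @ (W @ x) - x) ** 2)

print(np.abs(exact_grad - approx_grad).max(), recon_penalty)
```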
- Layer-adaptive sparsity for the Magnitude-based Pruning [88.37510230946478]
We propose a novel importance score for global pruning, coined the layer-adaptive magnitude-based pruning (LAMP) score.
LAMP consistently outperforms popular existing schemes for layerwise sparsity selection.
arXiv Detail & Related papers (2020-10-15T09:14:02Z)
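As a companion to the LAMP entry above, here is a small NumPy sketch of a layer-adaptive magnitude score in that spirit: each squared weight is normalized by the within-layer sum of squared weights of at least its own magnitude, so scores become comparable across layers and a single global threshold yields per-layer sparsities automatically. lamp_scores and global_prune are illustrative helpers, not the authors' released code, and the exact normalization should be checked against the paper.

```python
import numpy as np

def lamp_scores(weights):
    """Score each weight by its squared magnitude divided by the sum of squared
    magnitudes of the not-smaller weights in the same layer."""
    w = weights.ravel()
    order = np.argsort(np.abs(w))                  # ascending by magnitude
    sq = w[order] ** 2
    tail = np.cumsum(sq[::-1])[::-1]               # sum of squares from each rank to the end
    scores = np.empty_like(w)
    scores[order] = sq / tail
    return scores.reshape(weights.shape)

def global_prune(layers, sparsity=0.8):
    """Keep the (1 - sparsity) fraction of weights with the largest scores across all layers."""
    all_scores = np.concatenate([lamp_scores(W).ravel() for W in layers])
    cutoff = np.quantile(all_scores, sparsity)
    return [np.where(lamp_scores(W) >= cutoff, W, 0.0) for W in layers]

rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 32)), 0.1 * rng.standard_normal((32, 10))]
pruned = global_prune(layers, sparsity=0.8)
print([float((P != 0).mean()) for P in pruned])    # per-layer densities chosen automatically
```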
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.