Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens
- URL: http://arxiv.org/abs/2305.04241v2
- Date: Sat, 27 May 2023 04:17:13 GMT
- Title: Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens
- Authors: Zhanpeng Zeng, Cole Hawkins, Mingyi Hong, Aston Zhang, Nikolaos Pappas, Vikas Singh, Shuai Zheng
- Abstract summary: We propose to significantly improve the efficiency of Transformers for ultra long sequences, by compressing the sequence into a much smaller representation at each layer.
Our algorithm is not only efficient (achieving more than $3\times$ efficiency gain compared to baselines on 4K and 16K lengths) but also offers competitive/better performance on a large number of tasks.
- Score: 65.4435926060951
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers are central in modern natural language processing and computer
vision applications. Despite recent works devoted to reducing the quadratic
cost of such models (as a function of the sequence length), dealing with ultra
long sequences (e.g., with more than 16K tokens) remains challenging.
Applications such as answering questions based on a book or summarizing a
scientific article are inefficient or infeasible. Here, we propose to
significantly improve the efficiency of Transformers for ultra long sequences,
by compressing the sequence into a much smaller representation at each layer.
Specifically, by exploiting the fact that in many tasks, only a small subset of
special tokens (we call VIP-tokens) are most relevant to the final prediction,
we propose a VIP-token centric compression (VCC) scheme which selectively
compresses the sequence based on each token's impact on approximating the
representation of the VIP-tokens. Compared with competitive baselines, our
algorithm is not only efficient (achieving more than $3\times$ efficiency gain
compared to baselines on 4K and 16K lengths), but also offers
competitive/better performance on a large number of tasks. Further, we show
that our algorithm scales to 128K tokens (or more) while consistently offering
accuracy improvement.
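To make the compression idea concrete, below is a minimal, illustrative sketch (in PyTorch) of the VIP-token centric view: non-VIP tokens are scored by how strongly the VIP tokens attend to them, and only the highest-scoring ones are kept at full length for the layer's computation. The function name, the top-k rule, and the keep_ratio parameter are assumptions for illustration; this is not the paper's actual algorithm.

```python
# A minimal, illustrative sketch only (assumed names and a top-k selection rule;
# the paper's actual VCC scheme uses a multi-resolution compression with a
# decompression step rather than simply keeping/dropping tokens).
import torch

def compress_for_vip(hidden: torch.Tensor, vip_idx: torch.Tensor, keep_ratio: float = 0.25):
    """hidden: (seq_len, dim); vip_idx: indices of the VIP tokens (e.g. question tokens)."""
    vip = hidden[vip_idx]                                   # VIP tokens stay at full resolution
    mask = torch.ones(hidden.size(0), dtype=torch.bool)
    mask[vip_idx] = False
    rest = hidden[mask]                                     # non-VIP tokens, candidates for compression

    # Score each non-VIP token by how strongly the VIP tokens attend to it:
    # tokens with little impact on the VIP representations are compressed away.
    attn = torch.softmax(vip @ rest.T / rest.size(-1) ** 0.5, dim=-1)
    scores = attn.sum(dim=0)                                # (seq_len - num_vip,)
    k = max(1, int(keep_ratio * rest.size(0)))
    keep = torch.topk(scores, k).indices

    # The transformer layer now runs on a much shorter sequence.
    return torch.cat([vip, rest[keep]], dim=0)

# Example: a 16K-token input whose first 32 tokens (the question) are the VIP tokens.
h = torch.randn(16384, 768)
short = compress_for_vip(h, vip_idx=torch.arange(32))
print(short.shape)                                          # roughly (32 + 0.25 * 16352, 768)
```

In the paper itself, low-impact tokens are compressed into coarser representations and later decompressed rather than dropped outright, which is what distinguishes VCC from plain token pruning.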
Related papers
- LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]
This paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information and improve inference speed.
By employing Dual Cross-Attention (DCA) in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes.
Experimental results in classification and dense prediction tasks show that LeMeViT has a significant $1.7\times$ speedup, fewer parameters, and competitive performance compared to the baseline models.
arXiv Detail & Related papers (2024-05-16T03:26:06Z)
- AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution to different regions of the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Unlike current Transformers, CageViT utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- Learned Token Pruning for Transformers [39.181816379061374]
The Learned Token Pruning (LTP) method reduces redundant tokens as the data passes through the different layers of a transformer.
We extensively test the performance of our approach on multiple GLUE tasks.
Preliminary results show up to 1.4x and 1.9x throughput improvement on a Tesla T4 GPU and an Intel Haswell CPU, respectively.
arXiv Detail & Related papers (2021-07-02T09:00:13Z)
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input.
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves throughput by over 40%.
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
arXiv Detail & Related papers (2021-06-03T17:57:41Z)
- Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length [40.35853878334764]
Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition.
To achieve a decent trade-off between accuracy and speed, the number of tokens is empirically set to 16x16.
We propose a Dynamic Transformer to automatically configure a proper number of tokens for each input image.
arXiv Detail & Related papers (2021-05-31T16:04:10Z)
- Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention [60.043273122786005]
We propose Nyströmformer, a model that exhibits favorable scalability as a function of sequence length.
The scalability of Nyströmformer enables application to longer sequences with thousands of tokens (a minimal sketch of the underlying Nyström approximation appears after this list).
We perform evaluations on multiple downstream tasks on the GLUE benchmark and on reviews with standard sequence length, and find that Nyströmformer performs comparably to, and in a few cases even slightly better than, the standard Transformer.
arXiv Detail & Related papers (2021-02-07T20:06:59Z)
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer, which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
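As referenced in the Nyströmformer entry above, below is a minimal sketch of Nyström-style attention approximation, assuming segment-mean landmarks and a sequence length divisible by the number of landmarks. The function name and defaults are illustrative; the actual Nyströmformer additionally uses an iterative pseudoinverse approximation and a convolutional residual, both omitted here.

```python
# A minimal sketch of Nystrom-style attention approximation, assuming landmarks are
# simple segment means and seq_len is divisible by num_landmarks; the actual
# Nystromformer also uses an iterative pseudoinverse and a convolutional residual.
import torch

def nystrom_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, num_landmarks: int = 64):
    """q, k, v: (seq_len, dim). Cost scales as O(seq_len * num_landmarks), not O(seq_len^2)."""
    n, d = q.shape
    scale = d ** -0.5
    # Landmark queries/keys: averages over contiguous segments of the sequence.
    q_l = q.reshape(num_landmarks, n // num_landmarks, d).mean(dim=1)
    k_l = k.reshape(num_landmarks, n // num_landmarks, d).mean(dim=1)

    kernel_1 = torch.softmax(q @ k_l.T * scale, dim=-1)     # (n, m)
    kernel_2 = torch.softmax(q_l @ k_l.T * scale, dim=-1)   # (m, m)
    kernel_3 = torch.softmax(q_l @ k.T * scale, dim=-1)     # (m, n)

    # softmax(QK^T / sqrt(d)) V  ~=  kernel_1 @ pinv(kernel_2) @ kernel_3 @ V
    return kernel_1 @ torch.linalg.pinv(kernel_2) @ (kernel_3 @ v)

q = k = v = torch.randn(4096, 64)
out = nystrom_attention(q, k, v)   # (4096, 64); the full 4096 x 4096 attention matrix is never formed
```

The three small kernels replace the full n-by-n attention matrix, which is what lets methods in this family handle sequences with thousands of tokens.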