LeaPformer: Enabling Linear Transformers for Autoregressive and Simultaneous Tasks via Learned Proportions
- URL: http://arxiv.org/abs/2405.13046v1
- Date: Sat, 18 May 2024 22:23:07 GMT
- Title: LeaPformer: Enabling Linear Transformers for Autoregressive and Simultaneous Tasks via Learned Proportions
- Authors: Victor Agostinelli, Sanghyun Hong, Lizhong Chen,
- Abstract summary: State-of-the-art re-weighting functions rely heavily on target sequence lengths.
We propose Learned Proportions (LeaP) and LeaPformers.
- Score: 7.478336691707095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A promising approach to preserving model performance in linearized transformers is to employ position-based re-weighting functions. However, state-of-the-art re-weighting functions rely heavily on target sequence lengths, making it difficult or impossible to apply them to autoregressive and simultaneous tasks, where the target and sometimes even the input sequence length are unknown. To address this issue, we propose Learned Proportions (LeaP) and LeaPformers. Our contribution is built on two major components. First, we generalize the dependence on explicit positional representations and sequence lengths into dependence on sequence proportions for re-weighting. Second, we replace static positional representations with dynamic proportions derived via a compact module, enabling more flexible attention concentration patterns. We evaluate LeaPformer against eight representative efficient transformers on the Long-Range Arena benchmark, showing that LeaPformer achieves the best quality-throughput trade-off, as well as LeaPformer to Wikitext-103 autoregressive language modeling and simultaneous speech-to-text translation for two language pairs, achieving competitive results.
Related papers
- Context-Scaling versus Task-Scaling in In-Context Learning [17.36757113301424]
We analyze two key components of In-Context Learning (ICL): context-scaling and task-scaling.
While transformers are capable of both context-scaling and task-scaling, we empirically show that standard Multi-Layer Perceptrons (MLPs) with vectorized input are only capable of task-scaling.
arXiv Detail & Related papers (2024-10-16T17:58:08Z) - Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z) - Functional Interpolation for Relative Positions Improves Long Context
Transformers [86.12843093589]
We propose a novel functional relative position encoding with progressive, FIRE, to improve Transformer generalization to longer contexts.
We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
arXiv Detail & Related papers (2023-10-06T17:59:11Z) - When to Use Efficient Self Attention? Profiling Text, Speech and Image
Transformer Variants [39.00433193973159]
We present the first unified study of the efficiency of self-attention-based Transformer variants spanning text, speech and vision.
We identify input length thresholds (tipping points) at which efficient Transformer variants become more efficient than vanilla models.
To conduct this analysis for speech, we introduce L-HuBERT, a novel local-attention variant of a self-supervised speech model.
arXiv Detail & Related papers (2023-06-14T17:59:02Z) - A Length-Extrapolatable Transformer [98.54835576985664]
We focus on length extrapolation, i.e., training on short texts while evaluating longer sequences.
We introduce a relative position embedding to explicitly maximize attention resolution.
We evaluate different Transformer variants with language modeling.
arXiv Detail & Related papers (2022-12-20T18:56:20Z) - DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation on the optimal number of tokens one position should focus on.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z) - Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z) - Rethinking embedding coupling in pre-trained language models [46.11201932668366]
We re-evaluate the standard practice of sharing weights between input and output embeddings in pre-trained language models.
We show that decoupled embeddings provide increased modeling flexibility, allowing us to significantly improve the efficiency of parameter allocation.
We are able to train models that achieve strong performance on the XTREME benchmark without increasing the number of parameters at the fine-tuning stage.
arXiv Detail & Related papers (2020-10-24T07:43:00Z) - Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.