Attention Alignment and Flexible Positional Embeddings Improve
Transformer Length Extrapolation
- URL: http://arxiv.org/abs/2311.00684v2
- Date: Wed, 15 Nov 2023 15:55:02 GMT
- Title: Attention Alignment and Flexible Positional Embeddings Improve
Transformer Length Extrapolation
- Authors: Ta-Chung Chi and Ting-Han Fan and Alexander I. Rudnicky
- Abstract summary: An ideal length-extrapolatable Transformer language model can handle sequences longer than the training length without any fine-tuning.
We find that the T5 family deserves a closer look, as its positional embeddings capture rich and flexible attention patterns.
We propose two attention alignment strategies via temperature scaling to alleviate T5's dispersed attention issue.
- Score: 61.305218287797025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An ideal length-extrapolatable Transformer language model can handle
sequences longer than the training length without any fine-tuning. Such
long-context utilization capability relies heavily on a flexible positional
embedding design. Upon investigating the flexibility of existing large
pre-trained Transformer language models, we find that the T5 family deserves a
closer look, as its positional embeddings capture rich and flexible attention
patterns. However, T5 suffers from the dispersed attention issue: the longer
the input sequence, the flatter the attention distribution. To alleviate the
issue, we propose two attention alignment strategies via temperature scaling.
Our findings show improvement on the long-context utilization capability of T5
on language modeling, retrieval, multi-document question answering, and code
completion tasks without any fine-tuning. This suggests that a flexible
positional embedding design and attention alignment can go a long way toward
Transformer length extrapolation.
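To make the dispersed attention issue and the temperature-scaling remedy concrete, here is a minimal NumPy sketch of single-head causal attention with an optional length-dependent temperature. The log-length schedule and the scaled_attention helper are illustrative assumptions for this page, not the paper's exact alignment strategies.

```python
# Minimal sketch: attention with a length-dependent temperature.
# The log-length schedule below is an assumption, not the paper's
# exact alignment strategies.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_attention(q, k, v, train_len=512, align=True):
    """Single-head causal attention over (seq_len, d) arrays.

    With more keys in the softmax, the attention distribution flattens
    (dispersed attention). Dividing the logits by a temperature < 1
    re-sharpens it when the sequence exceeds the training length.
    """
    seq_len, d = q.shape
    logits = (q @ k.T) / np.sqrt(d)                        # (seq_len, seq_len)
    mask = np.triu(np.ones_like(logits, dtype=bool), k=1)  # causal mask
    logits = np.where(mask, -np.inf, logits)
    if align and seq_len > train_len:
        temperature = np.log(train_len) / np.log(seq_len)  # assumed schedule, < 1
        logits = logits / temperature
    return softmax(logits) @ v

# Toy usage: a model "trained" on 512 tokens evaluated on 1024 tokens.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(1024, 64)) for _ in range(3))
out = scaled_attention(q, k, v, train_len=512)
```

Dividing by a temperature below one multiplies every logit by the same factor above one, so the ordering of attended positions is preserved while the distribution is re-sharpened against the flattening caused by the larger number of keys.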
Related papers
- Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech [9.982121768809854]
We introduce enhancements aimed at AR Transformer-based encoder-decoder text-to-speech systems.
Our approach uses an alignment mechanism to provide cross-attention operations with relative location information.
A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system.
arXiv Detail & Related papers (2024-10-29T16:17:01Z)
- Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts.
We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, ALiBi, and KERPLE.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks (a rough sketch of a FIRE-style bias appears after this list).
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
- CoLT5: Faster Long-Range Transformers with Conditional Computation [65.83586041097763]
We show that CoLT5 achieves stronger performance than LongT5 with much faster training and inference.
CoLT5 can effectively and tractably make use of extremely long inputs, showing strong gains up to 64k input length.
arXiv Detail & Related papers (2023-03-17T03:28:17Z)
- A Length-Extrapolatable Transformer [98.54835576985664]
We focus on length extrapolation, i.e., training on short texts while evaluating longer sequences.
We introduce a relative position embedding to explicitly maximize attention resolution.
We evaluate different Transformer variants with language modeling.
arXiv Detail & Related papers (2022-12-20T18:56:20Z)
- Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis [72.71398034617607]
We dissect a relative positional embedding design, ALiBi, via the lens of receptive field analysis.
We modify the vanilla sinusoidal positional embedding to create Sandwich, the first parameter-free relative positional embedding design that truly uses length information longer than the training sequence (a sketch of the ALiBi bias being dissected appears after this list).
arXiv Detail & Related papers (2022-12-20T15:40:17Z)
- LongT5: Efficient Text-To-Text Transformer for Long Sequences [8.743996838160825]
We present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time.
We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
arXiv Detail & Related papers (2021-12-15T06:35:29Z)
- Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search [84.94597821711808]
We extend PoWER-BERT (Goyal et al., 2020) and propose Length-Adaptive Transformer that can be used for various inference scenarios after one-shot training.
We conduct a multi-objective evolutionary search to find a length configuration that maximizes accuracy and minimizes the efficiency metric (i.e., computational cost) under any given computational budget.
We empirically verify the utility of the proposed approach by demonstrating the superior accuracy-efficiency trade-off under various setups.
arXiv Detail & Related papers (2020-10-14T12:28:08Z)
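Following up on the forward reference in the FIRE entry above: below is a rough sketch of a FIRE-style relative position bias, in which the key distance is passed through a monotone transform and normalized by the (thresholded) query position so that the input to a small learned function stays in [0, 1] at any sequence length. The log1p transform, the threshold of 16, and the tiny fire_bias MLP are illustrative assumptions, not the paper's exact parameterization.

```python
# Rough sketch of a FIRE-style relative position bias. The psi transform,
# the threshold, and the tiny MLP are assumptions for illustration.
import numpy as np

def psi(x):
    return np.log1p(x)  # monotone compression of distances

def fire_bias(seq_len, mlp_w1, mlp_w2, threshold=16):
    """Return a (seq_len, seq_len) bias matrix, -inf above the diagonal.

    Each lower-triangular entry is f(psi(i - j) / psi(max(threshold, i))),
    so the input to f stays in [0, 1] regardless of sequence length, which
    is what lets the bias generalize beyond the training context.
    """
    bias = np.full((seq_len, seq_len), -np.inf)       # -inf acts as the causal mask
    for i in range(seq_len):
        j = np.arange(i + 1)
        x = psi(i - j) / psi(max(threshold, i))       # normalized distance in [0, 1]
        h = np.tanh(np.outer(x, mlp_w1))              # (i + 1, hidden)
        bias[i, : i + 1] = h @ mlp_w2                 # (i + 1,)
    return bias

# Toy usage with random (untrained) MLP weights.
rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=8), rng.normal(size=8)
b = fire_bias(seq_len=32, mlp_w1=w1, mlp_w2=w2)       # add to attention logits
```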
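Similarly, for the receptive-field-analysis entry: a minimal sketch of the ALiBi bias that paper dissects, i.e., a per-head linear penalty on key distance added to the attention logits, using the commonly cited geometric slope schedule. The alibi_bias helper is assumed for illustration; Sandwich, roughly speaking, swaps the linear penalty for a distance-only bias built from sinusoidal position embeddings.

```python
# Minimal sketch of the ALiBi relative bias analyzed above: a per-head
# linear penalty on key distance added to the attention logits.
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Return a (num_heads, seq_len, seq_len) causal bias tensor."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    distance = i - j                                        # >= 0 at/below the diagonal
    # Commonly used geometric slopes: 2^(-8k / num_heads) for head k = 1..H.
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    bias = -slopes[:, None, None] * distance                # farther keys penalized more
    return np.where(j > i, -np.inf, bias)                   # causal mask

b = alibi_bias(seq_len=8, num_heads=4)  # add to (q @ k.T) / sqrt(d) per head
```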
The related papers list above is automatically generated from the titles and abstracts of the papers on this site.