Attention Alignment and Flexible Positional Embeddings Improve
Transformer Length Extrapolation
- URL: http://arxiv.org/abs/2311.00684v2
- Date: Wed, 15 Nov 2023 15:55:02 GMT
- Title: Attention Alignment and Flexible Positional Embeddings Improve
Transformer Length Extrapolation
- Authors: Ta-Chung Chi and Ting-Han Fan and Alexander I. Rudnicky
- Abstract summary: An ideal length-extrapolatable Transformer language model can handle sequences longer than the training length without any fine-tuning.
We find that the T5 family deserves a closer look, as its positional embeddings capture rich and flexible attention patterns.
We propose two attention alignment strategies via temperature scaling to alleviate T5's dispersed attention issue.
- Score: 61.305218287797025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An ideal length-extrapolatable Transformer language model can handle
sequences longer than the training length without any fine-tuning. Such
long-context utilization capability relies heavily on a flexible positional
embedding design. Upon investigating the flexibility of existing large
pre-trained Transformer language models, we find that the T5 family deserves a
closer look, as its positional embeddings capture rich and flexible attention
patterns. However, T5 suffers from the dispersed attention issue: the longer
the input sequence, the flatter the attention distribution. To alleviate the
issue, we propose two attention alignment strategies via temperature scaling.
Our findings show improvement on the long-context utilization capability of T5
on language modeling, retrieval, multi-document question answering, and code
completion tasks without any fine-tuning. This suggests that a flexible
positional embedding design and attention alignment can go a long way toward
Transformer length extrapolation.
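To make the dispersed attention issue and the temperature-scaling remedy concrete, here is a minimal NumPy sketch of single-head causal attention with an optional length-dependent temperature. The log-length schedule and the scaled_attention helper are illustrative assumptions for this page, not the paper's exact alignment strategies.

```python
# Minimal sketch: attention with a length-dependent temperature.
# The log-length schedule below is an assumption, not the paper's
# exact alignment strategies.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_attention(q, k, v, train_len=512, align=True):
    """Single-head causal attention over (seq_len, d) arrays.

    With more keys in the softmax, the attention distribution flattens
    (dispersed attention). Dividing the logits by a temperature < 1
    re-sharpens it when the sequence exceeds the training length.
    """
    seq_len, d = q.shape
    logits = (q @ k.T) / np.sqrt(d)                        # (seq_len, seq_len)
    mask = np.triu(np.ones_like(logits, dtype=bool), k=1)  # causal mask
    logits = np.where(mask, -np.inf, logits)
    if align and seq_len > train_len:
        temperature = np.log(train_len) / np.log(seq_len)  # assumed schedule, < 1
        logits = logits / temperature
    return softmax(logits) @ v

# Toy usage: a model "trained" on 512 tokens evaluated on 1024 tokens.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(1024, 64)) for _ in range(3))
out = scaled_attention(q, k, v, train_len=512)
```

Dividing by a temperature below one multiplies every logit by the same factor above one, so the ordering of attended positions is preserved while the distribution is re-sharpened against the flattening caused by the larger number of keys.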
Related papers
- Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech [9.982121768809854]
We introduce enhancements aimed at AR Transformer-based encoder-decoder text-to-speech systems.
Our approach uses an alignment mechanism to provide cross-attention operations with relative location information.
A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system.
arXiv Detail & Related papers (2024-10-29T16:17:01Z)
- Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts.
We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, ALiBi, and KERPLE.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks (a rough sketch of a FIRE-style bias appears after this list).
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
- CoLT5: Faster Long-Range Transformers with Conditional Computation [65.83586041097763]
We show that CoLT5 achieves stronger performance than LongT5 with much faster training and inference.
CoLT5 can effectively and tractably make use of extremely long inputs, showing strong gains up to 64k input length.
arXiv Detail & Related papers (2023-03-17T03:28:17Z)
- A Length-Extrapolatable Transformer [98.54835576985664]
We focus on length extrapolation, i.e., training on short texts while evaluating longer sequences.
We introduce a relative position embedding to explicitly maximize attention resolution.
We evaluate different Transformer variants with language modeling.
arXiv Detail & Related papers (2022-12-20T18:56:20Z)
- Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis [72.71398034617607]
We dissect a relative positional embedding design, ALiBi, via the lens of receptive field analysis.
We modify the vanilla sinusoidal positional embedding to create Sandwich, the first parameter-free relative positional embedding design that truly uses length information longer than the training sequence (a sketch of the ALiBi bias being dissected appears after this list).
arXiv Detail & Related papers (2022-12-20T15:40:17Z)
- LongT5: Efficient Text-To-Text Transformer for Long Sequences [8.743996838160825]
We present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time.
We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
arXiv Detail & Related papers (2021-12-15T06:35:29Z)
- Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search [84.94597821711808]
We extend PoWER-BERT (Goyal et al., 2020) and propose Length-Adaptive Transformer that can be used for various inference scenarios after one-shot training.
We conduct a multi-objective evolutionary search to find a length configuration that maximizes accuracy and minimizes the efficiency metric (i.e., computational cost) under any given computational budget.
We empirically verify the utility of the proposed approach by demonstrating the superior accuracy-efficiency trade-off under various setups.
arXiv Detail & Related papers (2020-10-14T12:28:08Z)
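Following up on the forward reference in the FIRE entry above: below is a rough sketch of a FIRE-style relative position bias, in which the key distance is passed through a monotone transform and normalized by the (thresholded) query position so that the input to a small learned function stays in [0, 1] at any sequence length. The log1p transform, the threshold of 16, and the tiny fire_bias MLP are illustrative assumptions, not the paper's exact parameterization.

```python
# Rough sketch of a FIRE-style relative position bias. The psi transform,
# the threshold, and the tiny MLP are assumptions for illustration.
import numpy as np

def psi(x):
    return np.log1p(x)  # monotone compression of distances

def fire_bias(seq_len, mlp_w1, mlp_w2, threshold=16):
    """Return a (seq_len, seq_len) bias matrix, -inf above the diagonal.

    Each lower-triangular entry is f(psi(i - j) / psi(max(threshold, i))),
    so the input to f stays in [0, 1] regardless of sequence length, which
    is what lets the bias generalize beyond the training context.
    """
    bias = np.full((seq_len, seq_len), -np.inf)       # -inf acts as the causal mask
    for i in range(seq_len):
        j = np.arange(i + 1)
        x = psi(i - j) / psi(max(threshold, i))       # normalized distance in [0, 1]
        h = np.tanh(np.outer(x, mlp_w1))              # (i + 1, hidden)
        bias[i, : i + 1] = h @ mlp_w2                 # (i + 1,)
    return bias

# Toy usage with random (untrained) MLP weights.
rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=8), rng.normal(size=8)
b = fire_bias(seq_len=32, mlp_w1=w1, mlp_w2=w2)       # add to attention logits
```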
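Similarly, for the receptive-field-analysis entry: a minimal sketch of the ALiBi bias that paper dissects, i.e., a per-head linear penalty on key distance added to the attention logits, using the commonly cited geometric slope schedule. The alibi_bias helper is assumed for illustration; Sandwich, roughly speaking, swaps the linear penalty for a distance-only bias built from sinusoidal position embeddings.

```python
# Minimal sketch of the ALiBi relative bias analyzed above: a per-head
# linear penalty on key distance added to the attention logits.
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Return a (num_heads, seq_len, seq_len) causal bias tensor."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    distance = i - j                                        # >= 0 at/below the diagonal
    # Commonly used geometric slopes: 2^(-8k / num_heads) for head k = 1..H.
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    bias = -slopes[:, None, None] * distance                # farther keys penalized more
    return np.where(j > i, -np.inf, bias)                   # causal mask

b = alibi_bias(seq_len=8, num_heads=4)  # add to (q @ k.T) / sqrt(d) per head
```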
The related papers list above is automatically generated from the titles and abstracts of the papers on this site.