Exploring Transformer Extrapolation
- URL: http://arxiv.org/abs/2307.10156v2
- Date: Tue, 19 Dec 2023 08:02:03 GMT
- Title: Exploring Transformer Extrapolation
- Authors: Zhen Qin, Yiran Zhong, Hui Deng
- Abstract summary: Length extrapolation has attracted considerable attention recently since it allows transformers to be tested on longer sequences than those used in training.
Previous research has shown that this property can be attained by using carefully designed Relative Positional Encodings (RPEs).
This paper attempts to determine what types of RPEs allow for length extrapolation through a thorough mathematical and empirical analysis.
- Score: 19.729619149887014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Length extrapolation has attracted considerable attention recently since it
allows transformers to be tested on longer sequences than those used in
training. Previous research has shown that this property can be attained by
using carefully designed Relative Positional Encodings (RPEs). While these
methods perform well on a variety of corpora, the conditions for length
extrapolation have yet to be investigated. This paper attempts to determine
what types of RPEs allow for length extrapolation through a thorough
mathematical and empirical analysis. We discover that a transformer is certain
to possess this property as long as the series that corresponds to the RPE's
exponential converges. Two practices are derived from the conditions and
examined in language modeling tasks on a variety of corpora. As a bonus from
the conditions, we derive a new Theoretical Receptive Field (TRF) to measure
the receptive field of RPEs without taking any training steps. Extensive
experiments are conducted on the Wikitext-103, Books, Github, and WikiBook
datasets to demonstrate the viability of our discovered conditions. We also
compare TRF to Empirical Receptive Field (ERF) across different models, showing
consistently matched trends on the aforementioned datasets. The code is
available at https://github.com/OpenNLPLab/Rpe.
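As a rough numerical illustration of the paper's central claim (extrapolation is guaranteed when the series associated with the exponential of the RPE converges), the sketch below compares partial sums of exp(p(t)) for an ALiBi-style linear-decay bias, whose series converges, against a constant bias, whose series grows without bound. This is a minimal sketch under assumed additive biases; the function names, slope value, and sequence lengths are illustrative and not the paper's exact formulation.

```python
import numpy as np

# Illustration (not the paper's exact parameterization): a decoder-style
# transformer with an additive RPE bias p(t) on relative distance t is
# expected to extrapolate when the series sum_t exp(p(t)) converges.

def alibi_bias(t, slope=0.5):
    # ALiBi-style linear decay: p(t) = -slope * t  (geometric series, converges)
    return -slope * t

def constant_bias(t):
    # No decay: p(t) = 0 for every distance (partial sums grow without bound)
    return np.zeros_like(t, dtype=float)

def partial_sums(bias_fn, max_len=4096):
    # Partial sums of exp(p(t)) over relative distances 0..max_len-1.
    t = np.arange(max_len)
    return np.cumsum(np.exp(bias_fn(t)))

if __name__ == "__main__":
    for name, fn in [("ALiBi-like linear decay", alibi_bias),
                     ("constant (no decay)", constant_bias)]:
        s = partial_sums(fn)
        # A flattening tail suggests convergence; steady growth suggests divergence.
        print(f"{name}: S(1024)={s[1023]:.3f}  S(4096)={s[4095]:.3f}")
```

Running this shows the decayed-bias partial sums flattening out while the constant-bias sums keep growing with sequence length, which mirrors the intuition behind the convergence condition and the derived Theoretical Receptive Field.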
Related papers
- Bayesian scaling laws for in-context learning [72.17734205418502]
In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates.
We show that ICL approximates a Bayesian learner and develop a family of novel Bayesian scaling laws for ICL.
arXiv Detail & Related papers (2024-10-21T21:45:22Z)
- Tilt your Head: Activating the Hidden Spatial-Invariance of Classifiers [0.7704032792820767]
Deep neural networks are applied in more and more areas of everyday life.
They still lack essential abilities, such as robustly dealing with spatially transformed input signals.
We propose a novel technique to emulate such an inference process for neural nets.
arXiv Detail & Related papers (2024-05-06T09:47:29Z)
- Length Extrapolation of Transformers: A Survey from the Perspective of Positional Encoding [40.289596031245374]
All Transformer-based models, including large language models (LLMs), suffer from a preset length limit.
Numerous methods have emerged to enhance the length extrapolation of Transformers.
This survey aims to enable the reader to gain a deep understanding of existing methods and provide stimuli for future research.
arXiv Detail & Related papers (2023-12-28T14:42:24Z)
- Generative Modeling of Regular and Irregular Time Series Data via Koopman VAEs [50.25683648762602]
We introduce Koopman VAE, a new generative framework that is based on a novel design for the model prior.
Inspired by Koopman theory, we represent the latent conditional prior dynamics using a linear map.
KoVAE outperforms state-of-the-art GAN and VAE methods across several challenging synthetic and real-world time series generation benchmarks.
arXiv Detail & Related papers (2023-10-04T07:14:43Z)
- All Roads Lead to Rome? Exploring the Invariance of Transformers' Representations [69.3461199976959]
We propose a model based on invertible neural networks, BERT-INN, to learn the Bijection Hypothesis.
We show the advantage of BERT-INN both theoretically and through extensive experiments.
arXiv Detail & Related papers (2023-05-23T22:30:43Z)
- Rethink Long-tailed Recognition with Vision Transformers [18.73285611631722]
Vision Transformers (ViT) are hard to train with long-tailed data.
ViT learns generalized features in an unsupervised manner.
Predictive Distribution Calibration (PDC) is proposed as a novel metric for Long-Tailed Recognition.
arXiv Detail & Related papers (2023-02-28T03:36:48Z)
- Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions.
arXiv Detail & Related papers (2022-05-26T14:51:30Z)
- Causal Transformer for Estimating Counterfactual Outcomes [18.640006398066188]
Estimating counterfactual outcomes over time from observational data is relevant for many applications.
We develop a novel Causal Transformer for estimating counterfactual outcomes over time.
Our model is specifically designed to capture complex, long-range dependencies among time-varying confounders.
arXiv Detail & Related papers (2022-04-14T22:40:09Z)
- TACTiS: Transformer-Attentional Copulas for Time Series [76.71406465526454]
The estimation of time-varying quantities is a fundamental component of decision making in fields such as healthcare and finance.
We propose a versatile method that estimates joint distributions using an attention-based decoder.
We show that our model produces state-of-the-art predictions on several real-world datasets.
arXiv Detail & Related papers (2022-02-07T21:37:29Z)
- Interpretable Feature Construction for Time Series Extrinsic Regression [0.028675177318965035]
In some application domains, the target variable is numerical and the problem is known as time series extrinsic regression (TSER).
We suggest an extension of a Bayesian method for robust and interpretable feature construction and selection in the context of TSER.
Our approach takes a relational view of TSER: (i) we build various simple representations of the time series and store them in a relational data scheme; then (ii) a propositionalisation technique is applied to build interpretable features from the secondary tables to "flatten" the data.
arXiv Detail & Related papers (2021-03-15T08:12:19Z)
- Fork or Fail: Cycle-Consistent Training with Many-to-One Mappings [67.11712279612583]
Cycle-consistent training is widely used for learning a forward and inverse mapping between two domains of interest.
We develop a conditional variational autoencoder (CVAE) approach that can be viewed as converting surjective mappings to implicit bijections.
Our pipeline can capture such many-to-one mappings during cycle training while promoting graph-to-text diversity.
arXiv Detail & Related papers (2020-12-14T10:59:59Z)