Sequence Length is a Domain: Length-based Overfitting in Transformer Models
- URL: http://arxiv.org/abs/2109.07276v1
- Date: Wed, 15 Sep 2021 13:25:19 GMT
- Title: Sequence Length is a Domain: Length-based Overfitting in Transformer Models
- Authors: Dušan Variš and Ondřej Bojar
- Abstract summary: In machine translation, neural systems perform worse on very long sequences than the preceding phrase-based translation approaches.
We show that the observed drop in performance is due to the hypothesis length corresponding to the lengths seen by the model during training rather than the length of the input sequence.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based sequence-to-sequence architectures, while achieving
state-of-the-art results on a large number of NLP tasks, can still suffer from
overfitting during training. In practice, this is usually countered either by
applying regularization methods (e.g. dropout, L2-regularization) or by
providing huge amounts of training data. Additionally, Transformer and other
architectures are known to struggle when generating very long sequences. For
example, in machine translation, the neural-based systems perform worse on very
long sequences when compared to the preceding phrase-based translation
approaches (Koehn and Knowles, 2017).
We present results which suggest that the issue might also be in the mismatch
between the length distributions of the training and validation data combined
with the aforementioned tendency of the neural networks to overfit to the
training data. We demonstrate on a simple string editing task and a machine
translation task that the Transformer model performance drops significantly
when facing sequences of length diverging from the length distribution in the
training data. Additionally, we show that the observed drop in performance is
due to the hypothesis length corresponding to the lengths seen by the model
during training rather than the length of the input sequence.
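As a concrete illustration of the kind of analysis the abstract describes, the sketch below (not from the paper; the function name, bucket edges, and the use of sacrebleu are assumptions for illustration) scores translation output separately per length bucket, which makes a quality drop on lengths rarely seen during training visible. Bucketing by hypothesis length rather than source length reflects the abstract's point that the generated length, not the input length, is the decisive factor.

```python
# Minimal sketch, not the paper's evaluation code: report BLEU per length
# bucket so that a drop on out-of-distribution lengths becomes visible.
# Assumes sacrebleu is installed; bucket edges are arbitrary illustrations.
from collections import defaultdict

import sacrebleu


def bleu_by_length_bucket(hypotheses, references, edges=(10, 20, 40, 80),
                          length_of=lambda hyp, ref: len(hyp.split())):
    """Group sentence pairs into length buckets and compute BLEU per bucket.

    By default pairs are bucketed by hypothesis length; pass
    length_of=lambda hyp, ref: len(ref.split()) to bucket by reference length.
    """
    buckets = defaultdict(lambda: ([], []))
    for hyp, ref in zip(hypotheses, references):
        n = length_of(hyp, ref)
        label = next((f"<={e}" for e in edges if n <= e), f">{edges[-1]}")
        buckets[label][0].append(hyp)
        buckets[label][1].append(ref)
    return {label: sacrebleu.corpus_bleu(hyps, [refs]).score
            for label, (hyps, refs) in buckets.items()}
```

Comparing the per-bucket scores against the training-data length histogram then shows whether the low-scoring buckets are exactly the lengths under-represented in training.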
Related papers
- Rough Transformers: Lightweight Continuous-Time Sequence Modelling with Path Signatures [46.58170057001437]
We introduce the Rough Transformer, a variation of the Transformer model that operates on continuous-time representations of input sequences.
We find that, on a variety of time-series-related tasks, Rough Transformers consistently outperform their vanilla attention counterparts.
arXiv Detail & Related papers (2024-05-31T14:00:44Z)
- Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum [30.46329559544246]
We introduce dataset decomposition, a novel variable sequence length training technique.
We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach.
Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks.
arXiv Detail & Related papers (2024-05-21T22:26:01Z)
- Transformers Can Achieve Length Generalization But Not Robustly [76.06308648699357]
We show that the success of length generalization is intricately linked to the data format and the type of position encoding.
We show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length.
arXiv Detail & Related papers (2024-02-14T18:18:29Z)
- Addressing the Length Bias Problem in Document-Level Neural Machine Translation [29.590471092149375]
Document-level neural machine translation (DNMT) has shown promising results by incorporating more context information.
DNMT suffers from significant translation quality degradation when decoding documents that are much shorter or longer than the maximum sequence length.
We propose to improve the DNMT model in training method, attention mechanism, and decoding strategy.
arXiv Detail & Related papers (2023-11-20T08:29:52Z)
- LongNet: Scaling Transformers to 1,000,000,000 Tokens [146.4077038371075]
LongNet is a Transformer variant that can scale sequence length to more than 1 billion tokens.
Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
arXiv Detail & Related papers (2023-07-05T17:59:38Z)
- Causal Transformer for Estimating Counterfactual Outcomes [18.640006398066188]
Estimating counterfactual outcomes over time from observational data is relevant for many applications.
We develop a novel Causal Transformer for estimating counterfactual outcomes over time.
Our model is specifically designed to capture complex, long-range dependencies among time-varying confounders.
arXiv Detail & Related papers (2022-04-14T22:40:09Z)
- ChunkFormer: Learning Long Time Series with Multi-stage Chunked Transformer [0.0]
Original Transformer-based models adopt an attention mechanism to discover global information along a sequence.
ChunkFormer splits the long sequences into smaller sequence chunks for the attention calculation.
In this way, the proposed model gradually learns both local and global information without changing the total length of the input sequences (a rough sketch of within-chunk attention follows this list).
arXiv Detail & Related papers (2021-12-30T15:06:32Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
- Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search [84.94597821711808]
We extend PoWER-BERT (Goyal et al., 2020) and propose Length-Adaptive Transformer that can be used for various inference scenarios after one-shot training.
We conduct a multi-objective evolutionary search to find a length configuration that maximizes the accuracy and minimizes the efficiency metric under any given computational budget.
We empirically verify the utility of the proposed approach by demonstrating the superior accuracy-efficiency trade-off under various setups.
arXiv Detail & Related papers (2020-10-14T12:28:08Z)
- On Long-Tailed Phenomena in Neural Machine Translation [50.65273145888896]
State-of-the-art Neural Machine Translation (NMT) models struggle with generating low-frequency tokens.
We propose a new loss function, the Anti-Focal loss, to better adapt model training to the structural dependencies of conditional text generation.
We show the efficacy of the proposed technique on a number of Machine Translation (MT) datasets, demonstrating that it leads to significant gains over cross-entropy.
arXiv Detail & Related papers (2020-10-10T07:00:57Z)
- Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition [66.47000813920617]
We propose a spike-triggered non-autoregressive transformer model for end-to-end speech recognition.
The proposed model can accurately predict the length of the target sequence and achieve a competitive performance.
The model even achieves a real-time factor of 0.0056, faster than all mainstream speech recognition models.
arXiv Detail & Related papers (2020-05-16T08:27:20Z)
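The within-chunk attention mentioned in the ChunkFormer entry above can be illustrated with the short sketch below. This is only a rough reading of the blurb, with assumed shapes, no Q/K/V projections, and none of the paper's multi-stage design; it is not the authors' implementation.

```python
# Rough illustration (assumed, not ChunkFormer's implementation): restrict
# self-attention to fixed-size chunks so compute grows with the chunk size
# instead of the full sequence length. Q/K/V projections are omitted.
import torch
import torch.nn.functional as F


def chunked_self_attention(x: torch.Tensor, chunk_size: int = 64) -> torch.Tensor:
    """x: (batch, seq_len, dim); seq_len must be divisible by chunk_size."""
    b, n, d = x.shape
    chunks = x.view(b, n // chunk_size, chunk_size, d)           # (b, c, s, d)
    scores = torch.einsum("bcsd,bctd->bcst", chunks, chunks) / d ** 0.5
    attn = F.softmax(scores, dim=-1)                             # within-chunk only
    out = torch.einsum("bcst,bctd->bcsd", attn, chunks)
    return out.reshape(b, n, d)


# Example: a 1,024-token sequence attended in 16 chunks of 64 tokens each.
y = chunked_self_attention(torch.randn(2, 1024, 32))
```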