Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors
- URL: http://arxiv.org/abs/2310.02980v4
- Date: Sun, 28 Apr 2024 13:52:55 GMT
- Title: Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors
- Authors: Ido Amos, Jonathan Berant, Ankit Gupta,
- Abstract summary: We show that pretraining with standard denoising objectives leads to dramatic gains across multiple architectures.
In stark contrast to prior works, we find vanilla Transformers to match the performance of S4 on Long Range Arena when properly pretrained.
- Score: 44.5740422079
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modeling long-range dependencies across sequences is a longstanding goal in machine learning and has led to architectures, such as state space models, that dramatically outperform Transformers on long sequences. However, these impressive empirical gains have been by and large demonstrated on benchmarks (e.g. Long Range Arena), where models are randomly initialized and trained to predict a target label from an input sequence. In this work, we show that random initialization leads to gross overestimation of the differences between architectures and that pretraining with standard denoising objectives, using $\textit{only the downstream task data}$, leads to dramatic gains across multiple architectures and to very small gaps between Transformers and state space models (SSMs). In stark contrast to prior works, we find vanilla Transformers to match the performance of S4 on Long Range Arena when properly pretrained, and we improve the best reported results of SSMs on the PathX-256 task by 20 absolute points. Subsequently, we analyze the utility of previously-proposed structured parameterizations for SSMs and show they become mostly redundant in the presence of data-driven initialization obtained through pretraining. Our work shows that, when evaluating different architectures on supervised tasks, incorporation of data-driven priors via pretraining is essential for reliable performance estimation, and can be done efficiently.
Related papers
- ImageNet-RIB Benchmark: Large Pre-Training Datasets Don't Guarantee Robustness after Fine-Tuning [30.422932548359952]
We introduce a new robust fine-tuning benchmark, ImageNet-RIB (Robustness Inheritance Benchmark)
The benchmark consists of related but distinct specialized (downstream) tasks.
We find that the continual learning methods, EWC and LwF maintain robustness after fine-tuning.
arXiv Detail & Related papers (2024-10-28T22:33:22Z) - Timer: Generative Pre-trained Transformers Are Large Time Series Models [83.03091523806668]
This paper aims at the early development of large time series models (LTSM)
During pre-training, we curate large-scale datasets with up to 1 billion time points.
To meet diverse application needs, we convert forecasting, imputation, and anomaly detection of time series into a unified generative task.
arXiv Detail & Related papers (2024-02-04T06:55:55Z) - Large Pre-trained time series models for cross-domain Time series analysis tasks [20.228846068418765]
We propose a novel method of textitadaptive segmentation that automatically identifies optimal dataset-specific segmentation strategy during pre-training.
This enables LPTM to perform similar to or better than domain-specific state-of-art model when fine-tuned to different downstream time-series analysis tasks and under zero-shot settings.
arXiv Detail & Related papers (2023-11-19T20:16:16Z) - A Transformer-based Framework For Multi-variate Time Series: A Remaining
Useful Life Prediction Use Case [4.0466311968093365]
This work proposed an encoder-transformer architecture-based framework for time series prediction.
We validated the effectiveness of the proposed framework on all four sets of the C-MAPPS benchmark dataset.
To enable the model awareness of the initial stages of the machine life and its degradation path, a novel expanding window method was proposed.
arXiv Detail & Related papers (2023-08-19T02:30:35Z) - Structured State Space Models for In-Context Reinforcement Learning [30.189834820419446]
Structured state space sequence (S4) models have recently achieved state-of-the-art performance on long-range sequence modeling tasks.
We propose a modification to a variant of S4 that enables us to initialise and reset the hidden state in parallel.
We show that our modified architecture runs faster than Transformers in sequence length and performs better than RNN's on a simple memory-based task.
arXiv Detail & Related papers (2023-03-07T15:32:18Z) - Pretraining Without Attention [114.99187017618408]
This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs)
BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation.
arXiv Detail & Related papers (2022-12-20T18:50:08Z) - Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z) - Self-Supervised Pre-Training for Transformer-Based Person
Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID)
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z) - Efficiently Modeling Long Sequences with Structured State Spaces [15.456254157293836]
We propose a new sequence model based on a new parameterization for the fundamental state space model.
S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet.
arXiv Detail & Related papers (2021-10-31T03:32:18Z) - Document Ranking with a Pretrained Sequence-to-Sequence Model [56.44269917346376]
We show how a sequence-to-sequence model can be trained to generate relevance labels as "target words"
Our approach significantly outperforms an encoder-only model in a data-poor regime.
arXiv Detail & Related papers (2020-03-14T22:29:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.