Investigating Efficiently Extending Transformers for Long Input
Summarization
- URL: http://arxiv.org/abs/2208.04347v1
- Date: Mon, 8 Aug 2022 18:10:58 GMT
- Title: Investigating Efficiently Extending Transformers for Long Input
Summarization
- Authors: Jason Phang, Yao Zhao, Peter J. Liu
- Abstract summary: We investigate what model architectural changes and pretraining paradigms can most efficiently adapt a pretrained Transformer for long input summarization.
We find that a staggered, block-local Transformer with global tokens strikes a good balance of performance and efficiency.
We introduce PEGASUS-X, an extension of the PEGASUS model with additional long input pretraining to handle inputs of up to 16K tokens.
- Score: 37.622021824791254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large pretrained Transformer models have proven highly capable at
tackling natural language tasks, handling long sequence inputs continues to be
a significant challenge. One such task is long input summarization, where
inputs are longer than the maximum input context of most pretrained models.
Through an extensive set of experiments, we investigate what model
architectural changes and pretraining paradigms can most efficiently adapt a
pretrained Transformer for long input summarization. We find that a staggered,
block-local Transformer with global encoder tokens strikes a good balance of
performance and efficiency, and that an additional pretraining phase on long
sequences meaningfully improves downstream summarization performance. Based on
our findings, we introduce PEGASUS-X, an extension of the PEGASUS model with
additional long input pretraining to handle inputs of up to 16K tokens.
PEGASUS-X achieves strong performance on long input summarization tasks
comparable with much larger models while adding few additional parameters and
not requiring model parallelism to train.
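As a rough illustration of the encoder layout described in the abstract, the sketch below implements block-local self-attention with a small set of global tokens in PyTorch: local tokens attend within their block plus to the globals, global tokens attend over the whole sequence, and alternating layers can stagger the block boundaries by half a block. The block size, the number of global tokens, the lack of separate query/key/value projections, and the half-block stagger value are illustrative assumptions, not the paper's actual configuration.

```python
# Hedged sketch of staggered, block-local self-attention with global encoder
# tokens. Sizes and the shared (unprojected) Q/K/V are simplifications.
import torch


def block_local_attention(x, g, block_size=64, stagger=False):
    """x: (batch, seq_len, dim) token states, seq_len a multiple of block_size.
    g: (batch, n_global, dim) global token states."""
    b, n, d = x.shape
    scale = d ** -0.5

    # Staggering: shift block boundaries by half a block on alternating layers.
    shift = block_size // 2 if stagger else 0
    x_shifted = torch.roll(x, shifts=shift, dims=1)

    num_blocks = n // block_size
    xb = x_shifted.view(b, num_blocks, block_size, d)             # (b, nb, bs, d)
    gb = g.unsqueeze(1).expand(b, num_blocks, g.shape[1], d)      # globals visible to every block

    # Local tokens attend to their own block plus all global tokens.
    kv = torch.cat([xb, gb], dim=2)                               # (b, nb, bs + n_global, d)
    attn = torch.softmax(torch.einsum("bnqd,bnkd->bnqk", xb, kv) * scale, dim=-1)
    local_out = (attn @ kv).reshape(b, n, d)
    local_out = torch.roll(local_out, shifts=-shift, dims=1)      # undo the stagger shift

    # Global tokens attend over the full sequence plus the other globals.
    full = torch.cat([x, g], dim=1)
    g_attn = torch.softmax(torch.einsum("bqd,bkd->bqk", g, full) * scale, dim=-1)
    global_out = torch.einsum("bqk,bkd->bqd", g_attn, full)
    return local_out, global_out


# Example: 256-token inputs, 8 global tokens, a staggered layer.
x = torch.randn(2, 256, 64)
g = torch.randn(2, 8, 64)
local_out, global_out = block_local_attention(x, g, block_size=64, stagger=True)
```

Flipping `stagger` on alternating encoder layers lets information cross block boundaries without resorting to overlapping blocks, which is the performance-versus-efficiency balance the abstract points to.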
Related papers
- Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts.
We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks (a hedged sketch of this style of learned positional bias appears after the Related papers list).
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
- Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator [24.690247474891958]
Fourier Transformer is able to significantly reduce computational costs while retaining the ability to inherit from various large pretrained models.
Our model achieves state-of-the-art performance among transformer-based models on the long-range modeling benchmark LRA.
For generative seq-to-seq tasks, including CNN/DailyMail and ELI5, our model outperforms the standard BART by inheriting the BART weights.
arXiv Detail & Related papers (2023-05-24T12:33:06Z)
- CoLT5: Faster Long-Range Transformers with Conditional Computation [65.83586041097763]
We show that CoLT5 achieves stronger performance than LongT5 with much faster training and inference.
CoLT5 can effectively and tractably make use of extremely long inputs, showing strong gains up to 64k input length.
arXiv Detail & Related papers (2023-03-17T03:28:17Z)
- Transformer-based Models for Long-Form Document Matching: Challenges and Empirical Analysis [12.269318291685753]
We show that simple neural models outperform the more complex BERT-based models.
Simple models are also more robust to variations in document length and text perturbations.
arXiv Detail & Related papers (2023-02-07T21:51:05Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute, with a potential speedup of up to 3x, while provably maintaining high performance (an illustrative early-exit sketch also appears after the Related papers list).
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
- Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting [25.417560221400347]
Long sequence time-series forecasting (LSTF) demands a high prediction capacity.
Recent studies have shown the potential of Transformer to increase the prediction capacity.
We design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics.
arXiv Detail & Related papers (2020-12-14T11:43:09Z)
- Rethinking embedding coupling in pre-trained language models [46.11201932668366]
We re-evaluate the standard practice of sharing weights between input and output embeddings in pre-trained language models.
We show that decoupled embeddings provide increased modeling flexibility, allowing us to significantly improve the efficiency of parameter allocation.
We are able to train models that achieve strong performance on the XTREME benchmark without increasing the number of parameters at the fine-tuning stage.
arXiv Detail & Related papers (2020-10-24T07:43:00Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
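For the FIRE entry in the list above, the abstract describes a learned, functional relative position bias with progressive interpolation. The sketch below is a hedged reconstruction of that general idea: the query-key distance is log-compressed, normalized by the (thresholded) query position, and mapped to per-head biases by a small MLP. The log transform, the learnable threshold, the MLP width, and the causal clamp are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a functional relative position bias with progressive
# interpolation. All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class FIREBias(nn.Module):
    def __init__(self, num_heads, hidden=32, init_c=0.1, init_L=512.0):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, num_heads))
        self.c = nn.Parameter(torch.tensor(init_c))   # sharpness of the log compression
        self.L = nn.Parameter(torch.tensor(init_L))   # threshold for short contexts

    def psi(self, x):
        # Monotone log transform so distant positions are compressed together.
        return torch.log(torch.abs(self.c) * x + 1.0)

    def forward(self, seq_len):
        pos = torch.arange(seq_len, dtype=torch.float32)
        rel = (pos[:, None] - pos[None, :]).clamp(min=0.0)      # causal distances i - j
        denom = self.psi(torch.maximum(pos, self.L)[:, None])   # normalize by query position
        feat = (self.psi(rel) / denom).unsqueeze(-1)            # (q, k, 1) interpolated feature
        return self.mlp(feat).permute(2, 0, 1)                  # (heads, q, k) bias


# Example: per-head biases for a 128-token causal attention matrix,
# to be added to the attention logits before the softmax.
bias = FIREBias(num_heads=8)(128)   # shape (8, 128, 128)
```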
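Similarly, for the Confident Adaptive Language Modeling entry, per-timestep compute allocation is commonly realized by early exiting from intermediate decoder layers once an intermediate prediction looks confident enough. The sketch below shows that generic pattern with a plain softmax-confidence threshold; the exit rule, the shared output head, and the toy layer stack are illustrative assumptions rather than CALM's calibrated exit criteria.

```python
# Hedged sketch of confidence-based early exiting for one decoding timestep.
# The softmax-max exit rule and toy layers are illustrative assumptions.
import torch
import torch.nn as nn


def early_exit_logits(hidden, layers, out_head, threshold=0.9):
    """hidden: (batch, dim) decoder state for the current timestep.
    layers: non-empty list of layer modules; out_head: shared vocab projection.
    Stops as soon as every item in the batch is confident enough."""
    logits = None
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        logits = out_head(hidden)
        confidence = torch.softmax(logits, dim=-1).max(dim=-1).values
        if bool((confidence > threshold).all()):
            return logits, depth          # early exit: fewer layers, less compute
    return logits, len(layers)            # no exit: full-depth computation


# Example with toy layers and a shared output head.
dim, vocab = 64, 100
layers = [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(6)]
head = nn.Linear(dim, vocab)
logits, layers_used = early_exit_logits(torch.randn(2, dim), layers, head)
```

The speedups the abstract reports come from skipping the remaining layers whenever the exit fires early; the original framework pairs this with a calibrated threshold so that performance is provably maintained.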