Do Long-Range Language Models Actually Use Long-Range Context?
- URL: http://arxiv.org/abs/2109.09115v1
- Date: Sun, 19 Sep 2021 12:49:43 GMT
- Title: Do Long-Range Language Models Actually Use Long-Range Context?
- Authors: Simeng Sun, Kalpesh Krishna, Andrew Mattarella-Micke, Mohit Iyyer
- Abstract summary: Language models are generally trained on short, truncated input sequences.
Recent efforts to improve the efficiency of self-attention have led to a proliferation of long-range Transformer language models.
- Score: 27.084888397778823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models are generally trained on short, truncated input sequences,
which limits their ability to use discourse-level information present in
long-range context to improve their predictions. Recent efforts to improve the
efficiency of self-attention have led to a proliferation of long-range
Transformer language models, which can process much longer sequences than
models of the past. However, the ways in which such models take advantage of
the long-range context remain unclear. In this paper, we perform a fine-grained
analysis of two long-range Transformer language models (including the
\emph{Routing Transformer}, which achieves state-of-the-art perplexity on the
PG-19 long-sequence LM benchmark dataset) that accept input sequences of up to
8K tokens. Our results reveal that providing long-range context (i.e., beyond
the previous 2K tokens) to these models only improves their predictions on a
small set of tokens (e.g., those that can be copied from the distant context)
and does not help at all for sentence-level prediction tasks. Finally, we
discover that PG-19 contains a variety of different document types and domains,
and that long-range context helps most for literary novels (as opposed to
textbooks or magazines).
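To make the analysis concrete, the sketch below (a hypothetical illustration, not the authors' released code) compares per-token negative log-likelihood under a long prefix versus a 2K-token prefix; tokens whose loss drops with the longer prefix are the ones that actually benefit from long-range context. The model identifier and the input file "book.txt" are placeholders for an 8K-context language model and a PG-19-style document.

```python
# Illustrative sketch: measure, token by token, how much an ~8K-token prefix
# helps compared to an ~2K-token prefix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-8k-context-causal-lm"   # placeholder identifier
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def per_token_nll(context_ids: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of each target token given the preceding context."""
    input_ids = torch.cat([context_ids, target_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(input_ids).logits[0]
    start = context_ids.size(0)
    preds = logits[start - 1 : start - 1 + target_ids.size(0)]  # logits[t] predicts token t+1
    return torch.nn.functional.cross_entropy(preds, target_ids, reduction="none")

ids = tok(open("book.txt").read(), return_tensors="pt").input_ids[0]  # placeholder document
target = ids[-512:]                       # tokens being evaluated
long_ctx = ids[-(512 + 7680):-512]        # ~8K tokens of preceding context
short_ctx = ids[-(512 + 1536):-512]       # ~2K tokens of preceding context

delta = per_token_nll(short_ctx, target) - per_token_nll(long_ctx, target)
helped = (delta > 0.1).float().mean().item()   # positive delta = long context helped
print(f"fraction of tokens noticeably helped by long-range context: {helped:.2%}")
```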
Related papers
- Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts.
We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, ALiBi, and KERPLE.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
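For intuition, here is a minimal, hedged sketch of a FIRE-style functional relative position bias: an MLP maps a progressively normalized query-key distance to a per-head additive attention bias. The log transform, the threshold L, and the MLP width are assumptions; the paper's exact parameterization may differ.

```python
# Hedged sketch of a functional relative position bias with progressive
# normalization (simplified; not necessarily FIRE's exact form).
import torch
import torch.nn as nn

class FunctionalRelativeBias(nn.Module):
    def __init__(self, num_heads: int, hidden_dim: int = 32, c: float = 1.0, L: float = 512.0):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, num_heads))
        self.c = c   # slope of the log transform (assumed)
        self.L = L   # threshold below which positions are not rescaled (assumed)

    def psi(self, x: torch.Tensor) -> torch.Tensor:
        return torch.log(self.c * x + 1.0)

    def forward(self, seq_len: int) -> torch.Tensor:
        i = torch.arange(seq_len).float().unsqueeze(1)   # query positions
        j = torch.arange(seq_len).float().unsqueeze(0)   # key positions
        dist = (i - j).clamp(min=0.0)                    # causal relative distance
        norm = self.psi(torch.maximum(i, torch.full_like(i, self.L)))
        x = (self.psi(dist) / norm).unsqueeze(-1)        # progressive normalization, (seq, seq, 1)
        return self.mlp(x).permute(2, 0, 1)              # (heads, seq, seq) additive bias

bias = FunctionalRelativeBias(num_heads=8)(seq_len=1024)  # add to attention logits before softmax
```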
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
- Long-Range Transformer Architectures for Document Understanding [1.9331361036118608]
Document Understanding (DU) was not left behind, with the first Transformer-based models for DU dating from late 2019.
We introduce two new multi-modal (text + layout) long-range models for DU based on efficient implementations of Transformers for long sequences.
Relative 2D attention proved effective on dense text for both normal and long-range models.
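As a generic illustration of relative 2D attention over layout coordinates (a common pattern in text-plus-layout transformers, not necessarily this paper's exact implementation; the bucket count and distance scaling are assumptions):

```python
# Hedged sketch: additive attention bias from bucketed relative x/y offsets
# between token bounding boxes.
import torch
import torch.nn as nn

class Relative2DBias(nn.Module):
    def __init__(self, num_heads: int, num_buckets: int = 32, max_dist: int = 1000):
        super().__init__()
        self.x_bias = nn.Embedding(2 * num_buckets + 1, num_heads)
        self.y_bias = nn.Embedding(2 * num_buckets + 1, num_heads)
        self.num_buckets = num_buckets
        self.max_dist = max_dist

    def bucket(self, rel: torch.Tensor) -> torch.Tensor:
        # map signed layout offsets to a bounded set of buckets (linear spacing here)
        scaled = (rel.float() / self.max_dist * self.num_buckets).round().long()
        return scaled.clamp(-self.num_buckets, self.num_buckets) + self.num_buckets

    def forward(self, xs: torch.Tensor, ys: torch.Tensor) -> torch.Tensor:
        # xs, ys: (seq,) coordinates of each token's bounding box
        rel_x = xs.unsqueeze(0) - xs.unsqueeze(1)
        rel_y = ys.unsqueeze(0) - ys.unsqueeze(1)
        bias = self.x_bias(self.bucket(rel_x)) + self.y_bias(self.bucket(rel_y))
        return bias.permute(2, 0, 1)   # (heads, seq, seq), added to attention logits
```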
arXiv Detail & Related papers (2023-09-11T14:45:24Z)
- YaRN: Efficient Context Window Extension of Large Language Models [1.024113475677323]
Rotary Position Embeddings (RoPE) have been shown to effectively encode positional information in transformer-based language models.
We present YaRN, a compute-efficient method to extend the context window of such models.
We show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow.
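As rough intuition for this family of methods: rotary position frequencies are rescaled so that positions beyond the pre-training range map back into it. The sketch below shows plain RoPE with a simple linear position-interpolation factor; YaRN itself uses a more refined per-frequency scheme plus an attention temperature, so treat this only as a simplified illustration.

```python
# Simplified illustration: RoPE with a linear position-interpolation scale,
# not YaRN's exact frequency rescaling.
import torch

def rope_angles(seq_len: int, head_dim: int, base: float = 10000.0, scale: float = 1.0) -> torch.Tensor:
    # scale > 1 squeezes positions so a longer window reuses the trained range
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() / scale
    return torch.outer(positions, inv_freq)          # (seq, head_dim // 2)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    # x: (seq, head_dim); rotate each consecutive pair of dimensions
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(8192, 64)                            # queries at 8K positions
q_rot = apply_rope(q, rope_angles(8192, 64, scale=8192 / 2048))  # reuse a 2K-trained range
```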
arXiv Detail & Related papers (2023-08-31T18:18:07Z)
- LongNet: Scaling Transformers to 1,000,000,000 Tokens [146.4077038371075]
LongNet is a Transformer variant that can scale sequence length to more than 1 billion tokens.
Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
arXiv Detail & Related papers (2023-07-05T17:59:38Z)
- HiPool: Modeling Long Documents Using Graph Neural Networks [24.91040673099863]
Long sequences are a challenging problem in Natural Language Processing (NLP).
Recent pretrained language models achieve satisfying performance on many NLP tasks.
We propose a new challenging benchmark, totaling six datasets with up to 53k samples and an average length of 4034 tokens.
arXiv Detail & Related papers (2023-05-05T06:58:24Z)
- Finding the Needle in a Haystack: Unsupervised Rationale Extraction from Long Text Classifiers [20.10172411803626]
We propose a compositional soft attention architecture that applies RoBERTa sentence-wise to extract plausible rationales at the token-level.
We find this method to significantly outperform Longformer-driven baselines on sentiment classification datasets.
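A rough sketch of the compositional idea (not the authors' implementation): encode each sentence independently, attend over sentence representations to classify the document, and combine sentence-level and within-sentence token attention into token-level rationale scores. The toy embedding encoder below stands in for RoBERTa.

```python
# Hedged sketch of hierarchical soft attention for rationale extraction.
import torch
import torch.nn as nn

class HierarchicalRationaleClassifier(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 128, num_classes: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # stand-in for a RoBERTa sentence encoder
        self.token_attn = nn.Linear(dim, 1)          # attention over tokens within a sentence
        self.sent_attn = nn.Linear(dim, 1)           # attention over sentence representations
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, sentences: list[torch.Tensor]):
        sent_vecs, token_weights = [], []
        for sent_ids in sentences:                          # each: (sent_len,) token ids
            h = self.embed(sent_ids)
            a = torch.softmax(self.token_attn(h).squeeze(-1), dim=0)
            sent_vecs.append((a.unsqueeze(-1) * h).sum(0))
            token_weights.append(a)
        sents = torch.stack(sent_vecs)                      # (num_sents, dim)
        s = torch.softmax(self.sent_attn(sents).squeeze(-1), dim=0)
        logits = self.classifier((s.unsqueeze(-1) * sents).sum(0))
        # token-level rationale score = sentence weight * within-sentence token weight
        rationales = [s[i] * token_weights[i] for i in range(len(sentences))]
        return logits, rationales
```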
arXiv Detail & Related papers (2023-03-14T15:45:35Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Modeling Context With Linear Attention for Scalable Document-Level Translation [72.41955536834702]
We investigate the efficacy of a recent linear attention model on document translation and augment it with a sentential gate to promote a recency inductive bias.
We show that sentential gating further improves translation quality on IWSLT.
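One plausible reading of the sentential gate, sketched below in hedged form: causal linear attention whose running key-value state is decayed by a learned gate at each sentence boundary, down-weighting distant sentences (the paper's gate may be parameterized differently).

```python
# Hedged sketch: causal linear attention with a per-sentence decay gate.
import torch
import torch.nn as nn

def elu_feature(x: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.elu(x) + 1.0      # positive feature map

class GatedLinearAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Parameter(torch.tensor(0.0))   # sigmoid(gate) in (0, 1)

    def forward(self, x: torch.Tensor, sent_end: torch.Tensor) -> torch.Tensor:
        # x: (seq, dim); sent_end: (seq,) bool, True at the last token of each sentence
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = elu_feature(q), elu_feature(k)
        state = torch.zeros(x.size(-1), x.size(-1))   # running sum of outer(k, v)
        norm = torch.zeros(x.size(-1))                # running sum of k
        g = torch.sigmoid(self.gate)
        outputs = []
        for t in range(x.size(0)):
            state = state + torch.outer(k[t], v[t])
            norm = norm + k[t]
            outputs.append(state.T @ q[t] / (norm @ q[t] + 1e-6))
            if sent_end[t]:                           # decay memory of finished sentences
                state, norm = g * state, g * norm
        return torch.stack(outputs)
```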
arXiv Detail & Related papers (2022-10-16T03:41:50Z)
- Better Language Model with Hypernym Class Prediction [101.8517004687825]
Class-based language models (LMs) have long been devised to address context sparsity in $n$-gram LMs.
In this study, we revisit this approach in the context of neural LMs.
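To make the class-based idea concrete, here is a minimal sketch of a two-step factorization p(w | h) = p(class | h) * p(w | class, h), with classes taken from hypernyms (e.g., WordNet). The factorized softmax and the word-to-class mapping are illustrative assumptions, not the paper's exact training setup.

```python
# Illustrative class-factorized output head for a neural LM (assumed setup).
import torch
import torch.nn as nn

class ClassFactorizedHead(nn.Module):
    def __init__(self, hidden: int, num_classes: int, vocab_size: int, word_to_class: torch.Tensor):
        super().__init__()
        self.class_head = nn.Linear(hidden, num_classes)
        self.word_head = nn.Linear(hidden, vocab_size)
        self.register_buffer("word_to_class", word_to_class)   # (vocab,) hypernym class per word

    def log_prob(self, h: torch.Tensor, word: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden) hidden states; word: (batch,) target word ids
        cls = self.word_to_class[word]
        log_p_class = torch.log_softmax(self.class_head(h), dim=-1).gather(1, cls.unsqueeze(1))
        word_logits = self.word_head(h)
        same_class = self.word_to_class.unsqueeze(0) == cls.unsqueeze(1)    # (batch, vocab)
        word_logits = word_logits.masked_fill(~same_class, float("-inf"))   # restrict to the class
        log_p_word = torch.log_softmax(word_logits, dim=-1).gather(1, word.unsqueeze(1))
        return (log_p_class + log_p_word).squeeze(1)
```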
arXiv Detail & Related papers (2022-03-21T01:16:44Z)
- Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models [62.41139712595334]
We propose a novel pre-training paradigm for Chinese -- Lattice-BERT.
We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers.
We show that our model can bring an average increase of 1.5% under the 12-layer setting.
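A hedged sketch of the lattice construction: every character plus every lexicon word matched in the sentence becomes a text unit with start and end character positions, giving the transformer overlapping multi-granularity inputs. The toy lexicon and the maximum word length are assumptions; position-embedding and masking details are omitted.

```python
# Hedged sketch: build multi-granularity lattice units (characters + matched
# lexicon words) with character-span positions.
from dataclasses import dataclass

@dataclass
class LatticeUnit:
    text: str
    start: int   # start character index
    end: int     # end character index (inclusive)

def build_lattice(sentence: str, lexicon: set[str], max_word_len: int = 4) -> list[LatticeUnit]:
    units = [LatticeUnit(ch, i, i) for i, ch in enumerate(sentence)]      # character units
    for i in range(len(sentence)):                                        # word units from lexicon matches
        for j in range(i + 2, min(i + max_word_len, len(sentence)) + 1):
            if sentence[i:j] in lexicon:
                units.append(LatticeUnit(sentence[i:j], i, j - 1))
    return units

# e.g. build_lattice("研究生命起源", {"研究", "研究生", "生命", "起源"}) yields the six
# characters plus four overlapping words; each unit's (start, end) span can drive
# lattice position embeddings and an attention mask over compatible units.
```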
arXiv Detail & Related papers (2021-04-15T02:36:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.