Simple Local Attentions Remain Competitive for Long-Context Tasks
- URL: http://arxiv.org/abs/2112.07210v1
- Date: Tue, 14 Dec 2021 07:37:58 GMT
- Title: Simple Local Attentions Remain Competitive for Long-Context Tasks
- Authors: Wenhan Xiong, Barlas Oğuz, Anchit Gupta, Xilun Chen, Diana Liskovich, Omer Levy, Wen-tau Yih, Yashar Mehdad
- Abstract summary: Many NLP tasks require processing long contexts beyond the length limit of pretrained models.
In order to scale these models to longer text sequences, many efficient long-range attention variants have been proposed.
For each attention variant, we pretrain large-size models using the same long-doc corpus and then finetune these models for real-world long-context tasks.
- Score: 32.785459927278616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many NLP tasks require processing long contexts beyond the length limit of
pretrained models. In order to scale these models to longer text sequences,
many efficient long-range attention variants have been proposed. Despite the
abundance of research along this direction, it is still difficult to gauge the
relative effectiveness of these models in practical use cases, e.g., if we
apply these models following the pretrain-and-finetune paradigm. In this work,
we aim to conduct a thorough analysis of these emerging models with large-scale
and controlled experiments. For each attention variant, we pretrain large-size
models using the same long-doc corpus and then finetune these models for
real-world long-context tasks. Our findings reveal pitfalls of an existing
widely-used long-range benchmark and show none of the tested efficient
attentions can beat a simple local window attention under standard pretraining
paradigms. Further analysis on local attention variants suggests that even the
commonly used attention-window overlap is not necessary to achieve good
downstream results -- using disjoint local attentions, we are able to build a
simpler and more efficient long-doc QA model that matches the performance of
Longformer with half of its pretraining compute.
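As a concrete illustration of the disjoint local attention the abstract describes, here is a minimal PyTorch sketch; the block size, tensor shapes, and function name are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn.functional as F

def disjoint_local_attention(q, k, v, block_size=512):
    # Self-attention restricted to non-overlapping local blocks.
    # q, k, v: (batch, seq_len, dim); seq_len is assumed to be a
    # multiple of block_size (pad the inputs otherwise).
    # Each token attends only to tokens inside its own block, so the
    # cost is O(seq_len * block_size) rather than O(seq_len ** 2).
    b, n, d = q.shape
    assert n % block_size == 0, "pad seq_len to a multiple of block_size"
    nb = n // block_size
    # Reshape into disjoint blocks: (batch, num_blocks, block_size, dim)
    q = q.view(b, nb, block_size, d)
    k = k.view(b, nb, block_size, d)
    v = v.view(b, nb, block_size, d)
    # Scaled dot-product attention computed within each block only
    scores = torch.matmul(q, k.transpose(-1, -2)) / d ** 0.5
    probs = F.softmax(scores, dim=-1)
    out = torch.matmul(probs, v)
    return out.view(b, n, d)
```

Unlike Longformer's overlapping sliding window, these blocks share no context across their boundaries; the abstract's finding is that, under standard pretraining, this simpler pattern still matches Longformer on long-document QA with half the pretraining compute.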
Related papers
- CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling [52.404072802235234]
We introduce Chunked Instruction-aware State Eviction (CItruS), a modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states.
Our training-free method exhibits superior performance on long-sequence comprehension and retrieval tasks over several strong baselines under the same memory budget (a minimal eviction sketch appears after this list).
arXiv Detail & Related papers (2024-06-17T18:34:58Z)
- LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction into a fixed-length codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps mitigate the loss of long-range dependencies.
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
- Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z)
- Opening the Black Box: Analyzing Attention Weights and Hidden States in Pre-trained Language Models for Non-language Tasks [0.8889304968879164]
We apply a pre-trained language model to constrained arithmetic problems with hierarchical structure to analyze its attention weights and hidden states.
The investigation reveals promising results, with the model addressing hierarchical problems in a moderately structured manner, similar to human problem-solving strategies.
The attention analysis lets us hypothesize that the model can generalize to longer sequences in the ListOps dataset, a conclusion later confirmed by testing on sequences longer than those in the training set.
arXiv Detail & Related papers (2023-06-21T11:48:07Z)
- BudgetLongformer: Can we Cheaply Pretrain a SotA Legal Language Model From Scratch? [0.0]
We train Longformer models with the efficient replaced token detection (RTD) task on legal data to show that pretraining efficient LMs is possible using much less compute.
We find that both the small and base models outperform their baselines on the in-domain BillSum and out-of-domain tasks.
arXiv Detail & Related papers (2022-11-30T16:09:20Z)
- Adapting Pretrained Text-to-Text Models for Long Text Sequences [39.62224414485055]
We adapt an existing pretrained text-to-text model for long-sequence inputs.
We build a long-context model that achieves competitive performance on long-text QA tasks.
arXiv Detail & Related papers (2022-09-21T00:41:07Z)
- Deep Generative model with Hierarchical Latent Factors for Time Series Anomaly Detection [40.21502451136054]
This work presents DGHL, a new family of generative models for time series anomaly detection.
A top-down Convolution Network maps a novel hierarchical latent space to time series windows, exploiting temporal dynamics to encode information efficiently.
Our method outperformed current state-of-the-art models on four popular benchmark datasets.
arXiv Detail & Related papers (2022-02-15T17:19:44Z)
- SimpleTron: Eliminating Softmax from Attention Computation [68.8204255655161]
We propose that the pairwise dot-product matching in the attention layer is redundant for model performance.
We present a simple and fast alternative without any approximation that, to the best of our knowledge, outperforms existing attention approximations on several tasks from the Long-Range Arena benchmark (a softmax-free attention sketch appears after this list).
arXiv Detail & Related papers (2021-11-23T17:06:01Z)
- On Model Calibration for Long-Tailed Object Detection and Instance Segmentation [56.82077636126353]
We propose NorCal, Normalized Calibration for long-tailed object detection and instance segmentation.
We show that separately handling the background class and normalizing the scores over classes for each proposal are key to achieving superior performance.
arXiv Detail & Related papers (2021-07-05T17:57:20Z)
- Goal-directed Generation of Discrete Structures with Conditional Generative Models [85.51463588099556]
We introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward.
We test our methodology on two tasks: generating molecules with user-defined properties and identifying short Python expressions which evaluate to a given target value.
arXiv Detail & Related papers (2020-10-05T20:03:13Z)
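The eviction idea behind CItruS, referenced above, can be sketched as follows; the scoring rule (total attention mass from recent instruction tokens) and all names here are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def evict_states(keys, values, attn_scores, budget):
    # Keep only the `budget` cached key/value states that received the
    # most attention, as a proxy for usefulness to the downstream task.
    # keys, values: (num_cached, dim) cached hidden states.
    # attn_scores: (num_queries, num_cached) attention weights from
    # recent (e.g., instruction) tokens over the cache.
    if keys.size(0) <= budget:
        return keys, values
    importance = attn_scores.sum(dim=0)  # total attention per cached state
    keep = importance.topk(budget).indices.sort().values  # keep original order
    return keys[keep], values[keep]
```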
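For the SimpleTron entry above, the general idea of removing softmax can be illustrated with a generic linear-attention sketch: once the softmax is gone, matrix-multiplication associativity avoids materializing the (seq_len x seq_len) score matrix. This is a textbook illustration of softmax-free attention, not SimpleTron's exact formulation.

```python
import torch

def softmax_free_attention(q, k, v):
    # q, k, v: (batch, seq_len, dim). Without softmax,
    # (Q @ K^T) @ V can be computed as Q @ (K^T @ V), which is linear
    # in seq_len. The 1/dim factor is an ad-hoc normalizer for the sketch.
    d = q.size(-1)
    kv = torch.einsum('bnd,bne->bde', k, v) / d  # (batch, dim, dim)
    return torch.einsum('bnd,bde->bne', q, kv)   # (batch, seq_len, dim)
```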