Simple Local Attentions Remain Competitive for Long-Context Tasks
- URL: http://arxiv.org/abs/2112.07210v1
- Date: Tue, 14 Dec 2021 07:37:58 GMT
- Title: Simple Local Attentions Remain Competitive for Long-Context Tasks
- Authors: Wenhan Xiong, Barlas Oğuz, Anchit Gupta, Xilun Chen, Diana Liskovich, Omer Levy, Wen-tau Yih, Yashar Mehdad
- Abstract summary: Many NLP tasks require processing long contexts beyond the length limit of pretrained models.
In order to scale these models to longer text sequences, many efficient long-range attention variants have been proposed.
For each attention variant, we pretrain large-size models using the same long-doc corpus and then finetune these models for real-world long-context tasks.
- Score: 32.785459927278616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many NLP tasks require processing long contexts beyond the length limit of
pretrained models. In order to scale these models to longer text sequences,
many efficient long-range attention variants have been proposed. Despite the
abundance of research along this direction, it is still difficult to gauge the
relative effectiveness of these models in practical use cases, e.g., if we
apply these models following the pretrain-and-finetune paradigm. In this work,
we aim to conduct a thorough analysis of these emerging models with large-scale
and controlled experiments. For each attention variant, we pretrain large-size
models using the same long-doc corpus and then finetune these models for
real-world long-context tasks. Our findings reveal pitfalls of an existing
widely-used long-range benchmark and show none of the tested efficient
attentions can beat a simple local window attention under standard pretraining
paradigms. Further analysis on local attention variants suggests that even the
commonly used attention-window overlap is not necessary to achieve good
downstream results -- using disjoint local attentions, we are able to build a
simpler and more efficient long-doc QA model that matches the performance of
Longformer with half of its pretraining compute.
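
To make the "disjoint local attentions" idea concrete, below is a minimal PyTorch sketch of non-overlapping block-local attention (a sketch under assumptions: the function name, tensor layout, and default block size are illustrative and not the authors' released implementation). Each token attends only to the other tokens in its own fixed-size block, so compute grows linearly with sequence length rather than quadratically.

```python
import torch
import torch.nn.functional as F

def disjoint_local_attention(q, k, v, block_size=512):
    """Non-overlapping (disjoint) block-local attention.

    q, k, v: (batch, seq_len, num_heads, head_dim), with seq_len already
    padded to a multiple of block_size. Tokens attend only within their own
    block, so cost is O(seq_len * block_size) instead of O(seq_len ** 2).
    """
    b, n, h, d = q.shape
    assert n % block_size == 0, "pad the sequence to a multiple of block_size"
    nb = n // block_size

    # Reshape to (batch, num_blocks, num_heads, block_size, head_dim):
    # each block becomes an independent attention problem.
    def to_blocks(x):
        return x.reshape(b, nb, block_size, h, d).transpose(2, 3)

    qb, kb, vb = map(to_blocks, (q, k, v))

    scores = torch.matmul(qb, kb.transpose(-1, -2)) / d ** 0.5
    probs = F.softmax(scores, dim=-1)
    out = torch.matmul(probs, vb)                    # (b, nb, h, block, d)
    return out.transpose(2, 3).reshape(b, n, h, d)   # back to (b, n, h, d)
```

By contrast, Longformer-style sliding-window attention overlaps adjacent windows so every token also sees neighbours outside its block; the abstract above reports that dropping this overlap still yields a disjoint-attention model matching Longformer with half of its pretraining compute.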
Related papers
- Long-Short Alignment for Effective Long-Context Modeling in LLMs [32.13785291956956]
Large language models (LLMs) have exhibited impressive performance and surprising emergent properties. Length generalization -- the ability to generalize to sequences longer than those seen during training -- is a classical and fundamental problem. We highlight the critical role of long-short alignment -- the consistency of output distributions across sequences of varying lengths (see the consistency-check sketch after this list).
arXiv Detail & Related papers (2025-06-13T13:25:39Z) - Intention-Conditioned Flow Occupancy Models [69.79049994662591]
Large-scale pre-training has fundamentally changed how machine learning research is done today. Applying this same framework to reinforcement learning is appealing because it offers compelling avenues for addressing core challenges in RL. Recent advances in generative AI have provided new tools for modeling highly complex distributions.
arXiv Detail & Related papers (2025-06-10T15:27:46Z) - EfficientLLaVA:Generalizable Auto-Pruning for Large Vision-language Models [64.18350535770357]
We propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning.
Our approach only leverages a small number of samples to search for the desired pruning policy.
We conduct extensive experiments on the ScienceQA, Vizwiz, MM-vet, and LLaVA-Bench datasets for the task of visual question answering.
arXiv Detail & Related papers (2025-03-19T16:07:04Z) - Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.49074603075032]
Recent studies have shown that making a model spend more time thinking through longer Chain of Thoughts (CoTs) enables it to gain significant improvements in complex reasoning tasks.
We explore whether scaling with longer CoTs can indeed impair the reasoning performance of Large Language Models (LLMs) in certain domains.
arXiv Detail & Related papers (2025-02-25T10:48:05Z) - Breaking the Context Bottleneck on Long Time Series Forecasting [6.36010639533526]
Long-term time-series forecasting is essential for planning and decision-making in economics, energy, and transportation.
Recent advancements have enhanced the efficiency of these models, but the challenge of effectively leveraging longer sequences persists.
We propose the Logsparse Decomposable Multiscaling (LDM) framework for the efficient and effective processing of long sequences.
arXiv Detail & Related papers (2024-12-21T10:29:34Z) - Forgetting Curve: A Reliable Method for Evaluating Memorization Capability for Long-context Models [58.6172667880028]
We propose a new method called forgetting curve to measure the memorization capability of long-context models.
We show that forgetting curve has the advantage of being robust to the tested corpus and the experimental settings.
Our measurement provides empirical evidence for the effectiveness of transformer extension techniques while raising questions about the effective length of RNN/SSM-based models.
arXiv Detail & Related papers (2024-10-07T03:38:27Z) - Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think [53.2706196341054]
We show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed.
We perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models.
arXiv Detail & Related papers (2024-09-17T16:58:52Z) - Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning [78.72226641279863]
Sparse Mixture of Experts (SMoE) models have emerged as a scalable alternative to dense models in language modeling.
Our research explores task-specific model pruning to inform decisions about designing SMoE architectures.
We introduce an adaptive task-aware pruning technique UNCURL to reduce the number of experts per MoE layer in an offline manner post-training.
arXiv Detail & Related papers (2024-09-02T22:35:03Z) - LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction into a length-fixed codebook.
LongVQ effectively maintains both dynamic global and local patterns, which helps compensate for the loss of long-range dependencies.
arXiv Detail & Related papers (2024-04-17T08:26:34Z) - Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z) - Opening the Black Box: Analyzing Attention Weights and Hidden States in Pre-trained Language Models for Non-language Tasks [0.8889304968879164]
We apply a pre-trained language model to constrained arithmetic problems with hierarchical structure, to analyze their attention weight scores and hidden states.
The investigation reveals promising results, with the model addressing hierarchical problems in a moderately structured manner, similar to human problem-solving strategies.
The attention analysis allows us to hypothesize that the model can generalize to longer sequences in the ListOps dataset, a conclusion later confirmed through testing on sequences longer than those in the training set.
arXiv Detail & Related papers (2023-06-21T11:48:07Z) - BudgetLongformer: Can we Cheaply Pretrain a SotA Legal Language Model From Scratch? [0.0]
We train Longformer models with the efficient replaced token detection (RTD) task on legal data to showcase that pretraining efficient LMs is possible using much less compute.
We find that both the small and base models outperform their baselines on the in-domain BillSum and out-of-domain tasks.
arXiv Detail & Related papers (2022-11-30T16:09:20Z) - Adapting Pretrained Text-to-Text Models for Long Text Sequences [39.62224414485055]
We adapt an existing pretrained text-to-text model for long-sequence inputs.
We build a long-context model that achieves competitive performance on long-text QA tasks.
arXiv Detail & Related papers (2022-09-21T00:41:07Z) - SimpleTron: Eliminating Softmax from Attention Computation [68.8204255655161]
We propose that the pairwise dot-product matching in the attention layer is redundant for model performance.
We present a simple and fast alternative without any approximation that, to the best of our knowledge, outperforms existing attention approximations on several tasks from the Long-Range Arena benchmark.
arXiv Detail & Related papers (2021-11-23T17:06:01Z)
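
As a rough illustration of the "long-short alignment" notion from the Long-Short Alignment entry above, here is a hedged sketch of one way to quantify the consistency of output distributions across context lengths (the metric name, the use of KL divergence, and the Hugging Face-style model interface are illustrative assumptions, not details taken from that paper):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def long_short_alignment_gap(model, short_ids, long_ids):
    """Illustrative consistency check: KL divergence between the model's
    next-token distributions for a short context and a longer context that
    ends with the same tokens. Smaller values mean better-aligned outputs.

    Assumes a Hugging Face-style causal LM whose forward pass returns .logits
    of shape (batch, seq_len, vocab_size); short_ids and long_ids are
    (1, seq_len) token-id tensors ending in the same span.
    """
    log_p_short = F.log_softmax(model(short_ids).logits[:, -1, :], dim=-1)
    log_p_long = F.log_softmax(model(long_ids).logits[:, -1, :], dim=-1)
    # KL(short || long) over the vocabulary at the final position.
    return F.kl_div(log_p_long, log_p_short, log_target=True,
                    reduction="batchmean").item()
```

A model with good long-short alignment should produce a small gap even when the long context is far beyond the lengths seen during training.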