Long-Short Alignment for Effective Long-Context Modeling in LLMs
- URL: http://arxiv.org/abs/2506.11769v1
- Date: Fri, 13 Jun 2025 13:25:39 GMT
- Title: Long-Short Alignment for Effective Long-Context Modeling in LLMs
- Authors: Tianqi Du, Haotian Huang, Yifei Wang, Yisen Wang
- Abstract summary: Large language models (LLMs) have exhibited impressive performance and surprising emergent properties. Length generalization -- the ability to generalize to sequences longer than those seen during training -- is a classical and fundamental problem. We highlight the critical role of long-short alignment -- the consistency of output distributions across sequences of varying lengths.
- Score: 32.13785291956956
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have exhibited impressive performance and surprising emergent properties. However, their effectiveness remains limited by the fixed context window of the transformer architecture, posing challenges for long-context modeling. Among these challenges, length generalization -- the ability to generalize to sequences longer than those seen during training -- is a classical and fundamental problem. In this work, we propose a fresh perspective on length generalization, shifting the focus from the conventional emphasis on input features such as positional encodings or data structures to the output distribution of the model. Specifically, through case studies on synthetic tasks, we highlight the critical role of long-short alignment -- the consistency of output distributions across sequences of varying lengths. Extending this insight to natural language tasks, we propose a metric called Long-Short Misalignment to quantify this phenomenon, uncovering a strong correlation between the metric and length generalization performance. Building on these findings, we develop a regularization term that promotes long-short alignment during training. Extensive experiments validate the effectiveness of our approach, offering new insights for achieving more effective long-context modeling in LLMs. Code is available at https://github.com/PKU-ML/LongShortAlignment.
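To make the output-distribution view concrete, here is a minimal, hypothetical PyTorch sketch of a long-short alignment regularizer for a HuggingFace-style causal LM. The choice of divergence (symmetric KL), the truncation to the last `short_len` tokens, and the weight `lam` are illustrative assumptions, not the paper's exact Long-Short Misalignment definition; the authors' implementation is in the linked repository.

```python
# Hypothetical sketch of a long-short alignment regularizer (not the authors'
# exact formulation). Assumes a HuggingFace-style causal LM whose forward pass
# returns logits of shape [batch, seq_len, vocab].
import torch
import torch.nn.functional as F

def long_short_misalignment(model, input_ids, short_len):
    """Compare the model's next-token distribution at the final position when it
    sees the full (long) context vs. only the last `short_len` tokens (short
    context). Returns a symmetric-KL misalignment score."""
    long_logits = model(input_ids).logits[:, -1, :]                    # full context
    short_logits = model(input_ids[:, -short_len:]).logits[:, -1, :]   # truncated context

    p = F.log_softmax(long_logits, dim=-1)
    q = F.log_softmax(short_logits, dim=-1)
    # Symmetric KL between the two predictive distributions (one possible choice
    # of divergence; the paper's metric may use a different distance).
    kl_pq = F.kl_div(q, p, log_target=True, reduction="batchmean")  # KL(P || Q)
    kl_qp = F.kl_div(p, q, log_target=True, reduction="batchmean")  # KL(Q || P)
    return 0.5 * (kl_pq + kl_qp)

def training_loss(model, input_ids, labels, short_len=512, lam=0.1):
    """Standard LM loss plus the alignment regularizer (`lam` is illustrative)."""
    lm_loss = model(input_ids, labels=labels).loss
    reg = long_short_misalignment(model, input_ids, short_len)
    return lm_loss + lam * reg
```

In a full training setup one would likely average such a term over several positions or truncation lengths rather than only the final token; the single-position version above is kept deliberately small.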
Related papers
- Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models [74.15250326312179]
Diffusion Large Language Models (DLLMs) offer efficient parallel generation and strong global modeling. Their broader application is hindered by the need for a statically predefined generation length. We introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion.
arXiv Detail & Related papers (2025-08-01T17:56:07Z)
- SEAL: Scaling to Emphasize Attention for Long-Context Retrieval [8.805524738976075]
We introduce a novel approach called Scaling to Emphasize Attention for Long-context retrieval (SEAL). We observe that specific attention heads are closely tied to long-context retrieval, showing positive or negative correlation with retrieval scores. We propose a learning-based mechanism that leverages generated data to emphasize these heads.
arXiv Detail & Related papers (2025-01-25T14:09:39Z)
- Breaking the Context Bottleneck on Long Time Series Forecasting [6.36010639533526]
Long-term time-series forecasting is essential for planning and decision-making in economics, energy, and transportation. Recent advancements have enhanced the efficiency of these models, but the challenge of effectively leveraging longer sequences persists. We propose the Logsparse Decomposable Multiscaling (LDM) framework for the efficient and effective processing of long sequences.
arXiv Detail & Related papers (2024-12-21T10:29:34Z)
- What is Wrong with Perplexity for Long-context Language Modeling? [71.34933096461124]
Long-context inputs are crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. Perplexity (PPL) has proven unreliable for assessing long-context capabilities. We propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them (a rough sketch of this contrastive idea follows the related-papers list).
arXiv Detail & Related papers (2024-10-31T09:39:28Z)
- A Controlled Study on Long Context Extension and Generalization in LLMs [85.4758128256142]
Broad textual understanding and in-context learning require language models that utilize full document contexts.
Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts.
We implement a controlled protocol for extension methods with a standardized evaluation, utilizing consistent base models and extension data.
arXiv Detail & Related papers (2024-09-18T17:53:17Z)
- Long Context Alignment with Short Instructions and Synthesized Positions [56.1267385315404]
This paper introduces Step-Skipping Alignment (SkipAlign), a new technique designed to enhance the long-context capabilities of Large Language Models (LLMs).
With a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves its best performance, comparable with strong baselines like GPT-3.5-Turbo-16K on LongBench.
arXiv Detail & Related papers (2024-05-07T01:56:22Z)
- CLEX: Continuous Length Extrapolation for Large Language Models [68.43814043853347]
We propose Continuous Length EXtrapolation (CLEX) for Large Language Models (LLMs).
CLEX extends the context window to over 4x or almost 8x training length, with no deterioration in performance.
Our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k.
arXiv Detail & Related papers (2023-10-25T08:13:02Z)
- Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z)
- Simple Local Attentions Remain Competitive for Long-Context Tasks [32.785459927278616]
Many NLP tasks require processing long contexts beyond the length limit of pretrained models.
In order to scale these models to longer text sequences, many efficient long-range attention variants have been proposed.
For each attention variant, we pretrain large-size models using the same long-doc corpus and then finetune these models for real-world long-context tasks.
arXiv Detail & Related papers (2021-12-14T07:37:58Z)
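As a companion illustration of the long-short contrastive idea (used by LongPPL above and related in spirit to the main paper's misalignment metric), the following hypothetical sketch marks tokens whose log-probability improves markedly when the long context is available and computes perplexity only over those key tokens. The truncation length and gain threshold are assumed values, not the LongPPL paper's settings.

```python
# Hypothetical sketch of a LongPPL-style metric: perplexity restricted to tokens
# whose prediction clearly benefits from long context. The short-context length
# and gain threshold are illustrative assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def long_ppl(model, input_ids, short_len=512, gain_threshold=2.0):
    # Token-level log-probs under the full (long) context.
    long_logits = model(input_ids).logits[:, :-1, :]
    targets = input_ids[:, 1:]
    long_logp = F.log_softmax(long_logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Log-probs for the tail of the sequence under a truncated (short) context.
    short_ids = input_ids[:, -short_len:]
    short_logits = model(short_ids).logits[:, :-1, :]
    short_targets = short_ids[:, 1:]
    short_logp = F.log_softmax(short_logits, dim=-1).gather(-1, short_targets.unsqueeze(-1)).squeeze(-1)

    # Compare the two on the overlapping positions; "key" tokens are those whose
    # log-prob gain from seeing the long context exceeds the threshold.
    tail_long_logp = long_logp[:, -(short_len - 1):]
    gain = tail_long_logp - short_logp
    key_mask = gain > gain_threshold

    # Perplexity over key tokens only (fall back to all tail tokens if none qualify).
    selected = tail_long_logp[key_mask] if key_mask.any() else tail_long_logp.flatten()
    return torch.exp(-selected.mean())
```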