LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation
- URL: http://arxiv.org/abs/2502.07365v2
- Date: Wed, 19 Feb 2025 10:49:24 GMT
- Title: LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation
- Authors: Zican Dong, Junyi Li, Jinhao Jiang, Mingyu Xu, Wayne Xin Zhao, Bingning Wang, Weipeng Chen
- Abstract summary: Long Context Pre-training with Restoration Distillation (LongReD) distills the hidden states of selected layers from the original model on short texts. Experiments on common text benchmarks demonstrate that LongReD effectively preserves the model's short-text performance.
- Score: 79.90766312484489
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have gained extended context windows through scaling positional encodings and lightweight continual pre-training. However, this often leads to degraded performance on short-text tasks, while the reasons for this degradation remain insufficiently explored. In this work, we identify two primary factors contributing to this issue: distribution drift in hidden states and attention scores, and catastrophic forgetting during continual pre-training. To address these challenges, we propose Long Context Pre-training with Restoration Distillation (LongReD), a novel approach designed to mitigate short-text performance degradation through minimizing the distribution discrepancy between the extended and original models. Besides training on long texts, LongReD distills the hidden state of selected layers from the original model on short texts. Additionally, LongReD also introduces a short-to-long distillation, aligning the output distribution on short texts with that on long texts by leveraging skipped positional indices. Experiments on common text benchmarks demonstrate that LongReD effectively preserves the model's short-text performance while maintaining comparable or even better capacity to handle long texts than baselines. Our code is available at https://github.com/RUCAIBox/LongReD.
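The abstract names two auxiliary training signals on short texts beyond the usual long-text language modeling: distilling hidden states of selected layers from the original (pre-extension) model, and a short-to-long distillation that aligns output distributions under skipped positional indices. The PyTorch sketch below is a minimal illustration of how such losses could look; the layer indices, temperature, loss weighting, and the `skipped_position_ids` helper are illustrative assumptions rather than the paper's exact recipe (the authors' implementation is in the linked repository).

```python
import torch
import torch.nn.functional as F

def hidden_state_distillation_loss(student_hiddens, teacher_hiddens, layers=(8, 16, 24)):
    """MSE between hidden states of selected layers on the same short-text batch.

    student_hiddens / teacher_hiddens: sequences of [batch, seq, dim] tensors,
    e.g. the `hidden_states` returned with output_hidden_states=True.
    The teacher is the original (pre-extension) model; layer choice is an assumption.
    """
    loss = 0.0
    for layer in layers:
        loss = loss + F.mse_loss(student_hiddens[layer], teacher_hiddens[layer].detach())
    return loss / len(layers)

def short_to_long_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence aligning the extended model's output distribution on short
    texts (encoded with skipped positional indices) with the original model's
    distribution under normal positions.
    """
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def skipped_position_ids(seq_len, target_len, device="cpu"):
    """Hypothetical helper: spread seq_len positions over a longer range
    (0 .. target_len - 1) so short inputs exercise large positional indices."""
    return torch.linspace(0, target_len - 1, steps=seq_len, device=device).long().unsqueeze(0)
```

A training step would then combine the standard next-token loss on long texts with these auxiliary losses on short-text batches, weighted by hyperparameters tuned in the paper.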
Related papers
- Lost-in-the-Middle in Long-Text Generation: Synthetic Dataset, Evaluation Framework, and Mitigation [22.0671489874715]
Long-text generation methods primarily concentrate on producing lengthy texts from short inputs.
As the input grows in length, existing methods inevitably encounter the "lost-in-the-middle" phenomenon.
We develop the Retrieval-Augmented Long-Text Writer (RAL-Writer) which retrieves and restates important yet overlooked content.
arXiv Detail & Related papers (2025-03-10T02:44:36Z) - LongEval: A Comprehensive Analysis of Long-Text Generation Through a Plan-based Paradigm [21.661578831520963]
Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks.
Our analysis reveals that current LLMs struggle with length requirements and information density in long-text generation.
We present LongEval, a benchmark that evaluates long-text generation through both direct and plan-based generation paradigms.
arXiv Detail & Related papers (2025-02-26T12:46:36Z) - NExtLong: Toward Effective Long-Context Training without Long Documents [28.002824369635768]
We propose NExtLong, a novel framework for synthesizing long-context data through Negative document Extension.
NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora.
Extensive experiments demonstrate that NExtLong achieves significant performance improvements compared to existing long-context synthesis approaches.
arXiv Detail & Related papers (2025-01-22T10:01:54Z) - Length-Induced Embedding Collapse in Transformer-based Models [7.127156731612495]
We find that performance degradation is due to a phenomenon called Length Collapse, where longer text embeddings collapse into a narrow space.
This collapse results in a distributional inconsistency between embeddings of different text lengths, hurting the performance of downstream tasks.
We propose to mitigate this undesirable length collapse by introducing a temperature into softmax(), which achieves a higher low-pass filter attenuation rate.
arXiv Detail & Related papers (2024-10-31T17:55:36Z) - Language Models can Self-Lengthen to Generate Long Texts [74.96074422345806]
This paper introduces an innovative iterative training framework called Self-Lengthen.
It leverages only the intrinsic knowledge and skills of Large Language Models without the need for auxiliary data or proprietary models.
Experiments on benchmarks and human evaluations show that Self-Lengthen outperforms existing methods in long-text generation.
arXiv Detail & Related papers (2024-10-31T13:47:10Z) - LoTLIP: Improving Language-Image Pre-training for Long Text Understanding [71.04947115945349]
We relabel the data with long captions; however, directly learning from them may degrade the model's understanding of short text.
We then help the model catch up to its original level of short-text understanding while greatly enhancing its long-text understanding.
Our method demonstrates superior performance in long-text-image retrieval tasks.
arXiv Detail & Related papers (2024-10-07T17:52:56Z) - Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA [71.04146366608904]
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows.
We propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA)
Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
arXiv Detail & Related papers (2024-06-25T09:42:56Z) - SirLLM: Streaming Infinite Retentive LLM [74.40196814292426]
The ability of Large Language Models (LLMs) to process inputs of any length and maintain a degree of memory is increasingly important.
Recent efforts have employed streaming inputs to alleviate the pressure of excessively long text inputs.
We introduce Streaming Infinite Retentive LLM (SirLLM), which allows LLMs to maintain longer memory during infinite-length dialogues.
arXiv Detail & Related papers (2024-05-21T06:37:03Z) - LongAlign: A Recipe for Long Context Alignment of Large Language Models [61.85923382850057]
LongAlign is a recipe for instruction data construction, training, and evaluation for long context alignment.
We construct a long instruction-following dataset using Self-Instruct.
We adopt packing and sorted batching strategies to speed up supervised fine-tuning on data with varied length distributions; a simplified packing sketch appears after this list.
arXiv Detail & Related papers (2024-01-31T18:29:39Z) - Extending Context Window of Large Language Models via Semantic Compression [21.35020344956721]
Large Language Models (LLMs) often impose limitations on the length of the text input to ensure the generation of fluent and relevant responses.
We propose a novel semantic compression method that enables generalization to texts 6-8 times longer, without incurring significant computational costs or requiring fine-tuning.
arXiv Detail & Related papers (2023-12-15T07:04:33Z) - Adapting Pretrained Text-to-Text Models for Long Text Sequences [39.62224414485055]
We adapt an existing pretrained text-to-text model for long-sequence inputs.
We build a long-context model that achieves competitive performance on long-text QA tasks.
arXiv Detail & Related papers (2022-09-21T00:41:07Z) - SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z) - Reinforced Abstractive Summarization with Adaptive Length Controlling [12.793451906532223]
Controllable summarization, especially length control, is an important issue for some practical applications.
We propose an Adaptive Length Controlling Optimization (ALCO) method that leverages a two-stage abstractive summarization model.
arXiv Detail & Related papers (2021-12-14T16:48:47Z)
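The packing and sorted batching strategy mentioned in the LongAlign entry above can be illustrated with a small, self-contained sketch. This is a hypothetical greedy first-fit packer, not LongAlign's implementation: examples are sorted by length and packed into bins whose total length stays within a chosen maximum, reducing padding waste; `max_len` and the bin layout are assumptions for illustration.

```python
from typing import List, Tuple

def pack_sorted(lengths: List[int], max_len: int) -> List[List[int]]:
    """Greedy first-fit packing over length-sorted examples.

    lengths: token length of each training example (indexed by position).
    Returns a list of bins, each a list of example indices whose total
    length fits within max_len, i.e. one packed sequence per bin.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[Tuple[int, List[int]]] = []  # (remaining capacity, example indices)
    for i in order:
        if lengths[i] > max_len:
            raise ValueError(f"example {i} exceeds max_len")
        # Place the example in the first bin with enough remaining room.
        for b in range(len(bins)):
            remaining, members = bins[b]
            if lengths[i] <= remaining:
                bins[b] = (remaining - lengths[i], members + [i])
                break
        else:
            bins.append((max_len - lengths[i], [i]))
    return [members for _, members in bins]

# Example: pack examples of varied lengths into 8k-token sequences.
print(pack_sorted([7000, 1200, 800, 3000, 4500], max_len=8192))
```

In practice, packed sequences also need attention masks (or per-example loss weighting) so that examples sharing a packed sequence do not interfere with one another.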