Training With "Paraphrasing the Original Text" Improves Long-Context Performance
- URL: http://arxiv.org/abs/2312.11193v9
- Date: Wed, 21 Aug 2024 09:31:02 GMT
- Title: Training With "Paraphrasing the Original Text" Improves Long-Context Performance
- Authors: Yijiong Yu, Yongfeng Huang, Zhixiao Qi, Zhe Zhou
- Abstract summary: As Large Language Models (LLMs) continue to evolve, more are being designed to handle long-context inputs.
We propose a novel approach to designing training data for long-context tasks, aiming to augment LLMs' proficiency in extracting key information from long contexts.
Experimenting on LongBench and the NaturalQuestions Multi-document-QA dataset with models from the Llama and Qwen series, our method achieves improvements of up to 8.48% and 4.48% in average scores, respectively.
- Score: 19.48556587305737
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As Large Language Models (LLMs) continue to evolve, more are being designed to handle long-context inputs. Despite this advancement, most of them still face challenges in accurately handling long-context tasks, often showing the "lost in the middle" issue. We identify that insufficient retrieval capability is one of the important reasons for this issue. To tackle this challenge, we propose a novel approach to designing training data for long-context tasks, aiming to augment LLMs' proficiency in extracting key information from long contexts. Specifically, we incorporate an additional part named "paraphrasing the original text" when constructing the answers of training samples and then fine-tune the model. Experimenting on LongBench and the NaturalQuestions Multi-document-QA dataset with models from the Llama and Qwen series, our method achieves improvements of up to 8.48% and 4.48% in average scores, respectively, showing effectiveness in improving the model's performance on long-context tasks. The model and training data have been made available on HuggingFace (https://huggingface.co/yuyijiong/Qwen-14b-chat-yarn-32k).
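The abstract describes the data-construction recipe only at a high level: the answer of each training sample is prefixed with a part that restates ("paraphrases") the relevant original text before giving the final answer. Below is a minimal Python sketch of what such a sample might look like; the helper name `build_training_sample`, the prompt wording, and the toy data are assumptions for illustration, not the authors' released pipeline (which is linked on HuggingFace above).

```python
from typing import Dict, List

# Minimal sketch (an assumption, not the authors' released code): build a
# multi-document QA training sample whose answer first restates the
# supporting passage ("paraphrasing the original text"), then answers.

def build_training_sample(question: str, documents: List[str],
                          evidence_idx: int, answer: str) -> Dict[str, str]:
    """Construct one supervised fine-tuning example.

    The target output is prefixed with a restatement of the supporting
    document, so the model learns to first retrieve the key passage from
    the long context and only then produce the answer.
    """
    # Concatenate all documents into one long context.
    context = "\n\n".join(
        f"Document {i + 1}: {doc}" for i, doc in enumerate(documents)
    )
    prompt = (
        f"{context}\n\n"
        f"Question: {question}\n"
        "First restate the relevant original text, then answer."
    )
    # "Paraphrasing the original text": restate the evidence before the answer.
    target = (
        f"The relevant original text is: {documents[evidence_idx]}\n"
        f"Answer: {answer}"
    )
    return {"prompt": prompt, "response": target}


# Example usage with toy data (hypothetical):
sample = build_training_sample(
    question="Where was the treaty signed?",
    documents=["The treaty was signed in Vienna in 1815.",
               "Unrelated filler text about another topic."],
    evidence_idx=0,
    answer="Vienna",
)
print(sample["response"])
```

Samples of this form would then be used for ordinary supervised fine-tuning; the key design choice, per the abstract, is only the added paraphrasing part in the answer, not any change to the training objective.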
Related papers
- Language Models can Self-Lengthen to Generate Long Texts [74.96074422345806]
This paper introduces an innovative iterative training framework called Self-Lengthen.
It leverages only the intrinsic knowledge and skills of Large Language Models without the need for auxiliary data or proprietary models.
Experiments on benchmarks and human evaluations show that Self-Lengthen outperforms existing methods in long-text generation.
arXiv Detail & Related papers (2024-10-31T13:47:10Z)
- How to Train Long-Context Language Models (Effectively) [75.5418485597276]
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information.
ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K.
arXiv Detail & Related papers (2024-10-03T16:46:52Z)
- Untie the Knots: An Efficient Data Augmentation Strategy for Long-Context Pre-Training in Language Models [21.90388980448712]
Training models to handle long contexts presents significant challenges.
We introduce Untie the Knots (UtK), a novel data augmentation strategy employed during the continued pre-training phase.
We conduct extensive experiments on models with 7B and 72B parameters, trained on 20 billion tokens, demonstrating that UtK achieves 75% and 84.5% accuracy on RULER at 128K context length.
arXiv Detail & Related papers (2024-09-07T09:28:55Z)
- Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning [68.43706033424378]
This study introduces an innovative method designed to efficiently increase in-context text length in multi-modal large language models (MLLMs).
We present Visualized In-Context Text Processing (VisInContext), which processes long in-context text using visual tokens.
This technique significantly reduces GPU memory usage and floating point operations (FLOPs) for both the training and inference stages.
arXiv Detail & Related papers (2024-06-04T17:59:25Z)
- LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models [61.12177317970258]
LongSkywork is a long-context Large Language Model capable of processing up to 200,000 tokens.
We develop two novel methods for creating synthetic data.
LongSkywork achieves outstanding performance on a variety of long-context benchmarks.
arXiv Detail & Related papers (2024-06-02T03:34:41Z)
- Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models [13.091271774417867]
Long-context modeling capabilities are important for large language models (LLMs) in various applications.
We propose a data mining framework, ProLong, which assigns each training sample a long dependency score.
Comprehensive experiments on multiple benchmarks indicate that ProLong effectively identifies documents that carry long dependencies.
arXiv Detail & Related papers (2024-05-28T07:36:56Z)
- Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z)
- Stabilized In-Context Learning with Pre-trained Language Models for Few-Shot Dialogue State Tracking [57.92608483099916]
Large pre-trained language models (PLMs) have shown impressive unaided performance across many NLP tasks.
For more complex tasks such as dialogue state tracking (DST), designing prompts that reliably convey the desired intent is nontrivial.
We introduce a saliency model to limit dialogue text length, allowing us to include more exemplars per query.
arXiv Detail & Related papers (2023-02-12T15:05:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.