Lost-in-the-Middle in Long-Text Generation: Synthetic Dataset, Evaluation Framework, and Mitigation
- URL: http://arxiv.org/abs/2503.06868v1
- Date: Mon, 10 Mar 2025 02:44:36 GMT
- Title: Lost-in-the-Middle in Long-Text Generation: Synthetic Dataset, Evaluation Framework, and Mitigation
- Authors: Junhao Zhang, Richong Zhang, Fanshuang Kong, Ziyang Miao, Yanhan Ye, Yaowei Zheng
- Abstract summary: Long-text generation methods primarily concentrate on producing lengthy texts from short inputs. As the input grows in length, existing methods inevitably encounter the "lost-in-the-middle" phenomenon. We develop the Retrieval-Augmented Long-Text Writer (RAL-Writer), which retrieves and restates important yet overlooked content.
- Score: 22.0671489874715
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing long-text generation methods primarily concentrate on producing lengthy texts from short inputs, neglecting the long-input and long-output tasks. Such tasks have numerous practical applications while lacking available benchmarks. Moreover, as the input grows in length, existing methods inevitably encounter the "lost-in-the-middle" phenomenon. In this paper, we first introduce a Long Input and Output Benchmark (LongInOutBench), including a synthetic dataset and a comprehensive evaluation framework, addressing the challenge of the missing benchmark. We then develop the Retrieval-Augmented Long-Text Writer (RAL-Writer), which retrieves and restates important yet overlooked content, mitigating the "lost-in-the-middle" issue by constructing explicit prompts. We finally employ the proposed LongInOutBench to evaluate our RAL-Writer against comparable baselines, and the results demonstrate the effectiveness of our approach. Our code has been released at https://github.com/OnlyAR/RAL-Writer.
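To make the "retrieve and restate" idea concrete, below is a minimal Python sketch of how such a prompt could be assembled. The chunking scheme, the token-overlap relevance score, and the prompt wording are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
from typing import List

def split_into_chunks(text: str, chunk_size: int = 512) -> List[str]:
    # Naive fixed-size character chunking of the long input.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def score_relevance(chunk: str, task: str) -> float:
    # Placeholder relevance score via token overlap; a real system would
    # likely use an embedding-based retriever instead.
    task_tokens = set(task.lower().split())
    chunk_tokens = set(chunk.lower().split())
    return len(task_tokens & chunk_tokens) / (len(task_tokens) or 1)

def build_prompt(long_input: str, task: str, top_k: int = 3) -> str:
    # Retrieve the most task-relevant chunks (often buried in the middle
    # of a long input) and restate them explicitly near the end of the
    # prompt, where models tend to attend more reliably.
    chunks = split_into_chunks(long_input)
    top_chunks = sorted(chunks, key=lambda c: score_relevance(c, task), reverse=True)[:top_k]
    restated = "\n".join(f"- {c.strip()}" for c in top_chunks)
    return (
        f"{long_input}\n\n"
        f"Key passages restated (do not overlook these):\n{restated}\n\n"
        f"Task: {task}"
    )
```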
Related papers
- LiveLongBench: Tackling Long-Context Understanding for Spoken Texts from Live Streams [4.917265821383127]
We construct the first spoken long-text dataset, derived from live streams, to reflect the redundancy-rich and conversational nature of real-world scenarios.
We evaluate both popular LLMs and specialized methods to assess their ability to understand long contexts in these tasks.
Our findings highlight key limitations of current methods and suggest future directions for improving long-context understanding.
arXiv Detail & Related papers (2025-04-24T08:27:48Z)
- RAPID: Efficient Retrieval-Augmented Long Text Generation with Writing Planning and Information Discovery [69.41989381702858]
Existing methods, such as direct generation and multi-agent discussion, often struggle with issues like hallucinations, topic incoherence, and significant latency. We propose RAPID, an efficient retrieval-augmented long text generation framework. Our work provides a robust and efficient solution to the challenges of automated long-text generation.
arXiv Detail & Related papers (2025-03-02T06:11:29Z)
- Emulating Retrieval Augmented Generation via Prompt Engineering for Enhanced Long Context Comprehension in LLMs [23.960451986662996]
This paper proposes a method that emulates Retrieval Augmented Generation (RAG) through specialized prompt engineering and chain-of-thought reasoning. We evaluate our approach on selected tasks from BABILong, which interleaves standard bAbI QA problems with large amounts of distractor text.
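As a rough illustration of this idea, the sketch below builds a single prompt that asks the model to first quote relevant passages and then reason over them step by step; the template wording and step structure are assumptions, not the paper's exact prompt.

```python
def build_rag_emulation_prompt(context: str, question: str) -> str:
    # A single prompt that emulates a retrieve-then-reason pipeline:
    # the model is first asked to quote ("retrieve") relevant passages,
    # then to reason over only those quotes before answering.
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Step 1: Quote the sentences from the context that are relevant to the question.\n"
        "Step 2: Reason step by step using only the quoted sentences.\n"
        "Step 3: State the final answer."
    )
```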
arXiv Detail & Related papers (2025-02-18T02:49:40Z)
- LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation [79.90766312484489]
LongReD (Long Context Pre-training with Restoration Distillation) distills the hidden states of selected layers from the original model on short texts. Experiments on common text benchmarks demonstrate that LongReD effectively preserves the model's short-text performance.
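A minimal sketch of what hidden-state distillation on short texts could look like, assuming a frozen original model as the teacher; the layer indices and plain MSE objective below are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def restoration_distill_loss(student_hidden, teacher_hidden, layers=(4, 8, 12)):
    # student_hidden / teacher_hidden: sequences of per-layer tensors of
    # shape (batch, seq_len, hidden_dim), e.g. from a forward pass with
    # output_hidden_states=True. The teacher (original short-context
    # model) is frozen, so its states are detached.
    loss = 0.0
    for layer in layers:
        loss = loss + F.mse_loss(student_hidden[layer], teacher_hidden[layer].detach())
    return loss / len(layers)
```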
arXiv Detail & Related papers (2025-02-11T08:37:16Z)
- LongDPO: Unlock Better Long-form Generation Abilities for LLMs via Critique-augmented Stepwise Information [76.26257306813899]
Long-form generation is crucial for academic paper writing and repository-level code generation.
Existing methods that utilize preference learning with outcome supervision often fail to provide detailed feedback for extended contexts.
We propose enhancing long-form generation by incorporating process supervision.
arXiv Detail & Related papers (2025-02-04T08:25:17Z)
- HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models [89.28591263741973]
We introduce the Hierarchical Long Text Generation Benchmark (HelloBench) to evaluate Large Language Models' performance in generating long text.
Based on Bloom's taxonomy, HelloBench categorizes long text generation tasks into five subtasks: open-ended QA, summarization, chat, text completion, and text generation.
Besides, we propose Hierarchical Long Text Evaluation (HelloEval), a human evaluation method that significantly reduces the time and effort required for human evaluation.
arXiv Detail & Related papers (2024-09-24T15:38:11Z)
- Long Code Arena: a Set of Benchmarks for Long-Context Code Models [75.70507534322336]
Long Code Arena is a suite of six benchmarks for code processing tasks that require project-wide context.
These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization.
For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions.
arXiv Detail & Related papers (2024-06-17T14:58:29Z)
- XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies [45.31042312867939]
Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks but are constrained by their small context window sizes.
Various methods have been proposed to expand the context window to accommodate up to 200K input tokens.
We introduce a benchmark for extremely long context understanding with long-range dependencies, XL$^2$Bench.
arXiv Detail & Related papers (2024-04-08T12:29:07Z)
- SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.