Related papers: Never Lost in the Middle: Mastering Long-Context Question Answering with Position-Agnostic Decompositional Training

Never Lost in the Middle: Mastering Long-Context Question Answering with Position-Agnostic Decompositional Training

URL: http://arxiv.org/abs/2311.09198v2
Date: Tue, 13 Aug 2024 19:04:18 GMT
Title: Never Lost in the Middle: Mastering Long-Context Question Answering with Position-Agnostic Decompositional Training
Authors: Junqing He, Kunhao Pan, Xiaoqun Dong, Zhuoyang Song, Yibo Liu, Qianguo Sun, Yuxin Liang, Hao Wang, Enming Zhang, Jiaxing Zhang,
Abstract summary: Large language models (LLMs) are struggling to seek correct information in long contexts. This paper proposes to enhance the information searching and reflection ability of LLMs in long contexts via specially designed tasks. Experimental results show substantial improvement in Multi-doc QA and other benchmarks, superior to state-of-the-art models by 13.7% absolute gain in shuffled settings.
Score: 9.128501882000315
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While large language models (LLMs) are equipped with longer text input capabilities than before, they are struggling to seek correct information in long contexts. The "lost in the middle" problem challenges most LLMs, referring to the dramatic decline in accuracy when correct information is located in the middle. To overcome this crucial issue, this paper proposes to enhance the information searching and reflection ability of LLMs in long contexts via specially designed tasks called Attention Strengthening Multi-doc QA (ASM QA). Following these tasks, our model excels in focusing more precisely on the desired information. Experimental results show substantial improvement in Multi-doc QA and other benchmarks, superior to state-of-the-art models by 13.7% absolute gain in shuffled settings, by 21.5% in passage retrieval task. We release our model, Ziya-Reader to promote related research in the community.

Related papers

Teaching Language Models To Gather Information Proactively [53.85419549904644]
Large language models (LLMs) are increasingly expected to function as collaborative partners.<n>In this work, we introduce a new task paradigm: proactive information gathering.<n>We design a scalable framework that generates partially specified, real-world tasks, masking key information.<n>Within this setup, our core innovation is a reinforcement finetuning strategy that rewards questions that elicit genuinely new, implicit user information.
arXiv Detail & Related papers (2025-07-28T23:50:09Z)
Retrieval-Augmented Generation with Conflicting Evidence [57.66282463340297]
Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses. In practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources. We propose RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios for conflicting evidence for a user query.
arXiv Detail & Related papers (2025-04-17T16:46:11Z)
FactGuard: Leveraging Multi-Agent Systems to Generate Answerable and Unanswerable Questions for Enhanced Long-Context LLM Extraction [25.00896070082754]
Extractive reading comprehension systems are designed to locate the correct answer to a question within a given text. A persistent challenge lies in ensuring these models maintain high accuracy in answering questions while reliably recognizing unanswerable queries. We propose an innovative data augmentation methodology grounded in a multi-agent collaborative framework.
arXiv Detail & Related papers (2025-04-08T01:45:16Z)
Reducing Distraction in Long-Context Language Models by Focused Learning [6.803882766744194]
We propose a novel training method that enhances Large Language Models' ability to discern relevant information. During fine-tuning with long contexts, we employ a retriever to extract the most relevant segments. We then introduce an auxiliary contrastive learning objective to explicitly ensure that outputs from the original context and the retrieved sub-context are closely aligned.
arXiv Detail & Related papers (2024-11-08T19:27:42Z)
Distance between Relevant Information Pieces Causes Bias in Long-Context LLMs [50.40165119718928]
LongPiBench is a benchmark designed to assess positional bias involving multiple pieces of relevant information. These experiments reveal that while most current models are robust against the "lost in the middle" issue, there exist significant biases related to the spacing of relevant information pieces.
arXiv Detail & Related papers (2024-10-18T17:41:19Z)
ALR$^2$: A Retrieve-then-Reason Framework for Long-context Question Answering [42.146660039671076]
We develop a retrieve-then-reason framework for large language models (LLMs) We find that modern LLMs struggle to accurately retrieve relevant facts and instead, often hallucinate "retrieved facts" We introduce ALR$2$, a method that augments the long-context reasoning capability of LLMs via an explicit two-stage procedure.
arXiv Detail & Related papers (2024-10-04T08:29:12Z)
SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
Super aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories. Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub problems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development. We show that state-of-the-art approaches struggle to solve these problems with the best model (GPT-4o) solving only 16.3% of the end-to-end set, and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z)
Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization [97.84156490765457]
Large language models (LLMs) struggle to capture relevant information located in the middle of their input. This phenomenon has been known as the lost-in-the-middle problem. We show found-in-the-middle achieves better performance in locating relevant information within a long context.
arXiv Detail & Related papers (2024-06-23T04:35:42Z)
MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models. It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
Optimizing Language Model's Reasoning Abilities with Weak Supervision [48.60598455782159]
We present textscPuzzleBen, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales. A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore utilizing fewer supersized data to boost LLMs' inference capabilities.
arXiv Detail & Related papers (2024-05-07T07:39:15Z)
Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding [78.36702055076456]
This paper introduces Multi-scale Positional. (Ms-PoE) which is a simple yet effective plug-and-play approach to enhance the capacity of. LLMs to handle relevant information located in the middle of the context.
arXiv Detail & Related papers (2024-03-05T04:58:37Z)
Training With "Paraphrasing the Original Text" Improves Long-Context Performance [19.48556587305737]
Large Language Models (LLMs) continue to evolve, more are being designed to handle long-context inputs. We propose a novel approach to design training data for long-context tasks, aiming at augmenting LLMs' proficiency in extracting key information from long context. Experimenting on LongBench and NaturalQuestions Multi-document-QA dataset with models of Llama and Qwen series, our method achieves an improvement of up to 8.48% and 4.48% in average scores.
arXiv Detail & Related papers (2023-12-18T13:40:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.