More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG
- URL: http://arxiv.org/abs/2503.04388v1
- Date: Thu, 06 Mar 2025 12:38:17 GMT
- Title: More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG
- Authors: Shahar Levy, Nir Mazor, Lihi Shalmon, Michael Hassid, Gabriel Stanovsky
- Abstract summary: We evaluate various language models on custom datasets derived from a multi-hop QA task. We keep the context length and position of relevant information constant while varying the number of documents, and find that increasing the document count in RAG settings poses significant challenges for LLMs.
- Score: 15.918767599423857
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Retrieval-augmented generation (RAG) provides LLMs with relevant documents. Although previous studies noted that retrieving many documents can degrade performance, they did not isolate how the quantity of documents affects performance while controlling for context length. We evaluate various language models on custom datasets derived from a multi-hop QA task. We keep the context length and position of relevant information constant while varying the number of documents, and find that increasing the document count in RAG settings poses significant challenges for LLMs. Additionally, our results indicate that processing multiple documents is a separate challenge from handling long contexts. We also make the datasets and code available: https://github.com/shaharl6000/MoreDocsSameLen .
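The controlled setup described in the abstract (same total context length, same position of the relevant information, different number of documents) can be illustrated with a small sketch. This is not the authors' pipeline: the word-level length counting, the filler-based padding, the single relevant passage placed at the start of the context, and the names `fit_to_length` and `build_context` are all assumptions made for illustration.

```python
from typing import List


def fit_to_length(words: List[str], target: int, filler: List[str]) -> List[str]:
    """Trim or pad `words` to exactly `target` words, cycling through `filler`."""
    if len(words) < target and not filler:
        raise ValueError("filler must be non-empty when padding is needed")
    out = list(words[:target])
    i = 0
    while len(out) < target:
        out.append(filler[i % len(filler)])
        i += 1
    return out


def build_context(relevant: str, distractors: List[str], num_docs: int,
                  total_words: int, filler: str) -> List[str]:
    """Build `num_docs` documents whose word counts sum to `total_words`.

    Hypothetical simplification: the relevant passage always opens the first
    document, so its offset in the context stays fixed as `num_docs` changes.
    """
    if not 1 <= num_docs <= len(distractors) + 1:
        raise ValueError("num_docs needs num_docs - 1 distractors")
    base, extra = divmod(total_words, num_docs)
    if len(relevant.split()) > base:
        raise ValueError("per-document budget too small for the relevant passage")
    sources = [relevant] + distractors[:num_docs - 1]
    filler_words = filler.split()
    docs = []
    for i, text in enumerate(sources):
        # Spread the remainder over the first `extra` documents.
        budget = base + (1 if i < extra else 0)
        docs.append(" ".join(fit_to_length(text.split(), budget, filler_words)))
    return docs


if __name__ == "__main__":
    relevant = "The bridge was completed in 1932 ."
    distractors = [f"Unrelated note number {i} ." for i in range(9)]
    for k in (2, 5, 10):
        docs = build_context(relevant, distractors, k, 200, "lorem ipsum dolor")
        print(k, sum(len(d.split()) for d in docs))
```

With fewer documents, each one is padded to a longer length, so the total word count stays the same across conditions and only the document count varies.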
Related papers
- Docopilot: Improving Multimodal Models for Document-Level Understanding [87.60020625241178]
We present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG.
arXiv Detail & Related papers (2025-07-19T16:03:34Z)
- Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark [6.722613897911759]
Document Haystack is a benchmark designed to evaluate the performance of Vision Language Models (VLMs) on long documents. Document Haystack features documents ranging from 5 to 200 pages and strategically inserts pure text or multimodal text+image "needles" at various depths within the documents.
arXiv Detail & Related papers (2025-07-18T19:33:15Z)
- On the Reproducibility of Learned Sparse Retrieval Adaptations for Long Documents [2.186901738997927]
We reproduce and examine the mechanisms of adapting Learned Sparse Retrieval (LSR) for long documents.
Our experiments confirmed the importance of specific segments, with the first segment consistently dominating document retrieval performance.
We re-evaluated recently proposed methods -- ExactSDM and SoftSDM -- across varying document lengths.
arXiv Detail & Related papers (2025-03-31T08:19:31Z)
- HERA: Improving Long Document Summarization using Large Language Models with Context Packaging and Reordering [6.876612430571396]
We propose a novel summary generation framework, called HERA. We first segment a long document by its semantic structure and retrieve text segments about the same event, and finally reorder them to form the input context. The experimental results show that HERA outperforms foundation models in ROUGE, BERTScore and faithfulness metrics.
arXiv Detail & Related papers (2025-02-01T14:55:06Z)
- M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding [63.33447665725129]
We introduce M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various document contexts.
M3DocRAG can efficiently handle single or many documents while preserving visual information.
We also present M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages.
arXiv Detail & Related papers (2024-11-07T18:29:38Z)
- SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding [103.69014172427026]
Multimodal large language models (MLLMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page visually-rich documents.
We present a novel framework named **S**elf-**V**isual **R**etrieval-**A**ugmented **G**eneration (SV-RAG) which can broaden horizons of any MLLM to support long-document understanding.
arXiv Detail & Related papers (2024-11-02T02:09:01Z)
- SEGMENT+: Long Text Processing with Short-Context Language Models [53.40059130780192]
SEGMENT+ is a framework that enables LMs to handle extended inputs within limited context windows efficiently.
SEGMENT+ utilizes structured notes and a filtering module to manage information flow, resulting in a system that is both controllable and interpretable.
arXiv Detail & Related papers (2024-10-09T03:40:22Z)
- DR-RAG: Applying Dynamic Document Relevance to Retrieval-Augmented Generation for Question-Answering [4.364937306005719]
RAG has recently improved the performance of Large Language Models (LLMs) in knowledge-intensive tasks such as Question-Answering (QA).
We have found that even though some critical documents have low relevance to the query, it is possible to retrieve the remaining documents by combining parts of the documents with the query.
A two-stage retrieval framework called Dynamic-Relevant Retrieval-Augmented Generation (DR-RAG) is proposed to improve document retrieval recall and the accuracy of answers.
arXiv Detail & Related papers (2024-06-11T15:15:33Z)
- PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure.
We propose PDFTriage that enables models to retrieve the context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
- DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)