Multi-view Content-aware Indexing for Long Document Retrieval
- URL: http://arxiv.org/abs/2404.15103v1
- Date: Tue, 23 Apr 2024 14:55:32 GMT
- Title: Multi-view Content-aware Indexing for Long Document Retrieval
- Authors: Kuicai Dong, Derrick Goh Xin Deik, Yi Quan Lee, Hao Zhang, Xiangyang Li, Cong Zhang, Yong Liu
- Abstract summary: Long document question answering (DocQA) aims to answer questions from long documents over 10k words.
We propose Multi-view Content-aware indexing (MC-indexing) for more effective long DocQA.
MC-indexing significantly increases recall by 42.8%, 30.0%, 23.9%, and 16.3% at top-k = 1.5, 3, 5, and 10, respectively.
- Score: 19.74258792456242
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long document question answering (DocQA) aims to answer questions from long documents of over 10k words. Such documents usually contain content structures such as sections, sub-sections, and paragraph demarcations. However, indexing methods for long documents remain under-explored, and existing systems generally employ fixed-length chunking. Because fixed-length chunks ignore content structure, they can exclude vital information or include irrelevant content. Motivated by this, we propose Multi-view Content-aware indexing (MC-indexing) for more effective long DocQA via (i) segmenting the structured document into content chunks, and (ii) representing each content chunk in raw-text, keywords, and summary views. We highlight that MC-indexing requires neither training nor fine-tuning; being plug-and-play, it can be seamlessly integrated with any retriever to boost its performance. In addition, we propose a long DocQA dataset that includes not only question-answer pairs, but also document structure and answer scope. Compared to state-of-the-art chunking schemes, MC-indexing significantly increases recall by 42.8%, 30.0%, 23.9%, and 16.3% at top-k = 1.5, 3, 5, and 10, respectively. These improvements are averages over 8 widely used retrievers (2 sparse and 6 dense) in extensive experiments.
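The scheme described in the abstract is concrete enough to sketch. Below is a minimal, illustrative Python rendering of the multi-view idea, not the authors' code: structure-aligned chunks are stored once, and each carries three searchable views (raw text, keywords, summary). `extract_keywords` and `summarize` are stand-ins for whatever keyword extractor and summarization model a deployment plugs in.

```python
from dataclasses import dataclass

@dataclass
class ContentChunk:
    """One structure-aligned chunk (e.g., a section) with three views."""
    section_title: str
    raw_text: str
    keywords: str
    summary: str

def extract_keywords(text: str) -> str:
    # Stand-in: a real system would use an LLM or keyword extractor here.
    words = sorted({w.lower().strip(".,;:") for w in text.split() if len(w) > 6})
    return ", ".join(words[:10])

def summarize(text: str) -> str:
    # Stand-in: a real system would call a summarization model here.
    return text[:200]

def build_index(sections):
    """Segment along content structure (one chunk per section), then
    attach the keyword and summary views to each chunk."""
    return [ContentChunk(title, body, extract_keywords(body), summarize(body))
            for title, body in sections]

def retrieve(chunks, query, score_fn, k=3):
    """Score each chunk by its best-matching view and return the top-k.
    score_fn can be any retriever's query-text scorer (sparse or dense)."""
    def best_view(c):
        return max(score_fn(query, v)
                   for v in (c.raw_text, c.keywords, c.summary))
    return sorted(chunks, key=best_view, reverse=True)[:k]
```

Here `score_fn` is the only retriever-specific piece; swapping in BM25 or a dense-embedding similarity mirrors the paper's claim that the indexing is retriever-agnostic and needs no training or fine-tuning.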
Related papers
- MoDora: Tree-Based Semi-Structured Document Analysis System [62.01015188258797]
Semi-structured documents integrate diverse interleaved data elements arranged in various and often irregular layouts.
MoDora is an LLM-powered system for semi-structured document analysis.
Experiments show MoDora outperforms baselines by 5.97%-61.07% in accuracy.
arXiv Detail & Related papers (2026-02-26T14:48:49Z) - MLDocRAG: Multimodal Long-Context Document Retrieval Augmented Generation [3.537921035534424]
Multimodal Chunk-Query Graph (MCQG) generates semantically rich, answerable queries from heterogeneous document chunks.
This graph-based structure enables selective, query-centric retrieval and structured evidence aggregation.
Experiments on the MMLongBench-Doc and LongDocURL datasets demonstrate that MLDocRAG consistently improves retrieval quality and answer accuracy.
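A rough sketch of what a chunk-query graph could look like, under loud assumptions: `gen_queries` and `sim` are hypothetical stand-ins for the LLM query generator and query matcher, and the bipartite-dict representation is a guess at the structure the summary describes, not MLDocRAG's actual design.

```python
def build_chunk_query_graph(chunks, gen_queries):
    """Map each generated query to the chunks that can answer it.
    gen_queries(chunk) stands in for an LLM prompted to write
    answerable questions for a chunk (text, table, or figure)."""
    graph = {}
    for cid, chunk in enumerate(chunks):
        for q in gen_queries(chunk):
            graph.setdefault(q, set()).add(cid)
    return graph

def retrieve_via_graph(graph, user_query, sim, k=3):
    """Match the user query against the generated queries, then follow
    the edges back to chunks (selective, query-centric retrieval)."""
    ranked = sorted(graph, key=lambda q: sim(user_query, q), reverse=True)
    hits = []
    for q in ranked[:k]:
        hits.extend(sorted(graph[q]))
    return list(dict.fromkeys(hits))   # dedupe while keeping rank order
```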
arXiv Detail & Related papers (2026-02-10T20:29:10Z) - $G^2$-Reader: Dual Evolving Graphs for Multimodal Document QA [53.491241153213565]
$G^2$-Reader is a dual-graph system for multimodal question answering.
On VisDoMBench across five multimodal domains, $G^2$-Reader with Qwen3-VL-32B-Instruct reaches 66.21% average accuracy, outperforming strong baselines and a standalone GPT-5 (53.08%).
arXiv Detail & Related papers (2026-01-29T17:52:54Z) - Cross-Document Topic-Aligned Chunking for Retrieval-Augmented Generation [0.0]
Cross-Document Topic-Aligned chunking reconstructs knowledge at the corpus level.
It first identifies topics across documents, maps segments to each topic, and synthesizes them into unified chunks.
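Read literally, the identify-map-synthesize pipeline could be sketched as below; the KMeans clustering and `embed` function are illustrative assumptions, not the paper's actual topic-identification method.

```python
import numpy as np
from sklearn.cluster import KMeans

def topic_aligned_chunks(segments, embed, n_topics=8):
    """Cluster segments drawn from many documents into shared topics,
    then merge each topic's segments into one corpus-level chunk.
    `embed` is any text-embedding function; KMeans is a stand-in for
    whatever topic-identification step the paper actually uses."""
    vecs = np.array([embed(s) for s in segments])      # 1. embed segments
    labels = KMeans(n_clusters=n_topics, n_init=10).fit_predict(vecs)  # 2. topics
    chunks = []
    for t in range(n_topics):                          # 3. synthesize per topic
        merged = "\n".join(s for s, lab in zip(segments, labels) if lab == t)
        if merged:
            chunks.append(merged)
    return chunks
```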
arXiv Detail & Related papers (2025-11-08T11:45:45Z) - LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding [37.12229829548839]
We propose LAD-RAG, a novel Layout-Aware Dynamic RAG framework.
LAD-RAG constructs a symbolic document graph that captures layout structure and cross-page dependencies.
Experiments on MMLongBench-Doc, LongDocURL, DUDE, and MP-DocVQA demonstrate that LAD-RAG improves retrieval, achieving over 90% perfect recall on average without any top-k tuning.
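A toy rendering of a symbolic document graph, assuming one node per layout element with reading-order edges within a page and shared-text edges across pages; the edge types are guesses at what "layout structure and cross-page dependencies" cover, not LAD-RAG's actual construction.

```python
import networkx as nx

def build_document_graph(pages):
    """pages: one list of (element_id, kind, text) tuples per page.
    Nodes are layout elements; 'reading_order' edges follow layout
    within a page, and 'cross_page' edges link elements on different
    pages that share the same normalized text (a crude stand-in for
    cross-page dependencies)."""
    g = nx.DiGraph()
    for page_no, elements in enumerate(pages):
        prev = None
        for eid, kind, text in elements:
            g.add_node(eid, page=page_no, kind=kind, text=text)
            if prev is not None:
                g.add_edge(prev, eid, rel="reading_order")
            prev = eid
    seen = {}
    for node, data in g.nodes(data=True):
        seen.setdefault(data["text"].strip().lower(), []).append(node)
    for nodes in seen.values():
        for a, b in zip(nodes, nodes[1:]):
            if g.nodes[a]["page"] != g.nodes[b]["page"]:
                g.add_edge(a, b, rel="cross_page")
    return g
```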
arXiv Detail & Related papers (2025-10-08T17:02:04Z) - Docopilot: Improving Multimodal Models for Document-Level Understanding [87.60020625241178]
We present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents.
This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents.
Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG.
arXiv Detail & Related papers (2025-07-19T16:03:34Z) - Chain of Retrieval: Multi-Aspect Iterative Search Expansion and Post-Order Search Aggregation for Full Paper Retrieval [68.71038700559195]
Chain of Retrieval (COR) is a novel iterative framework for full-paper retrieval.
We present SCIBENCH, a benchmark providing both complete and segmented contexts of full papers for queries and candidates.
arXiv Detail & Related papers (2025-07-14T08:41:53Z) - A Unified Retrieval Framework with Document Ranking and EDU Filtering for Multi-document Summarization [18.13855430873805]
Current methods apply truncation after the retrieval process to fit the context length.
We propose a novel retrieval-based framework that integrates query selection and document ranking.
We evaluate our framework on multiple MDS datasets, demonstrating consistent improvements in ROUGE metrics.
arXiv Detail & Related papers (2025-04-23T13:41:10Z) - MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents [26.39534684408116]
This work introduces a new benchmark, MMDocIR, encompassing two distinct tasks: page-level and layout-level retrieval.
The MMDocIR benchmark comprises a rich dataset featuring expertly annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions.
arXiv Detail & Related papers (2025-01-15T14:30:13Z) - LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating [40.44974704748952]
Large vision-language models (LVLMs) have remarkably improved document understanding capabilities.
Existing document understanding benchmarks have been limited to handling only a small number of pages.
We develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs, covering more than 33,000 pages of documents.
arXiv Detail & Related papers (2024-12-24T13:39:32Z) - M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding [63.33447665725129]
We introduce M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various document contexts.
M3DocRAG can efficiently handle single or many documents while preserving visual information.
We also present M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages.
arXiv Detail & Related papers (2024-11-07T18:29:38Z) - Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval [49.42043077545341]
We propose a knowledge-aware query expansion framework that augments LLMs with structured document relations from a knowledge graph (KG).
We leverage document texts as rich KG node representations and use document-based relation filtering for our Knowledge-Aware Retrieval (KAR).
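A minimal sketch of the idea, assuming a simple edge-list KG and a precomputed set of query-relevant relation types; the function and schema below are illustrative, not KAR's actual interface.

```python
def knowledge_aware_expand(query, seed_doc_ids, kg_edges, doc_text,
                           allowed_rels, max_neighbors=3):
    """Expand a query with text from documents that the knowledge graph
    links to an initial result set, keeping only relation types deemed
    relevant to the query (the relation-filtering step).

    kg_edges: doc_id -> list of (relation, neighbor_doc_id)
    doc_text: doc_id -> document text (documents as KG node contents)
    """
    expansions = []
    for doc_id in seed_doc_ids:
        for rel, nbr in kg_edges.get(doc_id, []):
            if rel in allowed_rels and len(expansions) < max_neighbors:
                expansions.append(doc_text[nbr])
    return " ".join([query] + expansions)
```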
arXiv Detail & Related papers (2024-10-17T17:03:23Z) - mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding [103.05835688963947]
We propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens.
DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%.
Compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens.
arXiv Detail & Related papers (2024-09-05T11:09:00Z) - PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering [13.625303311724757]
Document Question Answering (QA) presents a challenge in understanding visually rich documents (VRDs).
We propose PDF-MVQA, which is tailored for research journal articles, encompassing multiple pages and multimodal information retrieval.
arXiv Detail & Related papers (2024-04-19T09:00:05Z) - PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these richly structured documents.
We propose PDFTriage that enables models to retrieve the context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
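One way to picture structure-or-content retrieval is as a small set of tools a model can be prompted to call; the function names and document layout below are hypothetical, not PDFTriage's actual API.

```python
def fetch_section(doc, title):
    """Structural retrieval: pull a named section from a parsed outline."""
    return doc["sections"].get(title, "")

def fetch_pages(doc, start, end):
    """Structural retrieval: pull an inclusive page range."""
    return "\n".join(doc["pages"][start:end + 1])

def search_text(doc, query):
    """Content retrieval: naive substring search over pages."""
    return [p for p in doc["pages"] if query.lower() in p.lower()]
```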
arXiv Detail & Related papers (2023-09-16T04:29:05Z) - DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of state-of-the-art (SoTA) passage retrievers, we find that the majority of errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z) - Multimodal Tree Decoder for Table of Contents Extraction in Document Images [32.46909366312659]
Table of contents (ToC) extraction aims to extract headings of different levels in documents to better understand the outline of the contents.
We first introduce a standard dataset, HierDoc, including image samples from 650 documents of scientific papers with their content labels.
We propose a novel end-to-end model using a multimodal tree decoder (MTD) for ToC extraction as a benchmark for HierDoc.
arXiv Detail & Related papers (2022-12-06T11:38:31Z) - End-to-End Multihop Retrieval for Compositional Question Answering over Long Documents [93.55268936974971]
We propose a multi-hop retrieval method, DocHopper, to answer compositional questions over long documents.
At each step, DocHopper retrieves a paragraph or sentence embedding from the document, mixes the retrieved result with the query, and updates the query for the next step.
We demonstrate that utilizing document structure in this way can largely improve question-answering and retrieval performance on long documents (a minimal sketch of the retrieval loop follows this entry).
arXiv Detail & Related papers (2021-06-01T03:13:35Z)
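The DocHopper entry above describes its retrieval loop concretely enough for a sketch: the embedding inputs and the convex additive query update below are assumptions, not the paper's exact mixing operation.

```python
import numpy as np

def multihop_retrieve(query_vec, passage_vecs, hops=3, alpha=0.5):
    """At each hop, pick the passage most similar to the current query
    vector, then mix it into the query before the next hop. The convex
    additive update is an assumption about the 'mixing' step."""
    q = np.asarray(query_vec, dtype=float)
    p = np.asarray(passage_vecs, dtype=float)
    picked = []
    for _ in range(hops):
        sims = p @ q / (np.linalg.norm(p, axis=1) * np.linalg.norm(q) + 1e-9)
        sims[picked] = -np.inf            # do not re-select earlier hops
        best = int(np.argmax(sims))
        picked.append(best)
        q = (1.0 - alpha) * q + alpha * p[best]   # update query for next hop
    return picked
```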
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.