$G^2$-Reader: Dual Evolving Graphs for Multimodal Document QA
- URL: http://arxiv.org/abs/2601.22055v1
- Date: Thu, 29 Jan 2026 17:52:54 GMT
- Title: $G^2$-Reader: Dual Evolving Graphs for Multimodal Document QA
- Authors: Yaxin Du, Junru Song, Yifan Zhou, Cheng Wang, Jiahao Gu, Zimeng Chen, Menglan Chen, Wen Yao, Yang Yang, Ying Wen, Siheng Chen,
- Abstract summary: $G2$-Reader is a dual-graph system for multimodal question answering.<n>$G2$-Reader with Qwen3-VL-32B-Instruct reaches 66.21% average accuracy, outperforming strong baselines and a standalone GPT-5 (53.08%)<n>On VisDoMBench across five multimodal domains, $G2$-Reader with Qwen3-VL-32B-Instruct reaches 66.21% average accuracy, outperforming strong baselines and a standalone GPT-5 (53.08%)
- Score: 53.491241153213565
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-augmented generation is a practical paradigm for question answering over long documents, but it remains brittle for multimodal reading where text, tables, and figures are interleaved across many pages. First, flat chunking breaks document-native structure and cross-modal alignment, yielding semantic fragments that are hard to interpret in isolation. Second, even iterative retrieval can fail in long contexts by looping on partial evidence or drifting into irrelevant sections as noise accumulates, since each step is guided only by the current snippet without a persistent global search state. We introduce $G^2$-Reader, a dual-graph system, to address both issues. It evolves a Content Graph to preserve document-native structure and cross-modal semantics, and maintains a Planning Graph, an agentic directed acyclic graph of sub-questions, to track intermediate findings and guide stepwise navigation for evidence completion. On VisDoMBench across five multimodal domains, $G^2$-Reader with Qwen3-VL-32B-Instruct reaches 66.21\% average accuracy, outperforming strong baselines and a standalone GPT-5 (53.08\%).
Related papers
- MoDora: Tree-Based Semi-Structured Document Analysis System [62.01015188258797]
Semi-structured documents integrate diverse interleaved data elements arranged in various and often irregular layouts.<n>MoDora is an LLM-powered system for semi-structured document analysis.<n> Experiments show MoDora outperforms baselines by 5.97%-61.07% in accuracy.
arXiv Detail & Related papers (2026-02-26T14:48:49Z) - LILaC: Late Interacting in Layered Component Graph for Open-domain Multimodal Multihop Retrieval [13.855117422052315]
LILaC is a multimodal retrieval framework featuring two core innovations.<n>First, we introduce a layered component graph, explicitly representing multimodal information at two layers.<n>Second, we develop a late-interaction-based subgraph retrieval method.
arXiv Detail & Related papers (2026-02-04T06:55:48Z) - WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora [34.720109050809285]
Graph-based Retrieval-Augmented Generation (GraphRAG) organizes external knowledge as a hierarchical graph.<n>Many existing benchmarks for GraphRAG rely on short, curated passages as external knowledge.<n>We introduce WildGraphBench, a benchmark designed to assess GraphRAG performance in the wild.
arXiv Detail & Related papers (2026-02-02T12:55:29Z) - N2N-GQA: Noise-to-Narrative for Graph-Based Table-Text Question Answering Using LLMs [0.0]
Multi-hop question answering over hybrid table-text data requires retrieving and reasoning across multiple evidence pieces from large corpora.<n>Standard Retrieval-Augmented Generation (RAG) pipelines process documents as flat ranked lists, causing retrieval noise to obscure reasoning chains.<n>N2N-GQA is the first zeroshot framework for open-domain hybrid table-text QA that constructs dynamic evidence graphs from noisy retrieval outputs.
arXiv Detail & Related papers (2026-01-10T15:55:15Z) - Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding [49.26132236798123]
Vision Language Models (VLMs) have gradually become a primary approach in document understanding.<n>We propose SLEUTH, a multi agent framework that orchestrates a retriever and four collaborative agents in a coarse to fine process.<n>The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy.
arXiv Detail & Related papers (2025-11-28T03:09:40Z) - MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns [80.05126590825121]
MonkeyOCR v1.5 is a unified vision-language framework that enhances both layout understanding and content recognition.<n>To address complex table structures, we propose a visual consistency-based reinforcement learning scheme.<n>Two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables.
arXiv Detail & Related papers (2025-11-13T15:12:17Z) - MMESGBench: Pioneering Multimodal Understanding and Complex Reasoning Benchmark for ESG Tasks [56.350173737493215]
Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency.<n>MMESGBench is a first-of-its-kind benchmark dataset to evaluate multimodal understanding and complex reasoning across structurally diverse and multi-source ESG documents.<n>MMESGBench comprises 933 validated QA pairs derived from 45 ESG documents, spanning across seven distinct document types and three major ESG source categories.
arXiv Detail & Related papers (2025-07-25T03:58:07Z) - Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval [22.33550491040999]
RAG grounds large language models in external evidence, yet it still falters when answers must be pieced together across semantically distant documents.<n>We build two plug-and-play retrievers: StatementGraphRAG and TopicGraphRAG.<n>Our methods outperform naive chunk-based RAG achieving an average relative improvement of 23.1% in retrieval recall and correctness.
arXiv Detail & Related papers (2025-06-09T17:58:35Z) - Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs [12.745520645025808]
We benchmark Graph Neural Network (GNN) architectures for the task of fine-grained layout classification of text blocks from digital native documents.<n>Our results demonstrate that GraphSAGE operating on the k-closest-neighbor graph in a dual-branch configuration achieves the highest per-class and overall accuracy.
arXiv Detail & Related papers (2025-05-12T10:59:30Z) - SgSum: Transforming Multi-document Summarization into Sub-graph
Selection [27.40759123902261]
Most existing extractive multi-document summarization (MDS) methods score each sentence individually and extract salient sentences one by one to compose a summary.
We propose a novel MDS framework (SgSum) to formulate the MDS task as a sub-graph selection problem.
Our model can produce significantly more coherent and informative summaries compared with traditional MDS methods.
arXiv Detail & Related papers (2021-10-25T05:12:10Z) - BASS: Boosting Abstractive Summarization with Unified Semantic Graph [49.48925904426591]
BASS is a framework for Boosting Abstractive Summarization based on a unified Semantic graph.
A graph-based encoder-decoder model is proposed to improve both the document representation and summary generation process.
Empirical results show that the proposed architecture brings substantial improvements for both long-document and multi-document summarization tasks.
arXiv Detail & Related papers (2021-05-25T16:20:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.