Related papers: HKRAG: Holistic Knowledge Retrieval-Augmented Generation Over Visually-Rich Documents

HKRAG: Holistic Knowledge Retrieval-Augmented Generation Over Visually-Rich Documents

URL: http://arxiv.org/abs/2511.20227v1
Date: Tue, 25 Nov 2025 11:59:52 GMT
Title: HKRAG: Holistic Knowledge Retrieval-Augmented Generation Over Visually-Rich Documents
Authors: Anyang Tong, Xiang Niu, ZhiPing Liu, Chang Tian, Yanyan Wei, Zenglin Shi, Meng Wang,
Abstract summary: We propose HKRAG, a new holistic RAG framework designed to explicitly capture and integrate both knowledge types.<n>Our framework features two key components: (1) a Hybrid Masking-based Holistic Retriever that employs explicit masking strategies to separately model salient and fine-print knowledge, ensuring a query-relevant holistic information retrieval; and (2) an Uncertainty-guided Agentic Generator that dynamically assesses the uncertainty of initial answers and actively decides how to integrate the two distinct knowledge streams for optimal response generation.
Score: 18.42875699937102
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing multimodal Retrieval-Augmented Generation (RAG) methods for visually rich documents (VRD) are often biased towards retrieving salient knowledge(e.g., prominent text and visual elements), while largely neglecting the critical fine-print knowledge(e.g., small text, contextual details). This limitation leads to incomplete retrieval and compromises the generator's ability to produce accurate and comprehensive answers. To bridge this gap, we propose HKRAG, a new holistic RAG framework designed to explicitly capture and integrate both knowledge types. Our framework features two key components: (1) a Hybrid Masking-based Holistic Retriever that employs explicit masking strategies to separately model salient and fine-print knowledge, ensuring a query-relevant holistic information retrieval; and (2) an Uncertainty-guided Agentic Generator that dynamically assesses the uncertainty of initial answers and actively decides how to integrate the two distinct knowledge streams for optimal response generation. Extensive experiments on open-domain visual question answering benchmarks show that HKRAG consistently outperforms existing methods in both zero-shot and supervised settings, demonstrating the critical importance of holistic knowledge retrieval for VRD understanding.

Related papers

ExDR: Explanation-driven Dynamic Retrieval Enhancement for Multimodal Fake News Detection [23.87220484843729]
multimodal fake news poses a serious societal threat.<n> Dynamic Retrieval-Augmented Generation provides a promising solution by triggering keyword-based retrieval.<n>We propose ExDR, an Explanation-driven Dynamic Retrieval-Augmented Generation framework for Multimodal Fake News Detection.
arXiv Detail & Related papers (2026-01-22T10:10:06Z)
Multi-hop Reasoning via Early Knowledge Alignment [68.28168992785896]
Early Knowledge Alignment (EKA) aims to align Large Language Models with contextually relevant retrieved knowledge.<n>EKA significantly improves retrieval precision, reduces cascading errors, and enhances both performance and efficiency.<n>EKA proves effective as a versatile, training-free inference strategy that scales seamlessly to large models.
arXiv Detail & Related papers (2025-12-23T08:14:44Z)
Separate the Wheat from the Chaff: Winnowing Down Divergent Views in Retrieval Augmented Generation [61.47019392413271]
WinnowRAG is designed to systematically filter out noisy documents while preserving valuable content.<n>WinnowRAG operates in two stages: In Stage I, we perform query-aware clustering to group similar documents and form distinct topic clusters.<n>In Stage II, we perform winnowing, wherein a critic LLM evaluates the outputs of multiple agents and iteratively separates useful documents from noisy ones.
arXiv Detail & Related papers (2025-11-01T20:08:13Z)
Open Multimodal Retrieval-Augmented Factual Image Generation [86.34546873830152]
We introduce ORIG, an agentic open multimodal retrieval-augmented framework for Factual Image Generation (FIG)<n> ORIG iteratively retrieves and filters multimodal evidence from the web and incrementally integrates the refined knowledge into enriched prompts to guide generation.<n>Experiments demonstrate that ORIG substantially improves factual consistency and overall image quality over strong baselines.
arXiv Detail & Related papers (2025-10-26T04:13:31Z)
Multimodal Iterative RAG for Knowledge-Intensive Visual Question Answering [8.830228556155673]
We propose MI-RAG, a framework that leverages reasoning to enhance retrieval and incorporates knowledge synthesis to refine its understanding.<n>Experiments on challenging benchmarks, including Encyclopedic VQA, InfoSeek, and OK-VQA, show that MI-RAG significantly improves both retrieval recall and answer accuracy.
arXiv Detail & Related papers (2025-08-31T11:14:54Z)
Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs [69.10441885629787]
Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge.<n>It falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts.<n>This survey synthesizes both strands under a unified reasoning-retrieval perspective.
arXiv Detail & Related papers (2025-07-13T03:29:41Z)
OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval [31.69320295943039]
Vision-language retrieval-augmented generation (RAG) has become an effective approach for tackling Knowledge-Based Visual Question Answering (KB-VQA)<n>We propose a multimodal RAG system featuring a coarse-to-fine, multi-step retrieval that harmonizes multiple granularities and modalities to enhance efficacy.
arXiv Detail & Related papers (2025-05-10T14:24:41Z)
UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities [53.76854299076118]
UniversalRAG is a novel RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities.<n>We propose a modality-aware routing mechanism that dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it.<n>We validate UniversalRAG on 8 benchmarks spanning multiple modalities, showing its superiority over various modality-specific and unified baselines.
arXiv Detail & Related papers (2025-04-29T13:18:58Z)
Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering [12.622529359686016]
Visual Question Answering (VQA) focuses on providing answers to natural language questions by utilizing information from images.<n>Retrieval-augmented generation (RAG) leveraging external knowledge bases (KBs) emerges as a promising approach.<n>This study presents two key innovations. First, we introduce fine-grained knowledge units that consist of multimodal data fragments.<n>Second, we propose a knowledge unit retrieval-augmented generation framework (KU-RAG) that seamlessly integrates fine-grained retrieval with MLLMs.
arXiv Detail & Related papers (2025-02-28T11:25:38Z)
ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents [27.90338725230132]
ViDoSeek is a dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning.<n>We propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents.<n> Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark.
arXiv Detail & Related papers (2025-02-25T09:26:12Z)
TrustRAG: An Information Assistant with Retrieval Augmented Generation [73.84864898280719]
TrustRAG is a novel framework that enhances acRAG from three perspectives: indexing, retrieval, and generation.<n>We open-source the TrustRAG framework and provide a demonstration studio designed for excerpt-based question answering tasks.
arXiv Detail & Related papers (2025-02-19T13:45:27Z)
Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual Recognition [57.08108545219043]
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision. Existing literature addresses this challenge by employing local-based representation approaches. This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
arXiv Detail & Related papers (2023-05-12T00:13:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.