BayesRAG: Probabilistic Mutual Evidence Corroboration for Multimodal Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2601.07329v1
- Date: Mon, 12 Jan 2026 08:53:14 GMT
- Title: BayesRAG: Probabilistic Mutual Evidence Corroboration for Multimodal Retrieval-Augmented Generation
- Authors: Xuan Li, Yining Wang, Haocai Luo, Shengping Liu, Jerry Liang, Ying Fu, Weihuang, Jun Yu, Junnan Zhu
- Abstract summary: BayesRAG is a novel multimodal retrieval framework grounded in Bayesian inference and Dempster-Shafer evidence theory. We show that BayesRAG significantly outperforms state-of-the-art (SOTA) methods on challenging multimodal benchmarks.
- Score: 33.53566598271416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-Augmented Generation (RAG) has become a pivotal paradigm for Large Language Models (LLMs), yet current approaches struggle with visually rich documents by treating text and images as isolated retrieval targets. Existing methods relying solely on cosine similarity often fail to capture the semantic reinforcement provided by cross-modal alignment and layout-induced coherence. To address these limitations, we propose BayesRAG, a novel multimodal retrieval framework grounded in Bayesian inference and Dempster-Shafer evidence theory. Unlike traditional approaches that rank candidates strictly by similarity, BayesRAG models the intrinsic consistency of retrieved candidates across modalities as probabilistic evidence to refine retrieval confidence. Specifically, our method computes the posterior association probability for combinations of multimodal retrieval results, prioritizing text-image pairs that mutually corroborate each other in terms of both semantics and layout. Extensive experiments demonstrate that BayesRAG significantly outperforms state-of-the-art (SOTA) methods on challenging multimodal benchmarks. This study establishes a new paradigm for multimodal retrieval fusion that effectively resolves the isolation of heterogeneous modalities through an evidence fusion mechanism and enhances the robustness of retrieval outcomes. Our code is available at https://github.com/TioeAre/BayesRAG.
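The abstract describes the fusion mechanism only at a high level, so the following is a minimal illustrative sketch in Python, not the authors' released implementation. Under the assumption that each modality's similarity score can be converted into a simple mass function over {relevant, irrelevant}, it fuses text, image, and layout evidence with Dempster's rule of combination and re-ranks text-image candidate pairs by the resulting belief in relevance. All function names, the mass-assignment heuristic, and the toy scores are assumptions for illustration.

```python
"""Illustrative sketch of evidence-fusion re-ranking for text-image pairs.
Not the BayesRAG implementation; a toy Dempster-Shafer example only."""

from itertools import product


def mass_from_score(score: float, uncertainty: float = 0.2) -> dict:
    """Turn a similarity score in [0, 1] into a mass function over the frame
    {relevant 'R', irrelevant 'N'}, reserving mass on the whole frame 'RN'
    to represent ignorance (a modeling assumption for this sketch)."""
    s = min(max(score, 0.0), 1.0)
    return {"R": s * (1 - uncertainty),
            "N": (1 - s) * (1 - uncertainty),
            "RN": uncertainty}


def dempster_combine(m1: dict, m2: dict) -> dict:
    """Dempster's rule of combination for the two-hypothesis frame above."""
    intersect = {("R", "R"): "R", ("R", "RN"): "R", ("RN", "R"): "R",
                 ("N", "N"): "N", ("N", "RN"): "N", ("RN", "N"): "N",
                 ("RN", "RN"): "RN"}
    combined = {"R": 0.0, "N": 0.0, "RN": 0.0}
    conflict = 0.0
    for a, b in product(m1, m2):
        focal = intersect.get((a, b))
        if focal is None:                 # R vs. N: conflicting evidence
            conflict += m1[a] * m2[b]
        else:
            combined[focal] += m1[a] * m2[b]
    norm = 1.0 - conflict                 # renormalize away the conflict mass
    return {k: v / norm for k, v in combined.items()}


def rank_pairs(candidates):
    """candidates: list of (pair_id, text_sim, image_sim, layout_coherence).
    Fuses the three evidence sources and ranks pairs by belief in relevance."""
    scored = []
    for pair_id, text_sim, image_sim, layout in candidates:
        m = dempster_combine(mass_from_score(text_sim), mass_from_score(image_sim))
        m = dempster_combine(m, mass_from_score(layout))
        scored.append((pair_id, m["R"]))
    return sorted(scored, key=lambda x: x[1], reverse=True)


if __name__ == "__main__":
    # Toy example: pair B has a slightly lower text score than A, but its image
    # and layout evidence corroborate it, so it is ranked first after fusion.
    pairs = [("A", 0.82, 0.40, 0.35), ("B", 0.78, 0.74, 0.70)]
    print(rank_pairs(pairs))
```

In this toy setup, a pair whose image and layout evidence corroborate its text match overtakes a pair with a marginally higher text score alone, which mirrors the mutual-corroboration idea stated in the abstract; the actual BayesRAG posterior computation may differ.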
Related papers
- DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories [52.57197752244638]
We introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data.
arXiv Detail & Related papers (2026-02-11T12:51:10Z) - Propagating Similarity, Mitigating Uncertainty: Similarity Propagation-enhanced Uncertainty for Multimodal Recommendation [26.819070711100206]
We propose a novel framework, Similarity Propagation-enhanced Uncertainty for Multimodal Recommendation (SPUMR). SPUMR explicitly models and mitigates uncertainty by first constructing the Modality Similarity Graph and the Collaborative Similarity Graph. Experiments on three benchmark datasets demonstrate that SPUMR achieves significant improvements over existing leading methods.
arXiv Detail & Related papers (2026-01-27T04:53:59Z) - MMRAG-RFT: Two-stage Reinforcement Fine-tuning for Explainable Multi-modal Retrieval-augmented Generation [31.90681057778075]
Multi-modal Retrieval-Augmented Generation (MMRAG) enables highly credible generation by integrating external multi-modal knowledge. Existing MMRAG methods fail to clarify the reasoning logic behind retrieval and response generation.
arXiv Detail & Related papers (2025-12-19T03:19:54Z) - Empirical Bayesian Multi-Bandit Learning [8.980876474818153]
Multi-task learning in contextual bandits has attracted significant research interest. We propose a novel hierarchical Bayesian framework for learning in various bandit instances. We show that our algorithms achieve lower cumulative regret compared to existing techniques.
arXiv Detail & Related papers (2025-10-30T09:08:07Z) - UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG [82.84014669683863]
Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models to real-world knowledge bases. UniDoc-Bench is the first large-scale, realistic benchmark for MM-RAG built from 70k real-world PDF pages. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval.
arXiv Detail & Related papers (2025-10-04T04:30:13Z) - METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark [48.78602579128459]
We introduce METER, a unified benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content. Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations.
arXiv Detail & Related papers (2025-07-22T03:42:51Z) - Exploring Image Generation via Mutually Exclusive Probability Spaces and Local Correlation Hypothesis [9.946694131713611]
A common assumption in probabilistic generative models for image generation is that learning the global data distribution suffices to generate novel images via sampling. We investigate the limitation of this core assumption, namely that learning global distributions leads to memorization rather than generative behavior.
arXiv Detail & Related papers (2025-06-26T19:32:29Z) - RoHyDR: Robust Hybrid Diffusion Recovery for Incomplete Multimodal Emotion Recognition [17.612203615672744]
We propose a novel framework that performs missing-modality recovery at unimodal, multimodal, feature, and semantic levels. In contrast to previous work, the hybrid diffusion and adversarial learning-based recovery mechanism in RoHyDR allows recovery of missing information in both unimodal representation and multimodal fusion. Our proposed method outperforms state-of-the-art IMER methods, achieving robust recognition performance under various missing-modality scenarios.
arXiv Detail & Related papers (2025-05-23T05:52:17Z) - GMM-Based Comprehensive Feature Extraction and Relative Distance Preservation For Few-Shot Cross-Modal Retrieval [13.928213494843744]
Few-shot cross-modal retrieval focuses on learning cross-modal representations with limited training samples. Existing methods often fail to adequately model the multi-peak distribution of few-shot cross-modal data. We introduce a new strategy for cross-modal semantic alignment, which constrains the relative distances between image and text feature distributions.
arXiv Detail & Related papers (2025-05-19T16:25:55Z) - CART: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
Cross-modal retrieval aims to search for instances that are semantically related to the query through the interaction of different modal data. Traditional solutions utilize a single-tower or dual-tower framework to explicitly compute the score between queries and candidates. We propose a generative cross-modal retrieval framework (CART) based on coarse-to-fine semantic modeling.
arXiv Detail & Related papers (2024-06-25T12:47:04Z) - Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval [139.21955930418815]
Cross-modal Retrieval methods build similarity relations between vision and language modalities by jointly learning a common representation space.
However, the predictions are often unreliable due to aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts.
We propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arising from inherent data ambiguity.
arXiv Detail & Related papers (2023-09-29T09:41:19Z) - Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis [89.04041100520881]
This research proposes to retrieve textual and visual evidence based on the object, sentence, and whole image.
We develop a novel approach to synthesize the object-level, image-level, and sentence-level information for better reasoning between the same and different modalities.
arXiv Detail & Related papers (2023-05-25T15:26:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.