Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG
- URL: http://arxiv.org/abs/2507.20136v1
- Date: Sun, 27 Jul 2025 05:45:45 GMT
- Title: Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG
- Authors: Baiyu Chen, Wilson Wongso, Xiaoqian Hu, Yue Tan, Flora Salim
- Abstract summary: This paper presents the technical solution developed by team CRUISE for the KDD Cup 2025 Meta Comprehensive RAG Benchmark for Multi-modal, Multi-turn (CRAG-MM) challenge. The challenge aims to address a critical limitation of modern Vision Language Models (VLMs): their propensity to hallucinate. Our solution integrates a lightweight query router for efficiency, a query-aware retrieval and summarization pipeline, a dual-pathway generation stage, and a post-hoc verification step.
- Score: 3.9063541371093184
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper presents the technical solution developed by team CRUISE for the KDD Cup 2025 Meta Comprehensive RAG Benchmark for Multi-modal, Multi-turn (CRAG-MM) challenge. The challenge aims to address a critical limitation of modern Vision Language Models (VLMs): their propensity to hallucinate, especially when faced with egocentric imagery, long-tail entities, and complex, multi-hop questions. This issue is particularly problematic in real-world applications where users pose fact-seeking queries that demand high factual accuracy across diverse modalities. To tackle this, we propose a robust, multi-stage framework that prioritizes factual accuracy and truthfulness over completeness. Our solution integrates a lightweight query router for efficiency, a query-aware retrieval and summarization pipeline, a dual-pathway generation stage, and a post-hoc verification step. This conservative strategy is designed to minimize hallucinations, which incur a severe penalty in the competition's scoring metric. Our approach achieved 3rd place in Task 1, demonstrating the effectiveness of prioritizing answer reliability in complex multi-modal RAG systems. Our implementation is available at https://github.com/Breezelled/KDD-Cup-2025-Meta-CRAG-MM .
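The abstract names four stages: a query router, query-aware retrieval and summarization, dual-pathway generation, and post-hoc verification. Below is a minimal sketch of how such a pipeline could be wired together; every function body, name, and heuristic is an illustrative assumption, not the team's actual code (their repository, linked above, contains the real implementation).

```python
# Hypothetical sketch of the four-stage pipeline described in the abstract.
# All routing rules, stubs, and names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Query:
    text: str
    image: bytes | None = None

def route(query: Query) -> str:
    """Stage 1: lightweight router -- decide whether retrieval is needed."""
    needs_facts = any(w in query.text.lower() for w in ("who", "when", "where", "price"))
    return "rag" if needs_facts else "direct"

def retrieve_and_summarize(query: Query) -> str:
    """Stage 2: query-aware retrieval, then summarize the evidence (stubbed)."""
    docs = ["evidence snippet 1", "evidence snippet 2"]  # stand-in for real search
    return " ".join(docs)

def generate(query: Query, evidence: str | None) -> list[str]:
    """Stage 3: dual-pathway generation -- one answer grounded in retrieved
    evidence, one from the VLM's parametric knowledge (both stubbed here)."""
    grounded = f"answer based on: {evidence}" if evidence else "n/a"
    parametric = "answer from model knowledge"
    return [grounded, parametric]

def verify(query: Query, candidates: list[str]) -> str:
    """Stage 4: post-hoc verification -- abstain unless a candidate passes.
    Abstaining ("I don't know") avoids the heavy hallucination penalty."""
    passed = [c for c in candidates if c != "n/a"]  # stand-in for a real checker
    return passed[0] if passed else "I don't know"

def answer(query: Query) -> str:
    evidence = retrieve_and_summarize(query) if route(query) == "rag" else None
    return verify(query, generate(query, evidence))

print(answer(Query("When was this landmark built?")))
```

The conservative design shows up in `verify`: the default outcome is abstention, which matches the abstract's stated preference for truthfulness over completeness.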
Related papers
- QA-Dragon: Query-Aware Dynamic RAG System for Knowledge-Intensive Visual Question Answering [27.567923098020586]
We propose QA-Dragon, a Query-Aware Dynamic RAG System for Knowledge-Intensive VQA. By orchestrating both text and image search agents in a hybrid setup, our system supports multimodal, multi-turn, and multi-hop reasoning. We evaluate our framework on the Meta CRAG-MM Challenge at KDD Cup 2025.
arXiv Detail & Related papers (2025-08-07T09:32:49Z)
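The QA-Dragon summary above describes orchestrating text and image search agents in a hybrid setup. A minimal sketch of that routing pattern, assuming stub search functions (none of these names come from the paper):

```python
# Illustrative only: route a VQA query to text search, image search, or both.
def text_search(question: str) -> list[str]:
    return [f"web passage about: {question}"]           # stand-in for a search API

def image_search(image_id: str) -> list[str]:
    return [f"visually similar entity for {image_id}"]  # stand-in for an image index

def hybrid_route(question: str, image_id: str | None) -> list[str]:
    evidence: list[str] = []
    if image_id is not None:                  # image-dependent question
        evidence += image_search(image_id)
    if any(w in question.lower() for w in ("latest", "price", "when", "who")):
        evidence += text_search(question)     # knowledge-intensive -> web search
    return evidence

print(hybrid_route("Who designed this building?", image_id="img_001"))
```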
- Solution for Meta KDD Cup'25: A Comprehensive Three-Step Framework for Vision Question Answering [7.481274094559558]
This paper describes the BlackPearl team's solutions to all tasks in Meta KDD Cup'25. We use a single model for each task, with key methods including data augmentation, RAG, reranking, and fine-tuning. Our solution achieves automatic-evaluation rankings of 3rd, 3rd, and 1st on the three tasks, and wins second place in Task 3 after human evaluation.
arXiv Detail & Related papers (2025-07-29T06:07:59Z)
- RAMA: Retrieval-Augmented Multi-Agent Framework for Misinformation Detection in Multimodal Fact-Checking [15.160356035522609]
RAMA is a novel retrieval-augmented multi-agent framework designed for verifying multimedia misinformation. RAMA incorporates three core innovations: (1) strategic query formulation that transforms multimodal claims into precise web search queries; (2) cross-verification evidence aggregation from diverse, authoritative sources; and (3) a multi-agent ensemble architecture.
arXiv Detail & Related papers (2025-07-12T07:46:51Z)
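RAMA's three listed components suggest a straightforward pipeline shape: formulate queries, aggregate evidence, then ensemble agent verdicts. A toy sketch under assumed interfaces; the majority-vote rule and all function bodies are guesses for illustration:

```python
# Hypothetical RAMA-style flow: formulate queries, gather evidence, ensemble verdicts.
from collections import Counter

def formulate_queries(claim: str) -> list[str]:
    return [claim, f"{claim} fact check"]                   # (1) query formulation

def gather_evidence(queries: list[str]) -> list[str]:
    return [f"source snippet for '{q}'" for q in queries]   # (2) cross-source aggregation

def agent_verdict(agent_id: int, claim: str, evidence: list[str]) -> str:
    return "supported" if evidence else "refuted"           # stand-in for one LLM agent

def verify_claim(claim: str, n_agents: int = 3) -> str:
    evidence = gather_evidence(formulate_queries(claim))
    votes = Counter(agent_verdict(i, claim, evidence) for i in range(n_agents))
    return votes.most_common(1)[0][0]                       # (3) ensemble majority vote

print(verify_claim("The photo shows the 2024 eclipse."))
```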
- ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models [67.75439511654078]
Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. They face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. We propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment.
arXiv Detail & Related papers (2025-07-01T16:01:08Z)
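ONLY's summary specifies only that a single query and a one-layer intervention happen at decoding time. The sketch below shows the general shape of a training-free, single-layer activation edit; the additive update used here is an assumption for illustration, not the paper's actual rule:

```python
# Schematic of a one-layer, training-free decoding intervention. The update
# rule, direction vector, and scaling are illustrative assumptions.
import numpy as np

def one_layer_intervention(hidden: np.ndarray, direction: np.ndarray,
                           alpha: float = 0.1) -> np.ndarray:
    """Nudge a single layer's hidden states along a precomputed direction
    that (by assumption) favors visually grounded tokens."""
    return hidden + alpha * direction

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))       # 4 tokens x 8-dim toy hidden states
direction = rng.normal(size=(8,))      # assumed "groundedness" direction
patched = one_layer_intervention(hidden, direction)
print(np.abs(patched - hidden).max())  # small, single-layer edit at decode time
```

Because the edit touches one layer and needs no extra forward passes beyond the single query, it stays cheap enough for the real-time deployment the summary emphasizes.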
- MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. AVHaystacks is an audio-visual benchmark comprising 3,100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores, respectively, in the QA task on our proposed AVHaystacks benchmark.
arXiv Detail & Related papers (2025-06-08T06:34:29Z)
- MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning [9.647162327984638]
Table-based question answering requires complex reasoning capabilities that current LLMs struggle to achieve. We propose MAPLE, a novel framework that mimics human problem-solving through specialized cognitive agents working in a feedback-driven loop.
arXiv Detail & Related papers (2025-06-06T07:21:28Z)
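MAPLE's summary describes specialized cognitive agents in a feedback-driven loop. A toy solver/checker loop conveys that control flow; the agent roles, prompts, and stopping rule here are assumptions rather than the paper's design:

```python
# Toy feedback loop: a solver proposes, a checker critiques, and the loop
# repeats until the check passes or the round budget is exhausted.
def solver(question: str, feedback: str | None) -> str:
    return f"answer({question}, hint={feedback})"   # stand-in for an LLM call

def checker(question: str, answer: str) -> str | None:
    # Fail the first, feedback-free attempt; accept a revised one.
    return "re-check the table aggregation" if "hint=None" in answer else None

def feedback_loop(question: str, max_rounds: int = 3) -> str:
    answer, feedback = "", None
    for _ in range(max_rounds):
        answer = solver(question, feedback)
        feedback = checker(question, answer)
        if feedback is None:                         # verification passed
            return answer
    return answer                                    # best effort after budget

print(feedback_loop("Total revenue in 2023?"))
```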
- Reinforced Informativeness Optimization for Long-Form Retrieval-Augmented Generation [77.10390725623125]
Long-form question answering (LFQA) presents unique challenges for large language models. RioRAG is a novel reinforcement learning framework that advances long-form RAG through reinforced informativeness optimization.
arXiv Detail & Related papers (2025-05-27T07:34:41Z)
- Rethinking Information Synthesis in Multimodal Question Answering: A Multi-Agent Perspective [42.832839189236694]
We propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.
arXiv Detail & Related papers (2025-05-27T07:23:38Z)
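The MAMMQA summary specifies a division of labor: two VLM agents read the modalities and a text-only LLM synthesizes the answer. A stub sketch of that topology, with all calls and prompts imagined:

```python
# Sketch of the 2-VLM + 1-LLM division of labor named in the summary above.
def vlm_agent(modality: str, content: str, question: str) -> str:
    return f"[{modality} insight] {content[:20]}..."      # stand-in for a VLM call

def llm_aggregator(question: str, insights: list[str]) -> str:
    return f"final answer synthesized from {len(insights)} insights"  # LLM stub

def answer_multimodal(question: str, image: str, table: str) -> str:
    insights = [vlm_agent("image", image, question),
                vlm_agent("table", table, question)]
    return llm_aggregator(question, insights)

print(answer_multimodal("Which product sold most?",
                        image="bar chart pixels", table="sku,qty,..."))
```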
- Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning [27.867369806400834]
We propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework. RAGPT comprises three modules: (I) the multi-channel retriever, (II) the missing modality generator, and (III) the context-aware prompter. Experiments conducted on three real-world datasets show that RAGPT consistently outperforms all competitive baselines in handling incomplete modality problems.
arXiv Detail & Related papers (2025-01-02T07:39:48Z)
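RAGPT's three modules map naturally onto three functions. The sketch below stubs each one; only the module names come from the abstract, and the behavior shown is an assumption:

```python
# RAGPT's three-module shape, stubbed end to end for illustration.
def multi_channel_retrieve(sample: dict) -> list[dict]:
    # (I) retrieve similar complete samples across modality channels
    return [{"text": "similar sample text", "image": "similar sample image"}]

def generate_missing_modality(sample: dict, neighbors: list[dict]) -> dict:
    # (II) fill absent modalities using retrieved neighbors
    filled = dict(sample)
    for key in ("text", "image"):
        if filled.get(key) is None:
            filled[key] = neighbors[0][key]
    return filled

def context_aware_prompt(sample: dict, neighbors: list[dict]) -> str:
    # (III) build a dynamic prompt conditioned on the retrieved context
    return f"prompt conditioned on {len(neighbors)} retrieved neighbors"

incomplete = {"text": "a dog playing", "image": None}   # image modality missing
neighbors = multi_channel_retrieve(incomplete)
completed = generate_missing_modality(incomplete, neighbors)
print(completed, context_aware_prompt(completed, neighbors))
```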
- Progressive Multimodal Reasoning via Active Retrieval [64.74746997923967]
Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs). We propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs. We show that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
arXiv Detail & Related papers (2024-12-19T13:25:39Z)
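AR-MCTS pairs tree-style exploration of reasoning steps with retrieval at each expansion. The caricature below uses a greedy search rather than full MCTS and stubs both the retriever and the step verifier, so it only illustrates where active retrieval plugs in:

```python
# Minimal caricature of "active retrieval inside step-wise search": each
# candidate reasoning step retrieves fresh evidence before being scored.
import random

def retrieve(step: str) -> str:
    return f"evidence for '{step}'"                     # stand-in retriever

def score(step: str, evidence: str) -> float:
    return random.random()                              # stand-in step verifier

def expand(path: list[str]) -> list[str]:
    return [f"step{len(path)}a", f"step{len(path)}b"]   # candidate next steps

def greedy_step_search(depth: int = 3) -> list[str]:
    random.seed(0)
    path: list[str] = []
    for _ in range(depth):
        candidates = expand(path)
        best = max(candidates, key=lambda s: score(s, retrieve(s)))
        path.append(best)                               # keep the best-verified step
    return path

print(greedy_step_search())
```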
- Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent [92.5712549836791]
Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs). We propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch.
arXiv Detail & Related papers (2024-11-05T09:27:21Z)
- RADAR: Robust Two-stage Modality-incomplete Industrial Anomaly Detection [61.71770293720491]
We propose a novel two-stage Robust modAlity-incomplete fusing and Detecting frAmewoRk, abbreviated as RADAR.
Our bootstrapping philosophy is to enhance two stages in MIIAD, improving the robustness of the Multimodal Transformer.
Our experimental results demonstrate that the proposed RADAR significantly surpasses conventional MIAD methods in terms of effectiveness and robustness.
arXiv Detail & Related papers (2024-10-02T16:47:55Z)