RAG-X: Systematic Diagnosis of Retrieval-Augmented Generation for Medical Question Answering
- URL: http://arxiv.org/abs/2603.03541v1
- Date: Tue, 03 Mar 2026 21:55:42 GMT
- Title: RAG-X: Systematic Diagnosis of Retrieval-Augmented Generation for Medical Question Answering
- Authors: Aswini Sivakumar, Vijayan Sugumaran, Yao Qiang,
- Abstract summary: We propose RAG-X, a diagnostic framework that evaluates the retriever and generator independently across a triad of QA tasks.<n>By surfacing hidden failure modes, RAG-X offers the diagnostic transparency needed for safe and verifiable clinical RAG systems.
- Score: 3.5512614044131596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated question-answering (QA) systems increasingly rely on retrieval-augmented generation (RAG) to ground large language models (LLMs) in authoritative medical knowledge, ensuring clinical accuracy and patient safety in Artificial Intelligence (AI) applications for healthcare. Despite progress in RAG evaluation, current benchmarks focus only on simple multiple-choice QA tasks and employ metrics that poorly capture the semantic precision required for complex QA tasks. These approaches fail to diagnose whether an error stems from faulty retrieval or flawed generation, limiting developers from performing targeted improvement. To address this gap, we propose RAG-X, a diagnostic framework that evaluates the retriever and generator independently across a triad of QA tasks: information extraction, short-answer generation, and multiple-choice question (MCQ) answering. RAG-X introduces Context Utilization Efficiency (CUE) metrics to disaggregate system success into interpretable quadrants, isolating verified grounding from deceptive accuracy. Our experiments reveal an ``Accuracy Fallacy", where a 14\% gap separates perceived system success from evidence-based grounding. By surfacing hidden failure modes, RAG-X offers the diagnostic transparency needed for safe and verifiable clinical RAG systems.
Related papers
- MRAG-Suite: A Diagnostic Evaluation Platform for Visual Retrieval-Augmented Generation [1.9036571490366498]
We propose MRAG-Suite, a diagnostic evaluation platform integrating diverse multimodal benchmarks.<n>We introduce difficulty-based and ambiguity-aware filtering strategies, alongside MM-RAGChecker, a claim-level diagnostic tool.
arXiv Detail & Related papers (2025-09-29T03:55:28Z) - End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning [52.12425911708585]
Deep-DxSearch is an agentic RAG system trained end-to-end with reinforcement learning (RL)<n>In Deep-DxSearch, we first construct a large-scale medical retrieval corpus comprising patient records and reliable medical knowledge sources.<n> Experiments demonstrate that our end-to-end RL training framework consistently outperforms prompt-engineering and training-free RAG approaches.
arXiv Detail & Related papers (2025-08-21T17:42:47Z) - A Chain of Diagnosis Framework for Accurate and Explainable Radiology Report Generation [4.61181046331792]
We propose a framework named chain of diagnosis (CoD), which maintains a chain of diagnostic process for clinically accurate and explainable RRG.<n>To enhance explainability, a diagnosis grounding module is designed to match QA diagnoses and generated sentences, where the diagnoses act as a reference.<n>Our efforts lead to 1) an omni-labeled RRG dataset with QA pairs and lesion boxes; 2) a evaluation tool for assessing the accuracy of reports in describing lesion location and severity; 3) extensive experiments to demonstrate the effectiveness of CoD.
arXiv Detail & Related papers (2025-08-13T07:32:28Z) - Hide and Seek with LLMs: An Adversarial Game for Sneaky Error Generation and Self-Improving Diagnosis [51.88592148135258]
We propose Hide and Seek Game (HSG), a dynamic adversarial framework for error generation and diagnosis.<n>HSG involves two adversarial roles: Sneaky, which "hides" by generating subtle, deceptive reasoning errors, and Diagnosis, which "seeks" to accurately detect them.<n> Experiments on several math reasoning tasks show that HSG significantly boosts error diagnosis, achieving 16.8%--31.4% higher accuracy than baselines like GPT-4o.
arXiv Detail & Related papers (2025-08-05T12:45:21Z) - A Multi-Agent System for Complex Reasoning in Radiology Visual Question Answering [3.3809462259925938]
Radiology visual question answering (RVQA) provides precise answers to questions about chest X-ray images.<n>Recent methods based on multimodal large language models (MLLMs) and retrieval-augmented generation (RAG) have shown promising progress in RVQA.<n>We introduce a multi-agent system (MAS) designed to support complex reasoning in RVQA.
arXiv Detail & Related papers (2025-08-04T19:09:52Z) - Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation [108.13261761812517]
We introduce FRANQ (Faithfulness-based Retrieval Augmented UNcertainty Quantification), a novel method for hallucination detection in RAG outputs.<n>We present a new long-form Question Answering (QA) dataset annotated for both factuality and faithfulness.
arXiv Detail & Related papers (2025-05-27T11:56:59Z) - Talk Before You Retrieve: Agent-Led Discussions for Better RAG in Medical QA [17.823588070044217]
We propose Discuss-RAG, a plug-and-play module designed to enhance the medical question answering system.<n>Our method introduces a summarizer agent that orchestrates a team of medical experts to emulate multi-turn brainstorming, thereby improving the relevance of retrieved content.<n> Experimental results on four benchmark medical QA datasets show that Discuss-RAG consistently outperforms MedRAG.
arXiv Detail & Related papers (2025-04-30T01:37:44Z) - Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs)
We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets.
Our experimental results reveals current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z) - MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [49.765466293296186]
Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools.<n>Med-LVLMs often suffer from factual hallucination, which can lead to incorrect diagnoses.<n>We propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs.
arXiv Detail & Related papers (2024-10-16T23:03:27Z) - RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation [61.14660526363607]
We propose a fine-grained evaluation framework, RAGChecker, that incorporates a suite of diagnostic metrics for both the retrieval and generation modules.
RAGChecker has significantly better correlations with human judgments than other evaluation metrics.
The metrics of RAGChecker can guide researchers and practitioners in developing more effective RAG systems.
arXiv Detail & Related papers (2024-08-15T10:20:54Z) - Detecting Spurious Correlations with Sanity Tests for Artificial
Intelligence Guided Radiology Systems [22.249702822013045]
A critical component to deploying AI in radiology is to gain confidence in a developed system's efficacy and safety.
The current gold standard approach is to conduct an analytical validation of performance on a generalization dataset.
We describe a series of sanity tests to identify when a system performs well on development data for the wrong reasons.
arXiv Detail & Related papers (2021-03-04T14:14:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.