When Evidence Contradicts: Toward Safer Retrieval-Augmented Generation in Healthcare
- URL: http://arxiv.org/abs/2511.06668v1
- Date: Mon, 10 Nov 2025 03:27:54 GMT
- Title: When Evidence Contradicts: Toward Safer Retrieval-Augmented Generation in Healthcare
- Authors: Saeedeh Javadi, Sara Mirabi, Manan Gangar, Bahadorreza Ofoghi
- Abstract summary: This work investigates the performance of five large language models (LLMs) in generating responses to medicine-related queries. Our findings show that contradictions between highly similar abstracts do, in fact, degrade performance, leading to inconsistencies and reduced factual accuracy in model answers.
- Score: 0.05249805590164902
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In high-stakes information domains such as healthcare, where large language models (LLMs) can produce hallucinations or misinformation, retrieval-augmented generation (RAG) has been proposed as a mitigation strategy, grounding model outputs in external, domain-specific documents. Yet, this approach can introduce errors when source documents contain outdated or contradictory information. This work investigates the performance of five LLMs in generating RAG-based responses to medicine-related queries. Our contributions are three-fold: i) the creation of a benchmark dataset using consumer medicine information documents from the Australian Therapeutic Goods Administration (TGA), where headings are repurposed as natural language questions, ii) the retrieval of PubMed abstracts using TGA headings, stratified across multiple publication years, to enable controlled temporal evaluation of outdated evidence, and iii) a comparative analysis of the frequency and impact of outdated or contradictory content on model-generated responses, assessing how LLMs integrate and reconcile temporally inconsistent information. Our findings show that contradictions between highly similar abstracts do, in fact, degrade performance, leading to inconsistencies and reduced factual accuracy in model answers. These results highlight that retrieval similarity alone is insufficient for reliable medical RAG and underscore the need for contradiction-aware filtering strategies to ensure trustworthy responses in high-stakes domains.
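The contradiction-aware filtering the abstract calls for can be prototyped as a pre-generation step: score every pair of retrieved abstracts with an off-the-shelf NLI cross-encoder and drop the older member of any pair flagged as contradictory. This is a minimal sketch, not the paper's pipeline; the model choice, the `text`/`year` record fields, and the keep-the-newer heuristic are all our assumptions.

```python
# Minimal sketch of contradiction-aware filtering before generation.
# Assumptions: each abstract is a dict with "text" and "year"; the NLI
# cross-encoder's outputs are ordered (contradiction, entailment, neutral).
from itertools import combinations
from sentence_transformers import CrossEncoder

nli = CrossEncoder("cross-encoder/nli-deberta-v3-base")

def filter_contradictions(abstracts, threshold=0.8):
    """Drop the older abstract of any pair the NLI model flags as contradictory."""
    pairs = list(combinations(range(len(abstracts)), 2))
    if not pairs:
        return abstracts
    probs = nli.predict(
        [(abstracts[i]["text"], abstracts[j]["text"]) for i, j in pairs],
        apply_softmax=True,
    )
    dropped = set()
    for (i, j), p in zip(pairs, probs):
        if p[0] > threshold:  # contradiction probability
            # Keep the more recent abstract; drop the outdated one.
            older = i if abstracts[i]["year"] < abstracts[j]["year"] else j
            dropped.add(older)
    return [a for k, a in enumerate(abstracts) if k not in dropped]
```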
Related papers
- MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering [21.855579328680246]
We propose MedTrust-Guided Iterative RAG, a framework designed to enhance factual consistency and mitigate hallucinations in medical QA. First, it enforces citation-aware reasoning by requiring all generated content to be explicitly grounded in retrieved medical documents. Second, it employs an iterative retrieval-verification process, where a verification agent assesses evidence adequacy.
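Read schematically, that loop might look like the sketch below, where `retrieve`, `verify_adequacy`, and `generate_with_citations` are placeholders for the paper's retriever, verification agent, and citation-grounded generator; the widening-k retry policy is our guess at how the iteration could work.

```python
# Schematic sketch of an iterative retrieval-verification loop; the three
# callables stand in for the retriever, verification agent, and
# citation-grounded generator, and the widening-k retry is an assumption.
from typing import Callable, List

def iterative_rag(query: str,
                  retrieve: Callable[[str, int], List[str]],
                  verify_adequacy: Callable[[str, List[str]], bool],
                  generate_with_citations: Callable[[str, List[str]], str],
                  max_rounds: int = 3) -> str:
    evidence: List[str] = []
    for round_k in range(max_rounds):
        # Widen retrieval each round until the verifier deems evidence adequate.
        evidence = retrieve(query, 5 * (round_k + 1))
        if verify_adequacy(query, evidence):
            break
    # Every claim in the final answer must cite the gathered evidence.
    return generate_with_citations(query, evidence)
```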
arXiv Detail & Related papers (2025-10-16T07:59:11Z) - Evaluating the Robustness of Retrieval-Augmented Generation to Adversarial Evidence in the Health Domain [8.094811345546118]
Retrieval-augmented generation (RAG) systems provide a method for factually grounding the responses of a Large Language Model (LLM) by providing retrieved evidence, or context, as support. This design introduces a critical vulnerability: LLMs may absorb and reproduce misinformation present in retrieved evidence. This problem is magnified if retrieved evidence contains adversarial material explicitly intended to promulgate misinformation.
arXiv Detail & Related papers (2025-09-04T00:45:58Z) - Controlled Retrieval-augmented Context Evaluation for Long-form RAG [58.14561461943611]
Retrieval-augmented generation (RAG) enhances large language models by incorporating context retrieved from external knowledge sources. We argue that providing a comprehensive retrieval-augmented context is important for long-form RAG tasks like report generation. We introduce CRUX, a framework designed to directly assess retrieval-augmented contexts.
arXiv Detail & Related papers (2025-06-24T23:17:48Z) - Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation [108.13261761812517]
We introduce FRANQ (Faithfulness-based Retrieval Augmented UNcertainty Quantification), a novel method for hallucination detection in RAG outputs. We present a new long-form Question Answering (QA) dataset annotated for both factuality and faithfulness.
arXiv Detail & Related papers (2025-05-27T11:56:59Z) - Retrieval-Augmented Generation with Conflicting Evidence [57.66282463340297]
Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses. In practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources. We propose RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios of conflicting evidence for a user query.
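A RAMDocs-style record presumably ties one query to documents that back different answers, some of them deliberately wrong. The field names and the example below are invented for illustration only.

```python
# Hypothetical shape of one conflicting-evidence example (invented fields).
example = {
    "query": "What is the recommended first-line treatment for condition X?",
    "documents": [
        {"text": "Guidelines recommend drug A as first-line therapy.",
         "supports": "drug A", "label": "valid"},          # legitimate answer
        {"text": "Recent trials establish drug B as first-line therapy.",
         "supports": "drug B", "label": "valid"},          # genuine ambiguity
        {"text": "Drug C cures condition X outright.",
         "supports": "drug C", "label": "misinformation"}, # injected noise
    ],
}
```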
arXiv Detail & Related papers (2025-04-17T16:46:11Z) - Perplexity Trap: PLM-Based Retrievers Overrate Low Perplexity Documents [64.43980129731587]
We propose a causal-inspired inference-time debiasing method called Causal Diagnosis and Correction (CDC). CDC first diagnoses the bias effect of perplexity and then separates the bias effect from the overall relevance score. Experimental results across three domains demonstrate its superior debiasing effectiveness.
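The diagnose-then-correct idea can be caricatured in a few lines: estimate how much of the retriever's score is predictable from document perplexity alone, then subtract that component. The linear fit below is an illustrative simplification of CDC, not its actual causal machinery.

```python
# Illustrative simplification of perplexity debiasing: fit the relevance
# score on log-perplexity, treat the fitted part as bias, keep the residual.
import numpy as np

def debias_scores(relevance: np.ndarray, perplexity: np.ndarray) -> np.ndarray:
    log_ppl = np.log(perplexity)
    # "Diagnosis": least-squares estimate of the perplexity-driven component.
    slope, intercept = np.polyfit(log_ppl, relevance, deg=1)
    bias = slope * log_ppl + intercept
    # "Correction": remove the bias, recenter on the original mean score.
    return relevance - bias + relevance.mean()

scores = np.array([0.92, 0.88, 0.75, 0.60])
ppl = np.array([12.0, 15.0, 40.0, 55.0])
print(debias_scores(scores, ppl))  # fluent (low-PPL) docs lose their free boost
```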
arXiv Detail & Related papers (2025-03-11T17:59:00Z) - Enhancing Health Information Retrieval with RAG by Prioritizing Topical Relevance and Factual Accuracy [0.7673339435080445]
This paper introduces a solution driven by Retrieval-Augmented Generation (RAG) to enhance the retrieval of health-related documents grounded in scientific evidence. In particular, we propose a three-stage model: in the first stage, the user's query is employed to retrieve topically relevant passages, with associated references, from a knowledge base constituted by scientific literature. In the second stage, these passages, alongside the initial query, are processed by LLMs to generate a contextually relevant rich text (GenText). In the last stage, the documents to be retrieved are evaluated and ranked in terms of both topical relevance and factual accuracy.
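A skeletal version of the three-stage flow might look as follows; the function names, the GenText prompt, the lexical-overlap consistency proxy, and the ranking weights are all assumptions made for the sketch.

```python
# Schematic three-stage pipeline; names, prompt, and weights are placeholders.
from typing import Callable, List

def three_stage_rag(query: str,
                    retrieve_passages: Callable[[str], List[dict]],
                    llm: Callable[[str], str],
                    rank_weight: float = 0.5) -> List[dict]:
    # Stage 1: topical retrieval; each passage dict is assumed to carry
    # "text" and "topical_score" keys.
    passages = retrieve_passages(query)
    # Stage 2: fuse query and passages into an evidence-grounded rich text (GenText).
    context = "\n".join(p["text"] for p in passages)
    gentext = llm(f"Question: {query}\nEvidence:\n{context}\nWrite a grounded summary:")
    # Stage 3: re-rank by a mix of topical relevance and agreement with GenText
    # (lexical overlap here is a crude proxy for factual accuracy).
    for p in passages:
        p["final_score"] = (rank_weight * p["topical_score"]
                            + (1 - rank_weight) * overlap(p["text"], gentext))
    return sorted(passages, key=lambda p: p["final_score"], reverse=True)

def overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets, a stand-in for a consistency scorer."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))
```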
arXiv Detail & Related papers (2025-02-07T05:19:13Z) - Understanding the Impact of Confidence in Retrieval Augmented Generation: A Case Study in the Medical Domain [26.72234494972736]
Retrieval Augmented Generation (RAG) complements the knowledge of Large Language Models (LLMs) by leveraging external information to enhance response accuracy for queries. Our study focuses on the impact of RAG, specifically examining whether RAG improves the confidence of LLM outputs in the medical domain. We evaluate confidence by treating the model's predicted probability as its output and calculating several evaluation metrics, including calibration error, entropy, the best (top) probability, and accuracy.
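The listed metrics are straightforward to compute from per-answer top probabilities and correctness labels; a small sketch follows (the binning scheme and the binary entropy view are our choices, not necessarily the paper's):

```python
# Expected calibration error, entropy, mean top probability, and accuracy
# from per-answer confidences; binning and binary entropy are our choices.
import numpy as np

def confidence_metrics(probs, correct, n_bins=10):
    probs = np.clip(np.asarray(probs, dtype=float), 1e-6, 1 - 1e-6)
    correct = np.asarray(correct, dtype=float)
    # Binary-view predictive entropy of the top answer probability.
    entropy = -(probs * np.log(probs) + (1 - probs) * np.log(1 - probs))
    # Expected calibration error over equal-width confidence bins.
    ece, bins = 0.0, np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - correct[mask].mean())
    return {"ece": ece, "mean_entropy": entropy.mean(),
            "mean_best_prob": probs.mean(), "accuracy": correct.mean()}

print(confidence_metrics([0.9, 0.8, 0.7, 0.95], [1, 1, 0, 1]))
```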
arXiv Detail & Related papers (2024-12-29T00:58:33Z) - Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs).
We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets.
Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents.
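One way to reproduce this kind of stress test is to dilute the gold retrieved context with noise documents at a controlled ratio and track how answer accuracy degrades. The helper below is an illustrative probe, not MedRGB's actual construction.

```python
# Illustrative robustness probe: mix noise documents into the retrieved
# context at a fixed ratio; the stub documents and ratio are assumptions.
import random

def build_noisy_context(gold_docs, noise_docs, noise_ratio=0.5, seed=0):
    """Return a context where a fraction of the gold documents is replaced by noise."""
    rng = random.Random(seed)
    n_noise = int(len(gold_docs) * noise_ratio)
    kept = gold_docs[: len(gold_docs) - n_noise]
    context = kept + rng.sample(noise_docs, n_noise)
    rng.shuffle(context)
    return context

gold = ["Doc: metformin lowers blood glucose.", "Doc: a common side effect is GI upset."]
noise = ["Doc: unrelated trial on migraine.", "Doc: metformin cures insomnia (false)."]
print(build_noisy_context(gold, noise, noise_ratio=0.5))
```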
arXiv Detail & Related papers (2024-11-14T06:19:18Z) - RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models [35.60385437194243]
Current Medical Large Vision Language Models (Med-LVLMs) frequently encounter factual issues.
RAG, which utilizes external knowledge, can improve the factual accuracy of these models but introduces two major challenges.
We propose RULE, which consists of two components. First, we introduce a provably effective strategy for controlling factuality risk through the selection of retrieved contexts.
Second, based on samples where over-reliance on retrieved contexts led to errors, we curate a preference dataset to fine-tune the model.
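The risk-controlled context selection could plausibly be calibrated as follows: measure the factual-error rate on a held-out set as a function of the number of retrieved contexts k, then deploy the largest k whose error stays under a target risk. This is a hedged sketch with made-up numbers, not RULE's formal guarantee.

```python
# Hedged sketch of risk-controlled context selection: choose the largest k
# whose calibration-set error rate stays within the target risk budget.
def select_k(error_rate_by_k: dict, target_risk: float = 0.2) -> int:
    """Return the largest admissible k, or the smallest k if none qualify."""
    admissible = [k for k, err in error_rate_by_k.items() if err <= target_risk]
    return max(admissible) if admissible else min(error_rate_by_k)

# Calibration-set factual-error rates by top-k contexts (illustrative numbers).
calib = {1: 0.25, 2: 0.18, 4: 0.15, 8: 0.22}
print(select_k(calib, target_risk=0.2))  # -> 4
```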
arXiv Detail & Related papers (2024-07-06T16:45:07Z) - AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation [57.8363998797433]
We propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs).
Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing for coherent factually inconsistent summaries to be generated with high error-type coverage.
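As a toy stand-in for the AMR-driven pipeline (the graph parsing and editing steps are elided), a single controlled substitution already yields a negative example with a known error type:

```python
# Toy stand-in for AMR-driven negative generation: one controlled substitution
# injects a known factual inconsistency (the AMR parse/edit step is elided).
def make_negative(summary: str, original: str, replacement: str):
    """Return a perturbed summary plus its injected error type."""
    return summary.replace(original, replacement), "date_error"

pos = "The trial enrolled 300 patients and concluded in 2015."
neg, error_type = make_negative(pos, "2015", "2005")
print(neg, "|", error_type)
```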
arXiv Detail & Related papers (2023-11-16T02:56:29Z)