CARE-RAG - Clinical Assessment and Reasoning in RAG
- URL: http://arxiv.org/abs/2511.15994v1
- Date: Thu, 20 Nov 2025 02:44:55 GMT
- Title: CARE-RAG - Clinical Assessment and Reasoning in RAG
- Authors: Deepthi Potluri, Aby Mammen Mathew, Jeffrey B DeWitt, Alexander L. Rasgon, Yide Hao, Junyuan Hong, Ying Ding,
- Abstract summary: We study the gap between retrieval and reasoning in large language models (LLMs)<n>We propose an evaluation framework that measures accuracy, consistency, and fidelity of reasoning.
- Score: 43.1450755645803
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Access to the right evidence does not guarantee that large language models (LLMs) will reason with it correctly. This gap between retrieval and reasoning is especially concerning in clinical settings, where outputs must align with structured protocols. We study this gap using Written Exposure Therapy (WET) guidelines as a testbed. In evaluating model responses to curated clinician-vetted questions, we find that errors persist even when authoritative passages are provided. To address this, we propose an evaluation framework that measures accuracy, consistency, and fidelity of reasoning. Our results highlight both the potential and the risks: retrieval-augmented generation (RAG) can constrain outputs, but safe deployment requires assessing reasoning as rigorously as retrieval.
Related papers
- Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification [60.18369393468405]
Existing verifiers usually underperform owing to a lack of domain knowledge and limited calibration.<n>GLEAN compiles expert-curated protocols into trajectory-informed, well-calibrated correctness signals.<n>We empirically validate GLEAN with agentic clinical diagnosis across three diseases from the MIMIC-IV dataset.
arXiv Detail & Related papers (2026-03-03T09:36:43Z) - How Well Do Multimodal Models Reason on ECG Signals? [36.281141199783825]
We introduce a reproducible framework for evaluating reasoning in ECG signals.<n>We employ an agentic framework that generates code to empirically verify the temporal structures described in the reasoning trace.<n>This dual-verification method enables the scalable assessment of "true" reasoning capabilities.
arXiv Detail & Related papers (2026-02-27T21:04:12Z) - Closing Reasoning Gaps in Clinical Agents with Differential Reasoning Learning [16.144050164828794]
We propose Differential Reasoning Learning (DRL), a framework that improves clinical agents by learning from reasoning discrepancies.<n>DRL extracts reasoning graphs as directed acyclic graphs (DAGs) and performs a clinically weighted graph edit distance (GED)-based discrepancy analysis.<n>At inference, we retrieve top-$k$ instructions to augment the agent prompt and patch likely logic gaps.
arXiv Detail & Related papers (2026-02-10T16:29:32Z) - AgentsEval: Clinically Faithful Evaluation of Medical Imaging Reports via Multi-Agent Reasoning [73.50200033931148]
We introduce AgentsEval, a multi-agent stream reasoning framework that emulates the collaborative diagnostic workflow of radiologists.<n>By dividing the evaluation process into interpretable steps including criteria definition, evidence extraction, alignment, and consistency scoring, AgentsEval provides explicit reasoning traces and structured clinical feedback.<n> Experimental results demonstrate that AgentsEval delivers clinically aligned, semantically faithful, and interpretable evaluations that remain robust under paraphrastic, semantic, and stylistic perturbations.
arXiv Detail & Related papers (2026-01-23T11:59:13Z) - Evaluating Large Language Models for Evidence-Based Clinical Question Answering [4.101088122511548]
Large Language Models (LLMs) have demonstrated substantial progress in biomedical and clinical applications.<n>We curate a benchmark drawing from Cochrane systematic reviews and clinical guidelines.<n>We observe consistent performance patterns across sources and clinical domains.
arXiv Detail & Related papers (2025-09-13T15:03:34Z) - Evaluating the Robustness of Retrieval-Augmented Generation to Adversarial Evidence in the Health Domain [8.094811345546118]
Retrieval augmented generation (RAG) systems provide a method for factually grounding the responses of a Large Language Model (LLM) by providing retrieved evidence, or context, as support.<n>This design introduces a critical vulnerability: LLMs may absorb and reproduce misinformation present in retrieved evidence.<n>This problem is magnified if retrieved evidence contains adversarial material explicitly intended to promulgate misinformation.
arXiv Detail & Related papers (2025-09-04T00:45:58Z) - Clinically Grounded Agent-based Report Evaluation: An Interpretable Metric for Radiology Report Generation [32.410641778559544]
ICARE (Interpretable and Clinically-grounded Agent-based Report Evaluation) is an interpretable evaluation framework.<n>Two agents, each with either the ground-truth or generated report, generate clinically meaningful questions and quiz each other.<n>By linking scores to question-answer pairs, ICARE enables transparent, and interpretable assessment.
arXiv Detail & Related papers (2025-08-04T18:28:03Z) - Controlled Retrieval-augmented Context Evaluation for Long-form RAG [58.14561461943611]
Retrieval-augmented generation (RAG) enhances large language models by incorporating context retrieved from external knowledge sources.<n>We argue that providing a comprehensive retrieval-augmented context is important for long-form RAG tasks like report generation.<n>We introduce CRUX, a framework designed to directly assess retrieval-augmented contexts.
arXiv Detail & Related papers (2025-06-24T23:17:48Z) - Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs)<n>RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions.<n>We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA)<n>We introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
arXiv Detail & Related papers (2025-04-21T04:56:47Z) - Medical Reasoning in LLMs: An In-Depth Analysis of DeepSeek R1 [0.0]
This study assesses DeepSeek R1's medical reasoning against expert patterns using 100 MedQA clinical cases.<n>The model achieved 93% diagnostic accuracy, demonstrating systematic clinical judgment through differential diagnosis, guideline-based treatment selection, and integration of patient-specific factors.<n>Error analysis of seven incorrect cases revealed persistent limitations: anchoring bias, challenges reconciling conflicting data, insufficient exploration of alternatives, overthinking, knowledge gaps, and premature prioritization of definitive treatment over intermediate care.
arXiv Detail & Related papers (2025-03-27T09:18:08Z) - Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework [77.45983464131977]
We focus on how likely it is that a RAG model's prediction is incorrect, resulting in uncontrollable risks in real-world applications.<n>Our research identifies two critical latent factors affecting RAG's confidence in its predictions.<n>We develop a counterfactual prompting framework that induces the models to alter these factors and analyzes the effect on their answers.
arXiv Detail & Related papers (2024-09-24T14:52:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.