MedCEG: Reinforcing Verifiable Medical Reasoning with Critical Evidence Graph
- URL: http://arxiv.org/abs/2512.13510v1
- Date: Mon, 15 Dec 2025 16:38:46 GMT
- Title: MedCEG: Reinforcing Verifiable Medical Reasoning with Critical Evidence Graph
- Authors: Linjie Mu, Yannian Gu, Zhongzhen Huang, Yakun Zhu, Shaoting Zhang, Xiaofan Zhang
- Abstract summary: We propose MedCEG, a framework that augments medical language models with clinically valid reasoning pathways. To guide the reasoning process, we introduce a Clinical Reasoning Procedure Reward. Experimental results show that MedCEG surpasses existing methods in performance while producing clinically valid reasoning chains.
- Score: 17.320322032287894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models with reasoning capabilities have demonstrated impressive performance across a wide range of domains. In clinical applications, a transparent, step-by-step reasoning process provides physicians with strong evidence to support decision-making. While reinforcement learning has effectively enhanced reasoning performance in medical contexts, the clinical reliability of these reasoning processes remains limited because their accuracy and validity are often overlooked during training. To address this gap, we propose MedCEG, a framework that augments medical language models with clinically valid reasoning pathways by explicitly supervising the reasoning process through a Critical Evidence Graph (CEG). We curate a dataset of challenging clinical cases and algorithmically construct a CEG for each sample to represent a high-quality verifiable reasoning pathway. To guide the reasoning process, we introduce a Clinical Reasoning Procedure Reward, which evaluates Node Coverage, Structural Correctness, and Chain Completeness, thereby providing a holistic assessment of reasoning quality. Experimental results show that MedCEG surpasses existing methods in performance while producing clinically valid reasoning chains, representing a solid advancement in reliable medical AI reasoning. The code and models are available at https://github.com/LinjieMu/MedCEG.
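The abstract's Clinical Reasoning Procedure Reward combines three measures over the Critical Evidence Graph (CEG): Node Coverage, Structural Correctness, and Chain Completeness. A minimal sketch of how such a reward could be computed, assuming the CEG is a set of evidence nodes plus directed edges and the model's reasoning trace is parsed into the same form; the function and variable names, the set-overlap definitions, and the equal weighting are illustrative assumptions, not the paper's exact formulation:

```python
def procedure_reward(ceg_nodes, ceg_edges, trace_nodes, trace_edges,
                     weights=(1 / 3, 1 / 3, 1 / 3)):
    """Hypothetical CEG-based process reward in [0, 1] (illustrative, not MedCEG's exact definition)."""
    ceg_nodes, ceg_edges = set(ceg_nodes), set(ceg_edges)
    trace_nodes, trace_edges = set(trace_nodes), set(trace_edges)

    # Node Coverage: fraction of reference evidence nodes the trace mentions.
    coverage = len(ceg_nodes & trace_nodes) / len(ceg_nodes) if ceg_nodes else 1.0

    # Structural Correctness: fraction of the trace's asserted edges that exist in the CEG.
    correctness = len(ceg_edges & trace_edges) / len(trace_edges) if trace_edges else 0.0

    # Chain Completeness: fraction of reference reasoning edges the trace recovers.
    completeness = len(ceg_edges & trace_edges) / len(ceg_edges) if ceg_edges else 1.0

    w_cov, w_cor, w_com = weights
    return w_cov * coverage + w_cor * correctness + w_com * completeness


# Toy case: a meningitis work-up with one evidence node and one edge missed.
ceg_nodes = {"fever", "neck stiffness", "CSF pleocytosis", "meningitis"}
ceg_edges = {("fever", "meningitis"), ("neck stiffness", "meningitis"),
             ("CSF pleocytosis", "meningitis")}
trace_nodes = {"fever", "neck stiffness", "meningitis"}
trace_edges = {("fever", "meningitis"), ("neck stiffness", "meningitis")}

r = procedure_reward(ceg_nodes, ceg_edges, trace_nodes, trace_edges)
```

In this toy case coverage is 3/4, correctness 2/2, and completeness 2/3, so the scalar reward penalizes the skipped CSF evidence even though every asserted edge is correct, which is the holistic behavior the abstract describes.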
Related papers
- CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework [29.22693846221723]
We introduce CARE, advancing Clinical Accountability in multi-modal medical Reasoning with an Evidence-grounded agentic framework. CARE decomposes the task into coordinated sub-modules to reduce shortcut learning and hallucination. Our CARE-Flow improves average accuracy by 10.9% over same-size (10B) state-of-the-art (SOTA) models.
arXiv Detail & Related papers (2026-03-02T08:38:37Z) - How Well Do Multimodal Models Reason on ECG Signals? [36.281141199783825]
We introduce a reproducible framework for evaluating reasoning in ECG signals. We employ an agentic framework that generates code to empirically verify the temporal structures described in the reasoning trace. This dual-verification method enables the scalable assessment of "true" reasoning capabilities.
arXiv Detail & Related papers (2026-02-27T21:04:12Z) - Closing Reasoning Gaps in Clinical Agents with Differential Reasoning Learning [16.144050164828794]
We propose Differential Reasoning Learning (DRL), a framework that improves clinical agents by learning from reasoning discrepancies. DRL extracts reasoning graphs as directed acyclic graphs (DAGs) and performs a clinically weighted graph edit distance (GED)-based discrepancy analysis. At inference, we retrieve top-$k$ instructions to augment the agent prompt and patch likely logic gaps.
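The DRL abstract's core operation, a clinically weighted GED-style discrepancy between reasoning DAGs, can be illustrated by comparing edge sets and charging a clinical weight for every missing or spurious edge. The helper names and the toy weighting scheme below are assumptions for illustration; the paper's GED formulation is richer than a pure edge-set comparison:

```python
def weighted_discrepancy(reference_edges, agent_edges, weight):
    """Sum of clinical weights over edges missing from or spurious in the agent's DAG
    (a simplified stand-in for a weighted graph edit distance)."""
    reference, agent = set(reference_edges), set(agent_edges)
    missing = reference - agent    # reasoning steps the agent skipped
    spurious = agent - reference   # steps the agent asserted without support
    return sum(weight(e) for e in missing | spurious)


# Toy clinical weighting: edges that commit to a diagnosis cost more to get wrong.
weight = lambda e: 2.0 if e[1] == "diagnosis" else 1.0

ref = {("symptom", "finding"), ("finding", "diagnosis")}
agent = {("symptom", "finding"), ("symptom", "diagnosis")}
d = weighted_discrepancy(ref, agent, weight)
```

Here the agent skips the evidence-bearing edge and jumps straight from symptom to diagnosis; both deviations touch the diagnosis node, so the discrepancy is 4.0 rather than 2.0, mirroring how a clinical weighting makes diagnostic shortcuts costly.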
arXiv Detail & Related papers (2026-02-10T16:29:32Z) - AgentsEval: Clinically Faithful Evaluation of Medical Imaging Reports via Multi-Agent Reasoning [73.50200033931148]
We introduce AgentsEval, a multi-agent stream reasoning framework that emulates the collaborative diagnostic workflow of radiologists. By dividing the evaluation process into interpretable steps including criteria definition, evidence extraction, alignment, and consistency scoring, AgentsEval provides explicit reasoning traces and structured clinical feedback. Experimental results demonstrate that AgentsEval delivers clinically aligned, semantically faithful, and interpretable evaluations that remain robust under paraphrastic, semantic, and stylistic perturbations.
arXiv Detail & Related papers (2026-01-23T11:59:13Z) - M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding [66.78251988482222]
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning. Current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. M3CoTBench aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare.
arXiv Detail & Related papers (2026-01-13T17:42:27Z) - Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models [51.91760712805404]
We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs). Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate an oral examination in medical training. Our analysis identified several failure modes that mirror common cognitive errors in clinical practice.
arXiv Detail & Related papers (2025-10-11T16:24:35Z) - MedCoT-RAG: Causal Chain-of-Thought RAG for Medical Question Answering [4.285647375182588]
Large language models (LLMs) have shown promise in medical question answering but often struggle with hallucinations and shallow reasoning. Retrieval-augmented generation (RAG) offers a practical and privacy-preserving way to enhance LLMs with external medical knowledge. We introduce MedCoT-RAG, a domain-specific framework that combines causal-aware document retrieval with structured chain-of-thought prompting.
arXiv Detail & Related papers (2025-08-20T05:43:26Z) - Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
Large Language Models (LLMs) in medicine have enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies and test-time mechanisms.
arXiv Detail & Related papers (2025-08-01T14:41:31Z) - Integrating clinical reasoning into large language model-based diagnosis through etiology-aware attention steering [7.092919468004549]
Large Language Models (LLMs) demonstrate significant capabilities in medical text understanding and generation. This study aims to enhance LLMs' diagnostic accuracy and clinical reasoning ability.
arXiv Detail & Related papers (2025-08-01T03:05:43Z) - ControlMed: Adding Reasoning Control to Medical Language Model [1.0207955314209531]
Reasoning Large Language Models (LLMs) with enhanced accuracy and explainability are increasingly being adopted in the medical domain. Existing reasoning LLMs often generate unnecessarily lengthy reasoning processes, leading to significant computational overhead and response latency. We introduce ControlMed, a medical language model that enables users to actively control the length of the reasoning process at inference time.
arXiv Detail & Related papers (2025-07-30T10:17:07Z) - Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Models (MedVLMs) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning. We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z) - GEMeX-RMCoT: An Enhanced Med-VQA Dataset for Region-Aware Multimodal Chain-of-Thought Reasoning [60.03671205298294]
Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. Current methods still suffer from limited answer reliability and poor interpretability. This work first proposes a Region-Aware Multimodal Chain-of-Thought dataset, in which the process of producing an answer is preceded by a sequence of intermediate reasoning steps.
arXiv Detail & Related papers (2025-06-22T08:09:58Z) - Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.