Related papers: M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding

URL: http://arxiv.org/abs/2601.08758v2
Date: Wed, 14 Jan 2026 04:19:06 GMT
Title: M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding
Authors: Juntao Jiang, Jiangning Zhang, Yali Bi, Jinsheng Bai, Weixuan Liu, Weiwei Jin, Zhucun Xue, Yong Liu, Xiaobin Hu, Shuicheng Yan
Abstract summary: Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning. Current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. M3CoTBench aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare.
Score: 66.78251988482222
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. An opaque process lacks reliable bases for judgment, making it difficult to assist doctors in diagnosis. To address this gap, we introduce a new M3CoTBench benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features 1) a diverse, multi-level difficulty dataset covering 24 examination types, 2) 13 varying-difficulty tasks, 3) a suite of CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency) tailored to clinical reasoning, and 4) a performance analysis of multiple MLLMs. M3CoTBench systematically evaluates CoT reasoning across diverse medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning, and aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare. Project page at https://juntaojianggavin.github.io/projects/M3CoTBench/.

Related papers

Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models [48.95516224614331]
We introduce MedGaze-Bench, the first benchmark leveraging clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation. Our benchmark addresses three fundamental challenges: visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical, and implicit adherence to safety protocols.
arXiv Detail & Related papers (2026-01-11T02:20:40Z)
DermoGPT: Open Weights and Open Data for Morphology-Grounded Dermatological Reasoning MLLMs [54.8829900010621]
Multimodal Large Language Models (MLLMs) show promise for medical applications, yet progress in dermatology lags due to limited training data, narrow task coverage, and lack of clinically-grounded supervision. We present a comprehensive framework to address these gaps. First, we introduce DermoInstruct, a large-scale morphology-anchored instruction corpus comprising 211,243 images and 772,675 trajectories across five task formats. Second, we establish DermoBench, a rigorous benchmark evaluating 11 tasks across four clinical axes: Morphology, Diagnosis, Reasoning, and Fairness, including a challenging subset of 3,600
arXiv Detail & Related papers (2026-01-05T07:55:36Z)
MedKGI: Iterative Differential Diagnosis with Medical Knowledge Graphs and Information-Guided Inquiring [39.776843923694244]
We propose MedKGI, a diagnostic framework grounded in clinical practices. We show that MedKGI outperforms strong LLM baselines in both diagnostic accuracy and inquiry efficiency.
arXiv Detail & Related papers (2025-12-30T12:31:53Z)
MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning [52.064286116035134]
We develop MedAlign, a framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA). We first propose a multimodal Direct Preference Optimization (mDPO) objective to align preference learning with visual context. We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM.
arXiv Detail & Related papers (2025-10-24T02:11:05Z)
Triplet-Structured Knowledge Integration for Multi-Turn Medical Reasoning [21.44813166265882]
Large Language Models (LLMs) have shown strong performance on static medical Question Answering (QA) tasks. This paper introduces TriMediQ, a triplet-structured approach that enhances the reasoning reliability of LLMs. Experiments on two interactive medical QA benchmarks show that TriMediQ achieves up to 10.4% improvement in accuracy over five existing baselines.
arXiv Detail & Related papers (2025-10-03T22:11:17Z)
MedCoT-RAG: Causal Chain-of-Thought RAG for Medical Question Answering [4.285647375182588]
Large language models (LLMs) have shown promise in medical question answering but often struggle with hallucinations and shallow reasoning. Retrieval-augmented generation (RAG) offers a practical and privacy-preserving way to enhance LLMs with external medical knowledge. We introduce MedCoT-RAG, a domain-specific framework that combines causal-aware document retrieval with structured chain-of-thought prompting.
arXiv Detail & Related papers (2025-08-20T05:43:26Z)
GEMeX-RMCoT: An Enhanced Med-VQA Dataset for Region-Aware Multimodal Chain-of-Thought Reasoning [60.03671205298294]
Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. Current methods still suffer from limited answer reliability and poor interpretability. This work first proposes a Region-Aware Multimodal Chain-of-Thought dataset, in which the process of producing an answer is preceded by a sequence of intermediate reasoning steps.
arXiv Detail & Related papers (2025-06-22T08:09:58Z)
Chiron-o1: Igniting Multimodal Large Language Models towards Generalizable Medical Reasoning via Mentor-Intern Collaborative Search [41.81463064393831]
Multimodal large language models (MLLMs) have begun to demonstrate robust reasoning capabilities on general tasks, yet their application in the medical domain remains in its early stages. We propose Mentor-Intern Collaborative Search (MICS), a novel reasoning-path searching scheme to generate rigorous and effective medical CoT data. We construct MMRP, a multi-task medical reasoning dataset with ranked difficulty, and Chiron-o1, a new medical MLLM devised via a curriculum learning strategy.
arXiv Detail & Related papers (2025-06-20T12:51:19Z)
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages (examination recommendation, diagnostic decision-making, and treatment planning), simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning [14.366349078707263]
This work introduces RJUA-MedDQA, a comprehensive benchmark in the field of medical specialization.
arXiv Detail & Related papers (2024-02-19T06:57:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.