Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs
- URL: http://arxiv.org/abs/2506.16962v1
- Date: Fri, 20 Jun 2025 12:51:19 GMT
- Title: Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs
- Authors: Haoran Sun, Yankai Jiang, Wenjie Lou, Yujie Zhang, Wenjie Li, Lilong Wang, Mianxin Liu, Lei Liu, Xiaosong Wang
- Abstract summary: We propose Mentor-Intern Collaborative Search (MICS) to generate rigorous and effective medical chain-of-thought data. The reasoning performance is determined by an MICS-Score, which assesses the quality of generated reasoning paths. Eventually, we construct MMRP, a multi-task medical reasoning dataset with ranked difficulty, and Chiron-o1, a new medical MLLM devised via a curriculum learning strategy.
- Score: 23.50838763761289
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (MLLMs) have begun to demonstrate robust reasoning capabilities on general tasks, yet their application in the medical domain remains in its early stages. Constructing chain-of-thought (CoT) training data is essential for bolstering the reasoning abilities of medical MLLMs. However, existing approaches fail to offer a comprehensive framework for searching and evaluating effective reasoning paths towards critical diagnosis. To address this challenge, we propose Mentor-Intern Collaborative Search (MICS), a novel reasoning-path searching scheme to generate rigorous and effective medical CoT data. MICS first leverages mentor models to initialize the reasoning, one step at a time, then prompts each intern model to continue the thinking along those initiated paths, and finally selects the optimal reasoning path according to the overall reasoning performance of multiple intern models. The reasoning performance is determined by an MICS-Score, which assesses the quality of generated reasoning paths. Eventually, we construct MMRP, a multi-task medical reasoning dataset with ranked difficulty, and Chiron-o1, a new medical MLLM devised via a curriculum learning strategy, with robust visual question-answering and generalizable reasoning capabilities. Extensive experiments demonstrate that Chiron-o1, trained on our CoT dataset constructed using MICS, achieves state-of-the-art performance across a range of medical visual question answering and reasoning benchmarks. Code is available at https://github.com/manglu097/Chiron-o1.
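The abstract describes MICS procedurally, so a minimal sketch of the search loop may help. This is an illustration under stated assumptions, not the paper's implementation: `Mentor`, `Intern`, `answers_correctly`, and the greedy step selection below are hypothetical stand-ins, and the fraction-of-correct-interns score is only a proxy for the paper's MICS-Score.

```python
from typing import List, Protocol


class Mentor(Protocol):
    def propose_step(self, question: str, path: List[str]) -> str: ...


class Intern(Protocol):
    def complete(self, question: str, path: List[str]) -> str: ...


def answers_correctly(completion: str, reference: str) -> bool:
    # Placeholder correctness check; the paper's evaluation is richer than this.
    return reference.lower() in completion.lower()


def mics_score(question: str, path: List[str], interns: List[Intern],
               reference: str) -> float:
    # Proxy for the MICS-Score: the fraction of intern models that can
    # extend the partial reasoning path to a correct final answer.
    completions = [intern.complete(question, path) for intern in interns]
    return sum(answers_correctly(c, reference) for c in completions) / len(interns)


def mics_search(question: str, reference: str, mentors: List[Mentor],
                interns: List[Intern], max_steps: int = 8) -> List[str]:
    # Greedy step-wise search: mentors propose candidate next steps, and the
    # candidate whose extended path lets the most interns finish correctly
    # is appended to the reasoning path.
    path: List[str] = []
    for _ in range(max_steps):
        candidates = [m.propose_step(question, path) for m in mentors]
        best_step = max(
            candidates,
            key=lambda s: mics_score(question, path + [s], interns, reference),
        )
        path.append(best_step)
        if mics_score(question, path, interns, reference) == 1.0:
            break  # every intern already reaches the correct answer
    return path
```

In this reading, the interns act as a committee of verifiers: a reasoning step is kept only if it measurably helps weaker models reach the right answer, which is what makes the selected paths "effective" rather than merely fluent.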
Related papers
- Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
Large Language Models (LLMs) in medicine have enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies and test-time mechanisms.
arXiv Detail & Related papers (2025-08-01T14:41:31Z)
- PathCoT: Chain-of-Thought Prompting for Zero-shot Pathology Visual Reasoning [20.767097964324172]
We propose PathCoT, a novel zero-shot chain-of-thought prompting method for visual reasoning tasks. PathCoT guides the MLLM with prior knowledge to perform as a pathology expert, and provides comprehensive analysis of the image with its domain-specific knowledge. The experimental results on the PathMMU dataset demonstrate the effectiveness of our method on pathology visual understanding and reasoning.
arXiv Detail & Related papers (2025-06-18T09:20:23Z)
- ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning [44.96018028534255]
ReasonMed is the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths. We train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.
arXiv Detail & Related papers (2025-06-11T08:36:55Z)
- Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning [57.873833577058]
We build a multimodal dataset enriched with extensive medical knowledge. We then introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities.
arXiv Detail & Related papers (2025-06-08T08:47:30Z)
- MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research [57.61445960384384]
MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities. Benchmarking on state-of-the-art MLLMs reveals a peak performance of 53%. Expert analysis of chain-of-thought responses shows perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors.
arXiv Detail & Related papers (2025-03-17T17:33:10Z)
- Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
- LLM-MedQA: Enhancing Medical Question Answering through Case Studies in Large Language Models [18.6994780408699]
Large Language Models (LLMs) face significant challenges in medical question answering. We propose a novel approach incorporating similar case generation within a multi-agent medical question-answering system. Our method capitalizes on the model's inherent medical knowledge and reasoning capabilities, eliminating the need for additional training data.
arXiv Detail & Related papers (2024-12-31T19:55:45Z)
- Demystifying Large Language Models for Medicine: A Primer [50.83806796466396]
Large language models (LLMs) represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare.
This tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice.
arXiv Detail & Related papers (2024-10-24T15:41:56Z)
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
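As a rough illustration of the "reasoning as search" formulation this summary describes, the sketch below expands partial chains of thought best-first under a path reward. The callables `generate_steps`, `reward`, and `is_final` are assumptions for illustration, not MindStar's actual API or its two specific search strategies.

```python
import heapq
from typing import Callable, List, Tuple


def best_first_reasoning(
    question: str,
    generate_steps: Callable[[str, List[str]], List[str]],
    reward: Callable[[str, List[str]], float],
    is_final: Callable[[List[str]], bool],
    max_expansions: int = 64,
) -> List[str]:
    # Frontier of (negated reward, partial reasoning path); heapq pops the
    # smallest element, i.e. the highest-reward path so far.
    frontier: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, path = heapq.heappop(frontier)
        if is_final(path):
            return path
        for step in generate_steps(question, path):
            new_path = path + [step]
            heapq.heappush(frontier, (-reward(question, new_path), new_path))
    return []  # budget exhausted without a complete reasoning path
```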
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
- Few shot chain-of-thought driven reasoning to prompt LLMs for open ended medical question answering [24.43605359639671]
We propose a modified version of the MedQA-USMLE dataset, named MEDQA-OPEN.
It contains open-ended medical questions without options to mimic clinical scenarios, along with clinician-approved reasoned answers.
We implement a prompt driven by Chain of Thought (CoT) reasoning, CLINICR, to mirror the prospective process of incremental reasoning.
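Since the entry describes a CoT prompt that mirrors incremental clinical reasoning, here is an illustrative template in that spirit. The wording, step structure, and `build_prompt` helper are assumptions, not the prompt published with CLINICR.

```python
# Illustrative incremental-reasoning prompt; this wording is an assumption,
# not the template from the CLINICR paper.
CLINICAL_COT_TEMPLATE = """You are a clinician answering an open-ended medical question.
Reason incrementally, as in a real diagnostic workup:
1. Summarize the key findings in the case.
2. List the plausible differential diagnoses.
3. Narrow the differential step by step using the findings.
4. State the final answer with its single strongest justification.

Question: {question}
Let's work through this step by step."""


def build_prompt(question: str) -> str:
    return CLINICAL_COT_TEMPLATE.format(question=question)


print(build_prompt("A 54-year-old presents with crushing chest pain "
                   "radiating to the left arm. What is the most likely diagnosis?"))
```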
arXiv Detail & Related papers (2024-03-07T20:48:40Z)
- RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning [14.366349078707263]
This work introduces RJUA-MedDQA, a comprehensive benchmark in the field of medical specialization.
arXiv Detail & Related papers (2024-02-19T06:57:02Z)