ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models
- URL: http://arxiv.org/abs/2603.01124v1
- Date: Sun, 01 Mar 2026 14:15:54 GMT
- Title: ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models
- Authors: Xiwei Liu, Yulong Li, Xinlin Zhuang, Xuhui Li, Jianxu Chen, Haolin Yang, Imran Razzak, Yutong Xie
- Abstract summary: We propose ClinCoT to transform preference optimization from response-level correction to visual-driven reasoning. We show that ClinCoT consistently improves factual grounding and achieves superior performance compared with existing preference-based alignment methods.
- Score: 24.19721015692576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical Vision-Language Models have shown promising potential in clinical decision support, yet they remain prone to factual hallucinations due to insufficient grounding in localized pathological evidence. Existing medical alignment methods primarily operate at the response level through preference optimization, improving output correctness but leaving intermediate reasoning weakly connected to visual regions. Although chain-of-thought (CoT) enhances multimodal reasoning, it remains largely text-centric, limiting effective integration of clinical visual cues. To address this gap, we propose ClinCoT, a clinical-aware visual chain-of-thought framework that transforms preference optimization from response-level correction to visual-driven reasoning. We introduce an automatic data generation pipeline that constructs clinically grounded preference pairs through reasoning with hypothesis-driven region proposals. Multiple Med-LLM evaluators rank and assign scores to each response, and these rankings serve as supervision to train the target model. We further introduce a scoring-based margin-aware optimization strategy that incorporates both preference ranking and score difference to refine region-level reasoning trajectories. To maintain alignment as the model's policy evolves during training, we adopt an iterative learning scheme that dynamically regenerates preference data. Extensive experiments on three medical VQA and report generation benchmarks demonstrate that ClinCoT consistently improves factual grounding and achieves superior performance compared with existing preference-based alignment methods.
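The abstract does not spell out the scoring-based margin-aware objective; the following is a minimal sketch, assuming a DPO-style pairwise loss whose margin is scaled by the evaluator score gap between the preferred and rejected reasoning traces. The function and parameter names (`margin_dpo_loss`, `score_gap`, `beta`, `gamma`) and the exact margin form are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def margin_dpo_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected,
                    score_gap, beta=0.1, gamma=1.0):
    """Hypothetical scoring-based margin-aware preference loss.

    logp_*     : summed log-probs of chosen/rejected reasoning traces under the policy.
    ref_logp_* : the same quantities under a frozen reference model.
    score_gap  : Med-LLM evaluator score difference (chosen - rejected), which
                 widens the required margin for clearly better responses.
    """
    # Implicit reward difference, as in standard DPO.
    chosen_rewards = beta * (logp_chosen - ref_logp_chosen)
    rejected_rewards = beta * (logp_rejected - ref_logp_rejected)
    # Score-dependent margin: larger evaluator gaps demand larger separation.
    margin = gamma * score_gap
    return -F.logsigmoid(chosen_rewards - rejected_rewards - margin).mean()

# Toy usage with per-example log-probabilities.
loss = margin_dpo_loss(
    logp_chosen=torch.tensor([-12.3, -9.8]),
    logp_rejected=torch.tensor([-14.1, -10.2]),
    ref_logp_chosen=torch.tensor([-12.0, -10.0]),
    ref_logp_rejected=torch.tensor([-13.5, -10.1]),
    score_gap=torch.tensor([2.0, 0.5]),
)
```

Under this reading, the iterative learning scheme would periodically regenerate preference pairs with the current policy and recompute this loss against an updated reference model.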
Related papers
- MedScope: Incentivizing "Think with Videos" for Clinical Reasoning via Coarse-to-Fine Tool Calling [51.31633278218137]
MedScope is a tool-using clinical video reasoning model that performs coarse-to-fine evidence seeking over long-form procedures. We build ClinVideoSuite, an evidence-centric, fine-grained clinical video suite. On full and fine-grained video understanding benchmarks, MedScope achieves state-of-the-art performance.
arXiv Detail & Related papers (2026-02-11T09:47:02Z) - S-Chain: Structured Visual Chain-of-Thought For Medicine [81.97605645734741]
We introduce S-Chain, the first large-scale dataset of 12,000 expert-annotated medical images with bounding boxes and structured visual CoT (SV-CoT). The dataset further supports 16 languages, totaling over 700k VQA pairs for broad multilingual applicability. S-Chain establishes a new benchmark for grounded medical reasoning and paves the way toward more trustworthy and explainable medical vision-language models.
arXiv Detail & Related papers (2025-10-26T15:57:14Z) - MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning [52.064286116035134]
We develop MedAlign, a framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA). We first propose a multimodal Direct Preference Optimization (mDPO) objective to align preference learning with visual context. We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM (a similarity-routing sketch appears after this list).
arXiv Detail & Related papers (2025-10-24T02:11:05Z) - Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation [61.350584471060756]
Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images. We propose Self-Supervised Anatomical Consistency Learning (SS-ACL) to align generated reports with corresponding anatomical regions. SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy.
arXiv Detail & Related papers (2025-09-30T08:59:06Z) - Optimizing Long-Form Clinical Text Generation with Claim-Based Rewards [9.525090594500577]
We present an evaluation-integrated reinforcement learning framework for long-form clinical text generation. Our method directly optimizes factual grounding and completeness without training a separate reward model or relying on human-authored references. The framework is scalable to real-world settings and can incorporate custom objectives such as guideline adherence or billing preferences.
arXiv Detail & Related papers (2025-09-26T17:53:08Z) - Fleming-R1: Toward Expert-Level Medical Reasoning via Reinforcement Learning [6.778254993886297]
We introduce Fleming-R1, a model designed for verifiable medical reasoning through three complementary innovations. First, our Reasoning-Oriented Data Strategy (RODS) combines curated medical QA datasets with knowledge-graph-guided synthesis. Second, we employ Chain-of-Thought (CoT) cold start to distill high-quality reasoning trajectories from teacher models. Third, we implement a two-stage Reinforcement Learning from Verifiable Rewards framework.
arXiv Detail & Related papers (2025-09-18T13:35:14Z) - RPRO: Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning [5.493679122639688]
Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference. We propose Ranked Preference Reinforcement Optimization (RPRO), a novel framework that combines reinforcement learning with preference-driven reasoning refinement.
arXiv Detail & Related papers (2025-08-31T19:38:25Z) - Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Models (MedVLMs) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning. We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLMs with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z) - GEMeX-RMCoT: An Enhanced Med-VQA Dataset for Region-Aware Multimodal Chain-of-Thought Reasoning [60.03671205298294]
Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. Current methods still suffer from limited answer reliability and poor interpretability. This work first proposes a Region-Aware Multimodal Chain-of-Thought dataset, in which the process of producing an answer is preceded by a sequence of intermediate reasoning steps.
arXiv Detail & Related papers (2025-06-22T08:09:58Z)
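The RA-MoE routing in the MedAlign entry above is described only at a high level. Below is a minimal sketch, assuming queries are dispatched by cosine similarity of their image and text embeddings against per-expert key vectors; every name here (`route_query`, `expert_keys`, `radiology_expert`, `pathology_expert`) is an illustrative assumption rather than that paper's actual design.

```python
import numpy as np

def route_query(image_emb, text_emb, expert_keys, top_k=1):
    """Hypothetical similarity-based routing in the spirit of RA-MoE:
    score each expert by combined image and text similarity, then
    dispatch the query to the best-matching expert(s).

    image_emb, text_emb : 1-D query embeddings (already L2-normalized).
    expert_keys         : dict mapping expert name -> (img_key, txt_key).
    """
    scores = {}
    for name, (img_key, txt_key) in expert_keys.items():
        # Cosine similarity reduces to a dot product for unit vectors.
        scores[name] = float(image_emb @ img_key + text_emb @ txt_key)
    # Pick the top-k experts by combined similarity.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy example with 4-d unit embeddings and two hypothetical experts.
rng = np.random.default_rng(0)
unit = lambda v: v / np.linalg.norm(v)
experts = {
    "radiology_expert": (unit(rng.normal(size=4)), unit(rng.normal(size=4))),
    "pathology_expert": (unit(rng.normal(size=4)), unit(rng.normal(size=4))),
}
print(route_query(unit(rng.normal(size=4)), unit(rng.normal(size=4)), experts))
```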
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.