S-Chain: Structured Visual Chain-of-Thought For Medicine
- URL: http://arxiv.org/abs/2510.22728v1
- Date: Sun, 26 Oct 2025 15:57:14 GMT
- Title: S-Chain: Structured Visual Chain-of-Thought For Medicine
- Authors: Khai Le-Duc, Duy M. H. Nguyen, Phuong T. H. Trinh, Tien-Phat Nguyen, Nghiem T. Diep, An Ngo, Tung Vu, Trinh Vuong, Anh-Tien Nguyen, Mau Nguyen, Van Trung Hoang, Khai-Nguyen Nguyen, Hy Nguyen, Chris Ngo, Anji Liu, Nhat Ho, Anne-Christin Hauschild, Khanh Xuan Nguyen, Thanh Nguyen-Tang, Pengtao Xie, Daniel Sonntag, James Zou, Mathias Niepert, Anh Totti Nguyen
- Abstract summary: We introduce S-Chain, the first large-scale dataset of 12,000 expert-annotated medical images with bounding boxes and structured visual CoT (SV-CoT). The dataset further supports 16 languages, totaling over 700k VQA pairs for broad multilingual applicability. S-Chain establishes a new benchmark for grounded medical reasoning and paves the way toward more trustworthy and explainable medical vision-language models.
- Score: 81.97605645734741
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Faithful reasoning in medical vision-language models (VLMs) requires not only accurate predictions but also transparent alignment between textual rationales and visual evidence. While Chain-of-Thought (CoT) prompting has shown promise in medical visual question answering (VQA), no large-scale expert-level dataset has captured stepwise reasoning with precise visual grounding. We introduce S-Chain, the first large-scale dataset of 12,000 expert-annotated medical images with bounding boxes and structured visual CoT (SV-CoT), explicitly linking visual regions to reasoning steps. The dataset further supports 16 languages, totaling over 700k VQA pairs for broad multilingual applicability. Using S-Chain, we benchmark state-of-the-art medical VLMs (ExGra-Med, LLaVA-Med) and general-purpose VLMs (Qwen2.5-VL, InternVL2.5), showing that SV-CoT supervision significantly improves interpretability, grounding fidelity, and robustness. Beyond benchmarking, we study its synergy with retrieval-augmented generation, revealing how domain knowledge and visual grounding interact during autoregressive reasoning. Finally, we propose a new mechanism that strengthens the alignment between visual evidence and reasoning, improving both reliability and efficiency. S-Chain establishes a new benchmark for grounded medical reasoning and paves the way toward more trustworthy and explainable medical VLMs.
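To make the dataset description concrete, here is a minimal sketch of what a single S-Chain example might look like as a data structure, given that each sample couples an image, expert bounding boxes, and reasoning steps that reference those boxes. The field names (`image_id`, `bounding_boxes`, `reasoning_steps`, `box_ids`, `language`) and the example content are assumptions for illustration, not the released S-Chain schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ReasoningStep:
    """One step of a structured visual chain-of-thought (SV-CoT)."""
    text: str            # textual rationale for this step
    box_ids: List[int]   # indices into SVCoTSample.bounding_boxes that ground the step

@dataclass
class SVCoTSample:
    """Hypothetical layout of a single S-Chain VQA example (illustrative only)."""
    image_id: str
    question: str
    answer: str
    language: str  # one of the 16 supported languages
    bounding_boxes: List[Tuple[float, float, float, float]] = field(default_factory=list)  # (x, y, w, h)
    reasoning_steps: List[ReasoningStep] = field(default_factory=list)

# Illustrative example: a two-step rationale, each step grounded in an expert-drawn region.
sample = SVCoTSample(
    image_id="case_0001.png",
    question="Is there evidence of pleural effusion?",
    answer="Yes",
    language="en",
    bounding_boxes=[(0.55, 0.60, 0.30, 0.25), (0.10, 0.15, 0.35, 0.30)],
    reasoning_steps=[
        ReasoningStep(text="Blunting of the right costophrenic angle.", box_ids=[0]),
        ReasoningStep(text="No comparable finding on the left side.", box_ids=[1]),
    ],
)
```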
Related papers
- Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs [60.93949629734977]
We propose Visual Contrastive Self-Taught Reasoner (VC-STaR) to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets.
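The abstract only states that contrastive pairs are curated "according to multi-modal similarity"; one plausible reading is hard-negative mining over precomputed joint image-text embeddings, sketched below. The helper name `curate_contrastive_pairs` and the mining rule are assumptions, not the VC-STaR procedure.

```python
import torch
import torch.nn.functional as F

def curate_contrastive_pairs(embeddings: torch.Tensor, labels: torch.Tensor, k: int = 1):
    """For each example, pick the most similar example with a *different* answer
    as a hard negative (one plausible similarity-based curation rule)."""
    normed = F.normalize(embeddings, dim=-1)
    sims = normed @ normed.T                        # (N, N) cosine similarities
    sims.fill_diagonal_(-1.0)                       # never pair an example with itself
    same_label = labels.unsqueeze(0) == labels.unsqueeze(1)
    sims = sims.masked_fill(same_label, -1.0)       # negatives must disagree on the answer
    neg_idx = sims.topk(k, dim=-1).indices          # hardest negatives per example
    return [(i, int(j)) for i in range(len(labels)) for j in neg_idx[i]]
```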
arXiv Detail & Related papers (2026-03-03T03:18:31Z)
- M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding [66.78251988482222]
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning. Current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. M3CoTBench aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare.
arXiv Detail & Related papers (2026-01-13T17:42:27Z)
- Enhancing Medical Large Vision-Language Models via Alignment Distillation [30.592211423687246]
We propose MEDALIGN to transfer visual alignment knowledge from a domain-specific Contrastive Language-Image Pre-training model to Med-LVLMs. Experiments on medical report generation and medical visual question answering benchmarks show that MEDALIGN consistently improves both performance and interpretability.
arXiv Detail & Related papers (2025-12-21T00:57:13Z)
- MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning [52.064286116035134]
We develop MedAlign, a framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA). We first propose a multimodal Direct Preference Optimization (mDPO) objective to align preference learning with visual context. We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM.
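The abstract names a multimodal DPO (mDPO) objective without giving its form. Below is a minimal sketch of a standard DPO preference loss in which each response log-probability is assumed to be conditioned on the image as well as the question; the actual MedAlign objective may differ.

```python
import torch
import torch.nn.functional as F

def dpo_preference_loss(policy_logp_chosen: torch.Tensor,
                        policy_logp_rejected: torch.Tensor,
                        ref_logp_chosen: torch.Tensor,
                        ref_logp_rejected: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss; each log-prob is the summed token log-likelihood of a
    response, here assumed to be conditioned on (image, question)."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```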
arXiv Detail & Related papers (2025-10-24T02:11:05Z)
- XBench: A Comprehensive Benchmark for Visual-Language Explanations in Chest Radiography [6.447908430647854]
We present the first systematic benchmark for evaluating cross-modal interpretability in chest X-rays. We generate visual explanations using cross-attention and similarity-based localization maps. We quantitatively assess their alignment with radiologist-annotated regions across multiple pathologies.
arXiv Detail & Related papers (2025-10-22T13:52:19Z)
- Think Twice to See More: Iterative Visual Reasoning in Medical VLMs [21.083636394814217]
We introduce ViTAR, a framework that emulates the iterative reasoning process of human experts through a cognitive chain of "think-act-rethink-answer". ViTAR treats medical images as interactive objects, enabling models to engage in multi-step visual reasoning.
arXiv Detail & Related papers (2025-10-11T06:39:57Z)
- Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation [61.350584471060756]
Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images. We propose Self-Supervised Anatomical Consistency Learning (SS-ACL) to align generated reports with corresponding anatomical regions. SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy.
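To illustrate what a "top-down inclusion structure" of anatomy means, here is a toy hierarchy and a traversal helper; both the tree contents and the function `ancestors` are illustrative assumptions, not the SS-ACL graph itself.

```python
# Toy inclusion hierarchy: each region contains the regions nested under it.
ANATOMY_TREE = {
    "thorax": {
        "left lung":  {"left upper lobe": {}, "left lower lobe": {}},
        "right lung": {"right upper lobe": {}, "right middle lobe": {}, "right lower lobe": {}},
        "heart": {},
    },
}

def ancestors(tree, target, path=()):
    """Return the chain of enclosing regions for a target region, or None if absent."""
    for region, children in tree.items():
        if region == target:
            return list(path) + [region]
        found = ancestors(children, target, path + (region,))
        if found:
            return found
    return None

print(ancestors(ANATOMY_TREE, "right middle lobe"))
# ['thorax', 'right lung', 'right middle lobe']
```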
arXiv Detail & Related papers (2025-09-30T08:59:06Z)
- RAU: Reference-based Anatomical Understanding with Vision Language Models [26.06602931463068]
We introduce RAU, a framework for reference-based anatomical understanding with vision-language models (VLMs). We first show that a VLM learns to identify anatomical regions through relative spatial reasoning between reference and target images. Next, we demonstrate that the VLM-derived spatial cues can be seamlessly integrated with the fine-grained segmentation capability of SAM2.
arXiv Detail & Related papers (2025-09-26T14:32:03Z)
- Knowing or Guessing? Robust Medical Visual Question Answering via Joint Consistency and Contrastive Learning [34.6490677122246]
We show that current Medical Vision-Language Models (Med-VLMs) exhibit concerning fragility in Medical Visual Question Answering. We propose Consistency and Contrastive Learning (CCL), which integrates knowledge-anchored consistency learning and bias-aware contrastive learning. CCL achieves SOTA performance on three popular VQA benchmarks and notably improves answer consistency by 50% on the challenging RoMed test set.
arXiv Detail & Related papers (2025-08-26T05:21:19Z)
- GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning [50.94508930739623]
Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. Current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model-generated answers. This work first proposes a Thinking with Visual Grounding dataset wherein the answer generation is decomposed into intermediate reasoning steps. We introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer.
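The abstract mentions a verifiable reward for RL post-training without specifying it. The sketch below shows one generic, rule-checkable possibility (an assumption, not the GEMeX-ThinkVG reward): exact-match answer correctness combined with an IoU check on a predicted grounding box; the weights and threshold are arbitrary.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def verifiable_reward(pred_answer, gold_answer, pred_box, gold_box,
                      w_answer=0.7, w_ground=0.3, iou_threshold=0.5):
    """Rule-based reward: correct final answer plus sufficiently overlapping grounding."""
    answer_ok = float(pred_answer.strip().lower() == gold_answer.strip().lower())
    ground_ok = float(box_iou(pred_box, gold_box) >= iou_threshold)
    return w_answer * answer_ok + w_ground * ground_ok

# Example usage with normalized box coordinates.
r = verifiable_reward("Yes", "yes", (0.5, 0.6, 0.8, 0.85), (0.52, 0.58, 0.82, 0.9))
```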
arXiv Detail & Related papers (2025-06-22T08:09:58Z)
- From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation [48.45209969191245]
Vision-language models (VLMs) provide semantic context through textual descriptions but lack the required explanation precision. We propose a teacher-student framework that integrates both gaze and language supervision, leveraging their complementary strengths. Our method achieves Dice scores of 80.78%, 80.53%, and 84.22%, respectively, improving by 3-5% over gaze baselines without increasing the annotation burden.
arXiv Detail & Related papers (2025-04-15T16:32:15Z)