CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning
- URL: http://arxiv.org/abs/2508.03733v1
- Date: Thu, 31 Jul 2025 05:07:18 GMT
- Title: CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning
- Authors: Wenjie Li, Yujie Zhang, Haoran Sun, Yueqi Li, Fanrui Zhang, Mengzhe Xu, Victoria Borja Clausich, Sade Mellin, Renhao Yang, Chenrun Wang, Jethro Zih-Shuo Wang, Shiyi Yao, Gen Li, Yidong Xu, Hanyu Wang, Yilin Huang, Angela Lin Wang, Chen Shi, Yin Zhang, Jianan Guo, Luqi Yang, Renxuan Li, Yang Xu, Jiawei Liu, Yao Zhang, Lei Liu, Carlos Gutiérrez SanRomán, Lei Wang,
- Abstract summary: We propose CX-Mind, the first generative model to achieve interleaved "think-answer" reasoning for chest X-ray (CXR) tasks. CX-Mind is driven by curriculum-based reinforcement learning and verifiable process rewards (CuRL-VPR). Experiments demonstrate that CX-Mind significantly outperforms existing medical and general-domain MLLMs in visual understanding, text generation, and alignment.
- Score: 28.737391224748798
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chest X-ray (CXR) imaging is one of the most widely used diagnostic modalities in clinical practice, encompassing a broad spectrum of diagnostic tasks. Recent advancements have seen the extensive application of reasoning-based multimodal large language models (MLLMs) in medical imaging to enhance diagnostic efficiency and interpretability. However, existing multimodal models predominantly rely on "one-time" diagnostic approaches, lacking verifiable supervision of the reasoning process. This leads to challenges in multi-task CXR diagnosis, including lengthy reasoning, sparse rewards, and frequent hallucinations. To address these issues, we propose CX-Mind, the first generative model to achieve interleaved "think-answer" reasoning for CXR tasks, driven by curriculum-based reinforcement learning and verifiable process rewards (CuRL-VPR). Specifically, we constructed an instruction-tuning dataset, CX-Set, comprising 708,473 images and 2,619,148 samples, and generated 42,828 high-quality interleaved reasoning data points supervised by clinical reports. Optimization was conducted in two stages under the Group Relative Policy Optimization framework: initially stabilizing basic reasoning with closed-domain tasks, followed by transfer to open-domain diagnostics, incorporating rule-based conditional process rewards to bypass the need for pretrained reward models. Extensive experimental results demonstrate that CX-Mind significantly outperforms existing medical and general-domain MLLMs in visual understanding, text generation, and spatiotemporal alignment, achieving an average performance improvement of 25.1% over comparable CXR-specific models. On the real-world clinical dataset Rui-CXR, CX-Mind achieves a mean recall@1 across 14 diseases that substantially surpasses the second-best results, with multi-center expert evaluations further confirming its clinical utility across multiple dimensions.
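The abstract's training recipe pairs Group Relative Policy Optimization (GRPO) with rule-based, conditional process rewards for interleaved "think-answer" outputs. The sketch below shows one plausible way such rewards and group-relative advantages could be computed; the <think>/<answer> tags, the keyword-based process check, and the equal reward weights are assumptions made for this sketch, not the paper's exact implementation.

```python
import re
import numpy as np

def format_reward(response: str) -> float:
    """Rule-based format check: the response should interleave <think>...</think>
    and <answer>...</answer> blocks (tag names are an assumption of this sketch)."""
    thinks = re.findall(r"<think>.*?</think>", response, flags=re.S)
    answers = re.findall(r"<answer>.*?</answer>", response, flags=re.S)
    return 1.0 if thinks and len(thinks) == len(answers) else 0.0

def outcome_reward(response: str, gold_label: str) -> float:
    """Verifiable outcome reward for closed-domain tasks: exact match of the
    final extracted answer against the reference label."""
    answers = re.findall(r"<answer>(.*?)</answer>", response, flags=re.S)
    if not answers:
        return 0.0
    return 1.0 if answers[-1].strip().lower() == gold_label.strip().lower() else 0.0

def conditional_process_reward(response: str, gold_label: str,
                               report_keywords: list[str]) -> float:
    """Conditional process reward: granted only when the final answer is correct,
    proportional to how many report-derived findings the reasoning mentions
    (a simplification of verifiable process rewards)."""
    if outcome_reward(response, gold_label) < 1.0:
        return 0.0
    thinking = " ".join(re.findall(r"<think>(.*?)</think>", response, flags=re.S)).lower()
    hits = sum(kw.lower() in thinking for kw in report_keywords)
    return hits / max(len(report_keywords), 1)

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: standardize rewards within a group of responses
    sampled for the same prompt, as done in GRPO."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy group of responses sampled for one closed-domain CXR question.
sampled_responses = [
    "<think>Consolidation in the right lower lobe.</think><answer>pneumonia</answer>",
    "<think>Lungs are clear, no focal opacity.</think><answer>no finding</answer>",
    "<answer>pneumonia</answer>",  # correct answer but skips the reasoning step
]
rewards = np.array([
    format_reward(r)
    + outcome_reward(r, "pneumonia")
    + conditional_process_reward(r, "pneumonia", ["consolidation", "right lower lobe"])
    for r in sampled_responses
])
advantages = grpo_advantages(rewards)  # would feed a clipped policy-gradient update
print(rewards, advantages)
```

Because every reward term here is a deterministic rule over the generated text, no pretrained reward model is needed, which matches the motivation stated in the abstract.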
Related papers
- Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation [56.52520416420957]
We propose Multimodal Causal-Driven Representation Learning (MCDRL) to tackle domain generalization in medical image segmentation. MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
arXiv Detail & Related papers (2025-08-07T03:41:41Z) - CLARIFID: Improving Radiology Report Generation by Reinforcing Clinically Accurate Impressions and Enforcing Detailed Findings [1.515687944002438]
We propose CLARIFID, a novel framework that directly optimizes diagnostic correctness by mirroring the two-step workflow of experts. CLARIFID learns the logical flow from Findings to Impression through section-aware pretraining. We show that our method achieves superior clinical efficacy and outperforms existing baselines on both standard NLG metrics and clinically aware scores.
arXiv Detail & Related papers (2025-07-23T05:57:59Z) - RadFabric: Agentic AI System with Reasoning Capability for Radiology [61.25593938175618]
RadFabric is a multi-agent, multimodal reasoning framework that unifies visual and textual analysis for comprehensive CXR interpretation. The system employs specialized CXR agents for pathology detection, an Anatomical Interpretation Agent to map visual findings to precise anatomical structures, and a Reasoning Agent powered by large multimodal reasoning models to synthesize visual, anatomical, and clinical data into transparent, evidence-based diagnoses.
arXiv Detail & Related papers (2025-06-17T03:10:33Z) - CXR-LT 2024: A MICCAI challenge on long-tailed, multi-label, and zero-shot disease classification from chest X-ray [64.2434525370243]
The CXR-LT series is a community-driven initiative designed to enhance lung disease classification using chest X-rays. CXR-LT 2024 expands the dataset to 377,110 chest X-rays (CXRs) and 45 disease labels, including 19 new rare disease findings. This paper provides an overview of CXR-LT 2024, detailing the data curation process and consolidating state-of-the-art solutions.
arXiv Detail & Related papers (2025-06-09T17:53:31Z) - Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning [18.15610003617933]
We present CXRTrek, a new multi-stage visual question answering (VQA) dataset for chest X-ray (CXR) interpretation. The dataset is designed to explicitly simulate the diagnostic reasoning process employed by radiologists in real-world clinical settings. We propose a new vision-language large model (VLLM), CXRTrekNet, specifically designed to incorporate the clinical reasoning flow into the framework.
arXiv Detail & Related papers (2025-05-29T06:30:40Z) - ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification [57.22053411719822]
ChestX-Reasoner is a radiology diagnosis MLLM designed to leverage process supervision mined directly from clinical reports. Our two-stage training framework combines supervised fine-tuning and reinforcement learning guided by process rewards to better align model reasoning with clinical standards.
arXiv Detail & Related papers (2025-04-29T16:48:23Z) - A foundation model for generalizable disease diagnosis in chest X-ray images [40.9095393430871]
We introduce CXRBase, a foundational model designed to learn versatile representations from unlabelled CXR images.
CXRBase is trained on a substantial dataset of 1.04 million unlabelled CXR images.
It is fine-tuned with labeled data to enhance its performance in disease detection.
arXiv Detail & Related papers (2024-10-11T14:41:27Z) - DiCoM -- Diverse Concept Modeling towards Enhancing Generalizability in Chest X-Ray Studies [6.83819481805979]
Chest X-Ray (CXR) is a widely used clinical imaging modality.
Self-supervised pre-training has proven to outperform supervised pre-training in numerous downstream vision tasks.
We introduce Diverse Concept Modeling (DiCoM), a novel self-supervised training paradigm.
arXiv Detail & Related papers (2024-02-22T20:51:37Z) - Improving Classification Model Performance on Chest X-Rays through Lung Segmentation [63.45024974079371]
We propose a deep learning approach to enhance abnormal chest X-ray (CXR) identification performance through lung segmentation.
Our approach is designed in a cascaded manner and incorporates two modules: a deep neural network with criss-cross attention modules (XLSor) for localizing the lung region in CXR images, and a CXR classification model with a backbone of a self-supervised momentum contrast (MoCo) model pre-trained on large-scale CXR datasets.
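A minimal sketch of such a segment-then-classify cascade is given below; the tiny segmentation network and the randomly initialized ResNet-50 are placeholders standing in for the pretrained XLSor and MoCo models, so this illustrates only the data flow, not the paper's actual weights or architecture.

```python
import torch
import torch.nn as nn
import torchvision.models as tvm

class LungSegmenter(nn.Module):
    """Placeholder for an XLSor-style lung segmentation network
    (a real pipeline would load pretrained XLSor weights)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 1, 1),
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x))  # per-pixel lung probability

def crop_to_lungs(image: torch.Tensor, mask: torch.Tensor, thr: float = 0.5) -> torch.Tensor:
    """Crop the CXR to the bounding box of the predicted lung mask."""
    ys, xs = torch.where(mask[0] > thr)
    if ys.numel() == 0:  # fall back to the full image if no lung pixels are found
        return image
    return image[:, ys.min():ys.max() + 1, xs.min():xs.max() + 1]

segmenter = LungSegmenter().eval()
# Classifier backbone: in the paper this is a MoCo-pretrained encoder; here a
# randomly initialized ResNet-50 adapted to grayscale input stands in for it.
classifier = tvm.resnet50(weights=None)
classifier.conv1 = nn.Conv2d(1, 64, 7, stride=2, padding=3, bias=False)
classifier.fc = nn.Linear(classifier.fc.in_features, 2)  # normal vs. abnormal
classifier.eval()

with torch.no_grad():
    cxr = torch.rand(1, 512, 512)                  # toy single-channel CXR
    mask = segmenter(cxr.unsqueeze(0))[0]          # (1, H, W) lung probability map
    crop = crop_to_lungs(cxr, mask)
    crop = nn.functional.interpolate(crop.unsqueeze(0), size=(224, 224), mode="bilinear")
    logits = classifier(crop)                      # abnormality logits for the lung crop
```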
arXiv Detail & Related papers (2022-02-22T15:24:06Z) - Many-to-One Distribution Learning and K-Nearest Neighbor Smoothing for Thoracic Disease Identification [83.6017225363714]
Deep learning has become the most powerful computer-aided diagnosis technology for improving disease identification performance.
For chest X-ray imaging, annotating large-scale data requires professional domain knowledge and is time-consuming.
In this paper, we propose many-to-one distribution learning (MODL) and K-nearest neighbor smoothing (KNNS) methods to improve a single model's disease identification performance.
arXiv Detail & Related papers (2021-02-26T02:29:30Z)
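The K-nearest neighbor smoothing idea from the entry above can be illustrated generically: each image's predicted disease probabilities are blended with the average predictions of its nearest neighbors in feature space. The mixing weight, neighbor count, and random toy data below are illustrative choices for this sketch, not the paper's exact MODL/KNNS formulation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_smooth(features: np.ndarray, probs: np.ndarray,
               k: int = 5, alpha: float = 0.5) -> np.ndarray:
    """Blend each image's disease probabilities with the mean prediction of its
    k nearest neighbors in feature space; alpha weights the original prediction."""
    nn_index = NearestNeighbors(n_neighbors=k + 1).fit(features)
    _, idx = nn_index.kneighbors(features)          # column 0 is the point itself
    neighbor_mean = probs[idx[:, 1:]].mean(axis=1)  # average over the k neighbors
    return alpha * probs + (1.0 - alpha) * neighbor_mean

# Toy example: 100 images, 32-d features, 14 disease probabilities each.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 32))
probs = rng.uniform(size=(100, 14))
smoothed = knn_smooth(features, probs)
```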