Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning
- URL: http://arxiv.org/abs/2509.19090v2
- Date: Wed, 24 Sep 2025 08:19:50 GMT
- Title: Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning
- Authors: Guoxin Wang, Jun Zhao, Xinyi Liu, Yanbo Liu, Xuyang Cao, Chao Li, Zhuoyun Liu, Qintian Sun, Fangru Zhou, Haoqiang Xing, Zhenhong Yang,
- Abstract summary: We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning. It supports pixel-level lesion localization, structured report generation, and physician-like diagnostic inference.
- Score: 13.783146290218738
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical imaging provides critical evidence for clinical diagnosis, treatment planning, and surgical decisions, yet most existing imaging models are narrowly focused and require multiple specialized networks, limiting their generalization. Although large-scale language and multimodal models exhibit strong reasoning and multi-task capabilities, real-world clinical applications demand precise visual grounding, multimodal integration, and chain-of-thought reasoning. We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference in a single framework. We propose a novel multimodal training approach and release a curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks. Evaluations demonstrate that Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks, delivering a unified pipeline from visual grounding to clinical reasoning and supporting precise lesion quantification, automated reporting, and reliable second opinions.
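The abstract describes a single framework spanning detection, segmentation, and chain-of-thought reasoning, but does not publish an interface. As a minimal sketch of what such a unified detect-segment-reason call could look like, assuming entirely hypothetical class and method names (none of them from Citrus-V's actual code):

```python
from dataclasses import dataclass, field

# Hypothetical output containers for a unified model such as Citrus-V;
# the real interface is not specified in the abstract.
@dataclass
class LesionFinding:
    label: str                                 # e.g. "pulmonary nodule"
    box_xyxy: tuple                            # detection: bounding box in pixel coords
    mask: list = field(default_factory=list)   # segmentation: binary mask rows

@dataclass
class DiagnosticResult:
    findings: list                             # list[LesionFinding] from grounding
    chain_of_thought: str                      # intermediate reasoning trace
    report: str                                # structured report text

def diagnose(model, image, prompt: str) -> DiagnosticResult:
    """One call covering visual grounding and clinical reasoning (illustrative only)."""
    findings = model.detect_and_segment(image)      # pixel-level lesion localization
    trace = model.reason(image, findings, prompt)   # multimodal chain-of-thought
    report = model.write_report(findings, trace)    # structured report generation
    return DiagnosticResult(findings, trace, report)
```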
Related papers
- AgentsEval: Clinically Faithful Evaluation of Medical Imaging Reports via Multi-Agent Reasoning [73.50200033931148]
We introduce AgentsEval, a multi-agent stream reasoning framework that emulates the collaborative diagnostic workflow of radiologists. By dividing the evaluation process into interpretable steps, including criteria definition, evidence extraction, alignment, and consistency scoring, AgentsEval provides explicit reasoning traces and structured clinical feedback. Experimental results demonstrate that AgentsEval delivers clinically aligned, semantically faithful, and interpretable evaluations that remain robust under paraphrastic, semantic, and stylistic perturbations.
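The four named steps suggest a linear multi-agent pipeline. A minimal sketch, assuming a generic `llm.ask` helper and prompts of my own invention (the actual AgentsEval agents and prompts are not given in the abstract):

```python
# Illustrative-only decomposition of a multi-agent report-evaluation pipeline,
# mirroring the four steps the abstract names. All function names and prompts
# are hypothetical, not from the AgentsEval codebase.
def evaluate_report(llm, reference: str, candidate: str) -> dict:
    criteria = llm.ask(
        f"List the clinical criteria a report on this case must cover:\n{reference}")
    evidence = llm.ask(
        f"Extract evidence for each criterion from:\n{candidate}\nCriteria:\n{criteria}")
    alignment = llm.ask(
        f"Align the extracted evidence against the reference:\n{reference}\n"
        f"Evidence:\n{evidence}")
    score = llm.ask(
        f"Rate factual consistency from 0 to 10 given this alignment:\n{alignment}")
    return {"criteria": criteria, "evidence": evidence, "alignment": alignment,
            "consistency_score": score,
            "trace": [criteria, evidence, alignment]}  # explicit reasoning trace
```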
arXiv Detail & Related papers (2026-01-23T11:59:13Z) - A Medical Multimodal Diagnostic Framework Integrating Vision-Language Models and Logic Tree Reasoning [24.842846823884557]
We propose a diagnostic framework built upon LLaVA that combines vision-language alignment with logic-regularized reasoning. We show that our method improves diagnostic accuracy and yields more interpretable reasoning traces on multimodal tasks, while remaining competitive in text-only settings.
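One common way to realize "logic-regularized reasoning" is to add a penalty on probability mass assigned to diagnoses that a logic tree has ruled out. A hedged PyTorch sketch of that idea, not the paper's actual loss:

```python
import torch
import torch.nn.functional as F

def logic_regularized_loss(logits, labels, rule_mask, lam=0.1):
    """Hypothetical sketch: cross-entropy plus a penalty on probability mass
    assigned to diagnoses a logic tree has excluded for this case.
    rule_mask[i, c] = 1.0 if class c is inconsistent with case i's findings."""
    ce = F.cross_entropy(logits, labels)
    probs = logits.softmax(dim=-1)
    logic_penalty = (probs * rule_mask).sum(dim=-1).mean()  # mass on ruled-out classes
    return ce + lam * logic_penalty
```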
arXiv Detail & Related papers (2025-12-25T09:01:06Z) - RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis [56.373297358647655]
Retrieval-Augmented Diagnosis (RAD) is a novel framework that injects external knowledge into multimodal models directly on downstream tasks. RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss, and a transformer-based dual decoder.
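The "guideline-enhanced contrastive loss" is not specified beyond its name; a plausible reading is an InfoNCE-style objective that pulls each case embedding toward the guideline embedding of its own disease. A sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def guideline_contrastive_loss(case_emb, guideline_emb, disease_ids, tau=0.07):
    """Hedged sketch of a guideline-enhanced contrastive loss: each case
    embedding is pulled toward its disease's guideline embedding and pushed
    away from the others (InfoNCE over guidelines). The actual RAD objective
    is not detailed in the abstract."""
    case = F.normalize(case_emb, dim=-1)        # (B, D) per-case embeddings
    guide = F.normalize(guideline_emb, dim=-1)  # (K, D), one row per disease
    logits = case @ guide.t() / tau             # (B, K) cosine similarities
    return F.cross_entropy(logits, disease_ids)
```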
arXiv Detail & Related papers (2025-09-24T10:36:14Z) - Intelligent Healthcare Imaging Platform An VLM-Based Framework for Automated Medical Image Analysis and Clinical Report Generation [0.0]
This work presents an intelligent multimodal framework for medical image analysis that leverages Vision-Language Models (VLMs). The framework integrates Google Gemini 2.5 Flash for automated tumor detection and clinical report generation across multiple imaging modalities, including CT, MRI, X-ray, and ultrasound.
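Since the framework is built on a hosted model, the client side can be small. One possible calling pattern using the google-generativeai SDK; the file name, prompt, and report structure below are assumptions, not the paper's:

```python
# Assumed client-side pattern; the paper's actual prompts and pre/post-
# processing are not published in the abstract.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-flash")

scan = Image.open("chest_ct_slice.png")  # hypothetical input file
response = model.generate_content([
    scan,
    "You are a radiology assistant. Identify any suspicious masses and "
    "draft a structured report with FINDINGS, IMPRESSION, RECOMMENDATIONS.",
])
print(response.text)
```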
arXiv Detail & Related papers (2025-09-16T23:15:44Z) - MedAtlas: Evaluating LLMs for Multi-Round, Multi-Task Medical Reasoning Across Diverse Imaging Modalities and Clinical Text [25.102399692530245]
We introduce MedAtlas, a novel benchmark framework to evaluate large language models on realistic medical reasoning tasks. MedAtlas is characterized by four key features: multi-turn dialogue, multi-modal medical image interaction, multi-task integration, and high clinical fidelity. Each case is derived from real diagnostic cases and incorporates temporal interactions between textual medical histories and multiple imaging modalities, including CT, MRI, PET, ultrasound, and X-ray.
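The abstract's four features imply a nested case structure. A hypothetical schema for illustration; the benchmark's real data layout may differ:

```python
from dataclasses import dataclass

# Assumed schema reflecting the four features the abstract names
# (multi-turn dialogue, multi-modal images, multi-task, clinical fidelity).
@dataclass
class Turn:
    question: str
    answer: str
    task: str        # e.g. "diagnosis", "report QA", "measurement"

@dataclass
class MedAtlasCase:
    history: str     # textual medical history
    images: dict     # modality -> file path, e.g. {"CT": "...", "MRI": "..."}
    turns: list      # ordered list[Turn], evaluated round by round
```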
arXiv Detail & Related papers (2025-08-13T17:32:17Z) - Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation [56.52520416420957]
We propose Multimodal Causal-Driven Representation Learning (MCDRL) to tackle domain generalization in medical image segmentation. MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
arXiv Detail & Related papers (2025-08-07T03:41:41Z) - A Survey of Multimodal Ophthalmic Diagnostics: From Task-Specific Approaches to Foundational Models [28.34025112894094]
The review focuses on two main categories: task-specific multimodal approaches and large-scale multimodal foundation models. The survey critically examines important datasets, evaluation metrics, and methodological innovations. It also discusses ongoing challenges such as variability in data, limited annotations, lack of interpretability, and issues with generalizability across different patient populations.
arXiv Detail & Related papers (2025-07-31T10:49:21Z) - Language Augmentation in CLIP for Improved Anatomy Detection on Multi-modal Medical Images [1.4680035572775536]
Vision-language models have emerged as a powerful tool for challenging multi-modal classification problems in the medical domain.
Existing research has focused on clinical descriptions for specific modalities or body regions, leaving a gap for a model providing entire-body multi-modal descriptions.
In this paper, we address this gap by automating the generation of standardized body station(s) and list of organ(s) across the whole body in multi-modal MR and CT radiological images.
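A zero-shot baseline for matching images to generated body-station descriptions can be built with off-the-shelf CLIP; the prompts below are illustrative stand-ins for the paper's augmented descriptions, and the checkpoint is the public OpenAI one rather than the authors' weights:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Illustrative body-station descriptions (not the paper's generated prompts).
stations = [
    "an axial CT image of the thorax showing the lungs and heart",
    "an axial CT image of the abdomen showing the liver and kidneys",
    "a sagittal MR image of the lumbar spine",
]
image = Image.open("scan.png")  # hypothetical input

inputs = processor(text=stations, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)  # (1, len(stations))
print(dict(zip(stations, probs[0].tolist())))
```

Station prediction then reduces to taking the argmax over the prompt scores.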
arXiv Detail & Related papers (2024-05-31T09:59:11Z) - Integrating Medical Imaging and Clinical Reports Using Multimodal Deep Learning for Advanced Disease Analysis [3.8758525789991896]
An innovative multi-modal deep learning model is proposed to deeply integrate heterogeneous information from medical images and clinical reports.
For medical images, convolutional neural networks are used to extract high-dimensional features and capture key visual information.
For clinical report text, a bidirectional long short-term memory (BiLSTM) network combined with an attention mechanism is used for deep semantic understanding.
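Putting the two branches together, a minimal PyTorch sketch of the described architecture, with placeholder layer sizes rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class ImageReportFusion(nn.Module):
    """Hedged sketch of the described design: a CNN image branch, a BiLSTM
    text branch with attention pooling, and a fused classifier head."""
    def __init__(self, vocab_size, n_classes, emb=128, hid=128):
        super().__init__()
        self.cnn = nn.Sequential(                       # image feature extractor
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(8),
            nn.Flatten(), nn.Linear(16 * 8 * 8, hid))
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hid, 1)               # additive attention scores
        self.head = nn.Linear(hid + 2 * hid, n_classes)

    def forward(self, image, tokens):
        img_feat = self.cnn(image)                      # (B, hid)
        out, _ = self.lstm(self.embed(tokens))          # (B, T, 2*hid)
        w = self.attn(out).softmax(dim=1)               # (B, T, 1) token weights
        txt_feat = (w * out).sum(dim=1)                 # attention-pooled text
        return self.head(torch.cat([img_feat, txt_feat], dim=-1))
```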
arXiv Detail & Related papers (2024-05-23T02:22:10Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
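Subfigure-to-subcaption mapping often starts from panel markers such as "(A)", "(B)" in the caption text. A toy sketch of that single step (real pipelines also need subfigure image detection, which this ignores):

```python
import re

def split_subcaptions(caption: str) -> dict:
    """Split a compound figure caption into per-panel captions keyed by
    panel letter. Illustrative heuristic, not the paper's pipeline."""
    parts = re.split(r"\(([A-Za-z])\)\s*", caption)
    # re.split keeps the captured letters: ['intro ', 'A', 'text ', 'B', ...]
    return {parts[i]: parts[i + 1].strip(" .;")
            for i in range(1, len(parts) - 1, 2)}

print(split_subcaptions(
    "Brain MRI. (A) T1-weighted axial slice. (B) T2-weighted slice with lesion."
))
# {'A': 'T1-weighted axial slice', 'B': 'T2-weighted slice with lesion'}
```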
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding [72.18719355481052]
We introduce a novel task called Medical Report Grounding (MRG). MRG aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner. We propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases.
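The task definition suggests an output type pairing each diagnostic phrase with a box and, given the uncertainty-aware framing, a confidence. A hypothetical schema, with field names that are assumptions rather than uMedGround's API:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GroundedPhrase:
    phrase: str                          # e.g. "left lower lobe consolidation"
    box_xyxy: Tuple[int, int, int, int]  # grounding box in pixel coordinates
    confidence: float                    # uncertainty-aware score in [0, 1]

def ground_report(model, image, report: str) -> List[GroundedPhrase]:
    """End-to-end: report in, grounded diagnostic phrases out (illustrative)."""
    return model.predict(image=image, report=report)
```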
arXiv Detail & Related papers (2024-04-10T07:41:35Z) - C^2M-DoT: Cross-modal consistent multi-view medical report generation with domain transfer network [67.97926983664676]
We propose C^2M-DoT, a cross-modal consistent multi-view medical report generation framework with a domain transfer network.
C^2M-DoT substantially outperforms state-of-the-art baselines on all metrics.
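The abstract does not detail the losses; one guess at what "cross-modal consistent multi-view" could mean is an agreement term between report semantics decoded from different views of the same study:

```python
import torch.nn.functional as F

def multiview_consistency_loss(report_emb_view1, report_emb_view2):
    """Hedged guess at a multi-view consistency objective: encourage the
    report semantics derived from two views of the same study to agree.
    The actual C^2M-DoT losses are not given in the abstract."""
    sim = F.cosine_similarity(report_emb_view1, report_emb_view2, dim=-1)
    return 1 - sim.mean()
```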
arXiv Detail & Related papers (2023-10-09T02:31:36Z) - A Transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics [63.106382317917344]
We report a Transformer-based representation-learning model as a clinical diagnostic aid that processes multimodal input in a unified manner.
The unified model outperformed an image-only model and non-unified multimodal diagnosis models in the identification of pulmonary diseases.
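"Unified processing" typically means projecting both modalities into one token sequence for a single encoder. A minimal PyTorch sketch with placeholder dimensions:

```python
import torch
import torch.nn as nn

class UnifiedMultimodalEncoder(nn.Module):
    """Minimal sketch of unified processing: project image patches and text
    tokens into one embedding space and run a single Transformer encoder
    over the concatenated sequence. Dimensions are placeholders."""
    def __init__(self, vocab_size, d=256, patch_dim=16 * 16, n_classes=14):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d)   # flattened patches -> tokens
        self.tok_embed = nn.Embedding(vocab_size, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.cls_head = nn.Linear(d, n_classes)

    def forward(self, patches, tokens):
        seq = torch.cat([self.patch_proj(patches),        # (B, P, d) image tokens
                         self.tok_embed(tokens)], dim=1)  # (B, T, d) text tokens
        enc = self.encoder(seq)                 # one encoder over both modalities
        return self.cls_head(enc.mean(dim=1))   # pooled multi-label logits
```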
arXiv Detail & Related papers (2023-06-01T16:23:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.