MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation
- URL: http://arxiv.org/abs/2601.09879v1
- Date: Wed, 14 Jan 2026 21:21:00 GMT
- Title: MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation
- Authors: Yang Xing, Jiong Wu, Savas Ozdemir, Ying Zhang, Yang Yang, Wei Shao, Kuang Gong
- Abstract summary: We propose a unified 3D medical multimodal model that supports report generation, VQA, and multi-paradigm segmentation. MedVL-SAM2 integrates image-level reasoning and pixel-level perception through a cohesive architecture tailored for 3D medical imaging. Our unified architecture delivers state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks.
- Score: 11.762545584252052
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in medical vision-language models (VLMs) has achieved strong performance on image-level text-centric tasks such as report generation and visual question answering (VQA). However, achieving fine-grained visual grounding and volumetric spatial reasoning in 3D medical VLMs remains challenging, particularly when aiming to unify these capabilities within a single, generalizable framework. To address this challenge, we propose MedVL-SAM2, a unified 3D medical multimodal model that concurrently supports report generation, VQA, and multi-paradigm segmentation, including semantic, referring, and interactive segmentation. MedVL-SAM2 integrates image-level reasoning and pixel-level perception through a cohesive architecture tailored for 3D medical imaging, and incorporates a SAM2-based volumetric segmentation module to enable precise multi-granular spatial reasoning. The model is trained in a multi-stage pipeline: it is first pre-trained on a large-scale corpus of 3D CT image-text pairs to align volumetric visual features with radiology-language embeddings. It is then jointly optimized with both language-understanding and segmentation objectives using a comprehensive 3D CT segmentation dataset. This joint training enables flexible interaction via language, point, or box prompts, thereby unifying high-level visual reasoning with spatially precise localization. Our unified architecture delivers state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks. Extensive analyses further show that the model provides reliable 3D visual grounding, controllable interactive segmentation, and robust cross-modal reasoning, demonstrating that high-level semantic reasoning and precise 3D localization can be jointly achieved within a unified 3D medical VLM.
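The abstract's joint optimization of language-understanding and segmentation objectives is typically realized as a weighted sum of a token-level cross-entropy and a volumetric Dice loss. The paper does not spell out its loss; the sketch below is a minimal NumPy illustration, and the weighting `lam` and the exact loss terms are assumptions, not details taken from MedVL-SAM2.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice loss between a predicted probability volume and a
    # binary ground-truth mask of the same shape (e.g. D x H x W).
    inter = np.sum(pred * target)
    denom = np.sum(pred) + np.sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def token_nll(logits, targets):
    # Mean negative log-likelihood of target token ids under a
    # numerically stable log-softmax of the logits (T x V).
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def joint_loss(logits, targets, pred_mask, gt_mask, lam=1.0):
    # Weighted sum of the language and segmentation objectives.
    return token_nll(logits, targets) + lam * dice_loss(pred_mask, gt_mask)
```

In practice the two terms are computed on different heads of the same backbone, so the weight `lam` balances gradient scales between the text decoder and the segmentation module.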
Related papers
- 3DMedAgent: Unified Perception-to-Understanding for 3D Medical Analysis [42.29123264398027]
3DMedAgent is a unified agent that enables 2D MLLMs to perform general 3D CT analysis without 3D-specific fine-tuning. Experiments across over 40 tasks demonstrate that 3DMedAgent consistently outperforms general, medical, and 3D-specific MLLMs.
arXiv Detail & Related papers (2026-02-20T08:31:26Z) - SwinTF3D: A Lightweight Multimodal Fusion Approach for Text-Guided 3D Medical Image Segmentation [0.30586855806896035]
We propose SwinTF3D, a lightweight multimodal fusion approach that unifies visual and linguistic representations for text-guided 3D medical image segmentation. SwinTF3D achieves competitive Dice and IoU scores across multiple organs, despite its compact architecture.
arXiv Detail & Related papers (2025-12-28T11:00:05Z) - Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z) - Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models [5.020980730631682]
Existing 3D medical convolution- and transformer-based self-supervised learning (SSL) methods often lack deep semantic comprehension. Recent advancements in multimodal large language models (MLLMs) provide a promising approach to enhance image understanding through text descriptions. We propose Med3DInsight, a novel pretraining framework that integrates 3D image encoders with 2D MLLMs via a specially designed plane-slice-aware transformer module.
arXiv Detail & Related papers (2025-09-11T00:12:59Z) - VELVET-Med: Vision and Efficient Language Pre-training for Volumetric Imaging Tasks in Medicine [11.993301266706139]
We propose a vision-language pre-training framework, termed VELVET-Med, specifically designed for limited volumetric data such as 3D CT and associated radiology reports. Our approach seeks to uncover rich spatial and semantic relationships embedded in volumetric medical images and corresponding clinical narratives. The resulting encoders exhibit strong transferability, achieving state-of-the-art performance across a wide range of downstream tasks.
arXiv Detail & Related papers (2025-08-16T17:08:43Z) - Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset [56.533371387182065]
MV-ScanQA is a novel 3D question answering dataset where 68% of questions explicitly require integrating information from multiple views. We present TripAlign, a large-scale and low-cost 2D-3D-language pre-training corpus containing 1M ⟨2D view, set of 3D objects, text⟩ triplets. We further develop LEGO, a baseline method for the multi-view reasoning challenge in MV-ScanQA, transferring knowledge from pre-trained 2D LVLMs to the 3D domain with TripAlign.
arXiv Detail & Related papers (2025-08-14T20:35:59Z) - VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction [86.82819259860186]
We introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding.
arXiv Detail & Related papers (2025-05-26T17:56:30Z) - Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments. We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z) - Read Like a Radiologist: Efficient Vision-Language Model for 3D Medical Imaging Interpretation [40.73779035606757]
We introduce MS-VLM, which mimics radiologists' workflow in 3D medical image interpretation. Specifically, radiologists analyze 3D medical images by examining individual slices sequentially and synthesizing information across slices and views. MS-VLM is capable of obtaining useful volumetric representations from 3D medical images with any slice length and from multiple images acquired from different planes and phases.
arXiv Detail & Related papers (2024-12-18T07:19:48Z) - MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds MMScan, the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations. The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions, as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z) - RadGenome-Chest CT: A Grounded Vision-Language Dataset for Chest CT Analysis [56.57177181778517]
RadGenome-Chest CT is a large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE.
We leverage the latest powerful universal segmentation and large language models to extend the original datasets.
arXiv Detail & Related papers (2024-04-25T17:11:37Z) - CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates it can identify organs and abnormalities in a zero-shot manner using natural languages.
arXiv Detail & Related papers (2024-04-23T17:59:01Z) - T3D: Advancing 3D Medical Vision-Language Pre-training by Learning Multi-View Visual Consistency [32.57915952175522]
3D medical vision-language pre-training remains underexplored due to the lack of a large-scale, publicly available 3D medical image-report dataset. To bridge this gap, we introduce CT-3Dlots, the first and largest public 3D volume-report dataset. We propose the T3D framework, which enhances 3D medical image understanding beyond naive CLIP-style alignment. Our results show that T3D consistently outperforms existing vSSL and multimodal methods, demonstrating superior zero-shot and fine-tuning capabilities.
arXiv Detail & Related papers (2023-12-03T23:03:22Z)
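Several of the papers above (CT-GLIP, VELVET-Med, T3D) pre-train by contrastively aligning volumetric image features with report embeddings. A minimal NumPy sketch of the symmetric InfoNCE objective that such CLIP-style methods build on follows; the function name and the temperature value are illustrative assumptions, not taken from any of the listed papers.

```python
import numpy as np

def clip_alignment_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over L2-normalized embeddings; row i of each
    # matrix is assumed to be a matched image-text (e.g. organ-report) pair.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    labels = np.arange(img.shape[0])

    def nll(mat):
        # Cross-entropy of each row against its diagonal (matched) entry,
        # via a numerically stable log-softmax.
        z = mat - mat.max(axis=1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (nll(logits) + nll(logits.T))
```

Loss decreases as each image embedding moves closer to its paired report embedding than to the other reports in the batch, which is what produces the zero-shot organ and abnormality recognition reported by CT-GLIP.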
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.