MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation
- URL: http://arxiv.org/abs/2601.09879v1
- Date: Wed, 14 Jan 2026 21:21:00 GMT
- Title: MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation
- Authors: Yang Xing, Jiong Wu, Savas Ozdemir, Ying Zhang, Yang Yang, Wei Shao, Kuang Gong
- Abstract summary: We propose a unified 3D medical multimodal model that supports report generation, VQA, and multi-paradigm segmentation. MedVL-SAM2 integrates image-level reasoning and pixel-level perception through a cohesive architecture tailored for 3D medical imaging. Our unified architecture delivers state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks.
- Score: 11.762545584252052
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in medical vision-language models (VLMs) has achieved strong performance on image-level text-centric tasks such as report generation and visual question answering (VQA). However, achieving fine-grained visual grounding and volumetric spatial reasoning in 3D medical VLMs remains challenging, particularly when aiming to unify these capabilities within a single, generalizable framework. To address this challenge, we propose MedVL-SAM2, a unified 3D medical multimodal model that concurrently supports report generation, VQA, and multi-paradigm segmentation, including semantic, referring, and interactive segmentation. MedVL-SAM2 integrates image-level reasoning and pixel-level perception through a cohesive architecture tailored for 3D medical imaging, and incorporates a SAM2-based volumetric segmentation module to enable precise multi-granular spatial reasoning. The model is trained in a multi-stage pipeline: it is first pre-trained on a large-scale corpus of 3D CT image-text pairs to align volumetric visual features with radiology-language embeddings. It is then jointly optimized with both language-understanding and segmentation objectives using a comprehensive 3D CT segmentation dataset. This joint training enables flexible interaction via language, point, or box prompts, thereby unifying high-level visual reasoning with spatially precise localization. Our unified architecture delivers state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks. Extensive analyses further show that the model provides reliable 3D visual grounding, controllable interactive segmentation, and robust cross-modal reasoning, demonstrating that high-level semantic reasoning and precise 3D localization can be jointly achieved within a unified 3D medical VLM.
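The abstract's joint optimization of language-understanding and segmentation objectives is typically realized as a weighted sum of a token-level cross-entropy and a volumetric Dice loss. The paper does not spell out its loss; the sketch below is a minimal NumPy illustration, and the weighting `lam` and the exact loss terms are assumptions, not details taken from MedVL-SAM2.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice loss between a predicted probability volume and a
    # binary ground-truth mask of the same shape (e.g. D x H x W).
    inter = np.sum(pred * target)
    denom = np.sum(pred) + np.sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def token_nll(logits, targets):
    # Mean negative log-likelihood of target token ids under a
    # numerically stable log-softmax of the logits (T x V).
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def joint_loss(logits, targets, pred_mask, gt_mask, lam=1.0):
    # Weighted sum of the language and segmentation objectives.
    return token_nll(logits, targets) + lam * dice_loss(pred_mask, gt_mask)
```

In practice the two terms are computed on different heads of the same backbone, so the weight `lam` balances gradient scales between the text decoder and the segmentation module.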
Related papers
- 3DMedAgent: Unified Perception-to-Understanding for 3D Medical Analysis [42.29123264398027]
3DMedAgent is a unified agent that enables 2D MLLMs to perform general 3D CT analysis without 3D-specific fine-tuning. Experiments across over 40 tasks demonstrate that 3DMedAgent consistently outperforms general, medical, and 3D-specific MLLMs.
arXiv Detail & Related papers (2026-02-20T08:31:26Z) - SwinTF3D: A Lightweight Multimodal Fusion Approach for Text-Guided 3D Medical Image Segmentation [0.30586855806896035]
We propose SwinTF3D, a lightweight multimodal fusion approach that unifies visual and linguistic representations for text-guided 3D medical image segmentation. SwinTF3D achieves competitive Dice and IoU scores across multiple organs, despite its compact architecture.
arXiv Detail & Related papers (2025-12-28T11:00:05Z) - Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z) - Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models [5.020980730631682]
Existing 3D medical convolution- and transformer-based self-supervised learning (SSL) methods often lack deep semantic comprehension. Recent advancements in multimodal large language models (MLLMs) provide a promising approach to enhance image understanding through text descriptions. We propose Med3DInsight, a novel pretraining framework that integrates 3D image encoders with 2D MLLMs via a specially designed plane-slice-aware transformer module.
arXiv Detail & Related papers (2025-09-11T00:12:59Z) - VELVET-Med: Vision and Efficient Language Pre-training for Volumetric Imaging Tasks in Medicine [11.993301266706139]
We propose a vision-language pre-training framework, termed VELVET-Med, specifically designed for limited volumetric data such as 3D CT and associated radiology reports. Our approach seeks to uncover rich spatial and semantic relationships embedded in volumetric medical images and corresponding clinical narratives. The resulting encoders exhibit strong transferability, achieving state-of-the-art performance across a wide range of downstream tasks.
arXiv Detail & Related papers (2025-08-16T17:08:43Z) - Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset [56.533371387182065]
MV-ScanQA is a novel 3D question answering dataset where 68% of questions explicitly require integrating information from multiple views. We present TripAlign, a large-scale and low-cost 2D-3D-language pre-training corpus containing 1M ⟨2D view, set of 3D objects, text⟩ triplets. We further develop LEGO, a baseline method for the multi-view reasoning challenge in MV-ScanQA, transferring knowledge from pre-trained 2D LVLMs to the 3D domain with TripAlign.
arXiv Detail & Related papers (2025-08-14T20:35:59Z) - VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction [86.82819259860186]
We introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding.
arXiv Detail & Related papers (2025-05-26T17:56:30Z) - Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments. We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z) - Read Like a Radiologist: Efficient Vision-Language Model for 3D Medical Imaging Interpretation [40.73779035606757]
We introduce MS-VLM, which mimics radiologists' workflow in 3D medical image interpretation. Specifically, radiologists analyze 3D medical images by examining individual slices sequentially and synthesizing information across slices and views. MS-VLM is capable of obtaining useful volumetric representations from 3D medical images with any slice length and from multiple images acquired from different planes and phases.
arXiv Detail & Related papers (2024-12-18T07:19:48Z) - MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds MMScan, the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations. The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions, as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z) - RadGenome-Chest CT: A Grounded Vision-Language Dataset for Chest CT Analysis [56.57177181778517]
RadGenome-Chest CT is a large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE.
We leverage the latest powerful universal segmentation and large language models to extend the original datasets.
arXiv Detail & Related papers (2024-04-25T17:11:37Z) - CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates it can identify organs and abnormalities in a zero-shot manner using natural languages.
arXiv Detail & Related papers (2024-04-23T17:59:01Z) - T3D: Advancing 3D Medical Vision-Language Pre-training by Learning Multi-View Visual Consistency [32.57915952175522]
3D medical vision-language pre-training remains underexplored due to the lack of a large-scale, publicly available 3D medical image-report dataset. To bridge this gap, we introduce CT-3Dlots, the first and largest public 3D volume-report dataset. We propose the T3D framework, which enhances 3D medical image understanding beyond naive CLIP-style alignment. Our results show that T3D consistently outperforms existing vSSL and multimodal methods, demonstrating superior zero-shot and fine-tuning capabilities.
arXiv Detail & Related papers (2023-12-03T23:03:22Z)
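Several of the papers above (CT-GLIP, VELVET-Med, T3D) pre-train by contrastively aligning volumetric image features with report embeddings. A minimal NumPy sketch of the symmetric InfoNCE objective that such CLIP-style methods build on follows; the function name and the temperature value are illustrative assumptions, not taken from any of the listed papers.

```python
import numpy as np

def clip_alignment_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over L2-normalized embeddings; row i of each
    # matrix is assumed to be a matched image-text (e.g. organ-report) pair.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    labels = np.arange(img.shape[0])

    def nll(mat):
        # Cross-entropy of each row against its diagonal (matched) entry,
        # via a numerically stable log-softmax.
        z = mat - mat.max(axis=1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (nll(logits) + nll(logits.T))
```

Loss decreases as each image embedding moves closer to its paired report embedding than to the other reports in the batch, which is what produces the zero-shot organ and abnormality recognition reported by CT-GLIP.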
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.