3DMedAgent: Unified Perception-to-Understanding for 3D Medical Analysis
- URL: http://arxiv.org/abs/2602.18064v1
- Date: Fri, 20 Feb 2026 08:31:26 GMT
- Title: 3DMedAgent: Unified Perception-to-Understanding for 3D Medical Analysis
- Authors: Ziyue Wang, Linghan Cai, Chang Han Low, Haofeng Liu, Junde Wu, Jingyu Wang, Rui Wang, Lei Song, Jiang Bian, Jingjing Fu, Yueming Jin
- Abstract summary: 3DMedAgent is a unified agent that enables 2D MLLMs to perform general 3D CT analysis without 3D-specific fine-tuning. Experiments across over 40 tasks demonstrate that 3DMedAgent consistently outperforms general, medical, and 3D-specific MLLMs.
- Score: 42.29123264398027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D CT analysis spans a continuum from low-level perception to high-level clinical understanding. Existing 3D-oriented analysis methods adopt either isolated task-specific modeling or task-agnostic end-to-end paradigms to produce one-hop outputs, impeding the systematic accumulation of perceptual evidence for downstream reasoning. In parallel, recent multimodal large language models (MLLMs) exhibit improved visual perception and can integrate visual and textual information effectively, yet their predominantly 2D-oriented designs fundamentally limit their ability to perceive and analyze volumetric medical data. To bridge this gap, we propose 3DMedAgent, a unified agent that enables 2D MLLMs to perform general 3D CT analysis without 3D-specific fine-tuning. 3DMedAgent coordinates heterogeneous visual and textual tools through a flexible MLLM agent, progressively decomposing complex 3D analysis into tractable subtasks that transition from global to regional views, from 3D volumes to informative 2D slices, and from visual evidence to structured textual representations. Central to this design, 3DMedAgent maintains a long-term structured memory that aggregates intermediate tool outputs and supports query-adaptive, evidence-driven multi-step reasoning. We further introduce the DeepChestVQA benchmark for evaluating unified perception-to-understanding capabilities in 3D thoracic imaging. Experiments across over 40 tasks demonstrate that 3DMedAgent consistently outperforms general, medical, and 3D-specific MLLMs, highlighting a scalable path toward general-purpose 3D clinical assistants. Code and data are available at \href{https://github.com/jinlab-imvr/3DMedAgent}{https://github.com/jinlab-imvr/3DMedAgent}.
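The abstract describes an agent loop that decomposes 3D analysis into subtasks (global region localization, 3D-to-2D slice selection, visual-to-textual description) and aggregates tool outputs in a long-term structured memory. A minimal sketch of that control flow, with entirely hypothetical tool names and stubbed outputs standing in for the paper's actual tools:

```python
from dataclasses import dataclass, field

@dataclass
class StructuredMemory:
    """Long-term store aggregating intermediate tool outputs (illustrative)."""
    entries: list = field(default_factory=list)

    def write(self, tool: str, finding: str) -> None:
        self.entries.append({"tool": tool, "finding": finding})

    def summarize(self) -> str:
        # Collapse accumulated evidence into a textual context for reasoning.
        return "; ".join(f"[{e['tool']}] {e['finding']}" for e in self.entries)

# Hypothetical stand-ins for the agent's visual/textual tools:
def locate_region(volume_id: str) -> str:    # global -> regional view
    return f"lesion candidate in right lower lobe of {volume_id}"

def select_slice(volume_id: str) -> str:     # 3D volume -> informative 2D slice
    return f"axial slice 142 of {volume_id} is most informative"

def describe_slice(volume_id: str) -> str:   # visual evidence -> structured text
    return f"slice shows a 12 mm solid nodule in {volume_id}"

def run_agent(query: str, volume_id: str) -> str:
    """Run the progressive decomposition and answer from accumulated evidence."""
    memory = StructuredMemory()
    for tool in (locate_region, select_slice, describe_slice):
        memory.write(tool.__name__, tool(volume_id))
    return f"Q: {query} | Evidence: {memory.summarize()}"

print(run_agent("Is there a pulmonary nodule?", "ct_001"))
```

This only illustrates the perception-to-understanding pipeline shape; in 3DMedAgent the tool selection is query-adaptive and driven by the MLLM rather than a fixed sequence.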
Related papers
- Training-Free Zero-Shot Anomaly Detection in 3D Brain MRI with 2D Foundation Models [0.0]
We introduce a fully training-free framework for ZSAD in 3D brain MRI. The framework constructs localized volumetric tokens by aggregating multi-axis slices processed by 2D foundation models. These 3D patch tokens restore cubic spatial context and integrate directly with distance-based, batch-level anomaly detection pipelines.
arXiv Detail & Related papers (2026-02-17T02:46:45Z) - MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation [11.762545584252052]
We propose a unified 3D medical multimodal model that supports report generation, VQA, and multi-paradigm segmentation. MedVL-SAM2 integrates image-level reasoning and pixel-level perception through a cohesive architecture tailored for 3D medical imaging. Our unified architecture delivers state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks.
arXiv Detail & Related papers (2026-01-14T21:21:00Z) - Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z) - Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models [5.020980730631682]
Existing 3D medical convolution and transformer-based self-supervised learning (SSL) methods often lack deep semantic comprehension. Recent advancements in multimodal large language models (MLLMs) provide a promising approach to enhance image understanding through text descriptions. We propose Med3DInsight, a novel pretraining framework that integrates 3D image encoders with 2D MLLMs via a specially designed plane-slice-aware transformer module.
arXiv Detail & Related papers (2025-09-11T00:12:59Z) - MG-3D: Multi-Grained Knowledge-Enhanced 3D Medical Vision-Language Pre-training [7.968487067774351]
3D medical image analysis is pivotal in numerous clinical applications. Large-scale vision-language pre-training remains underexplored in 3D medical image analysis. We propose MG-3D, pre-trained on large-scale data (47.1K).
arXiv Detail & Related papers (2024-12-08T09:45:59Z) - Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding [59.51535163599723]
FreeGS is an unsupervised semantic-embedded 3DGS framework that achieves view-consistent 3D scene understanding without the need for 2D labels. FreeGS performs comparably to state-of-the-art methods while avoiding the complex data preprocessing workload.
arXiv Detail & Related papers (2024-11-29T08:52:32Z) - Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model [17.69323209661274]
We propose Med-2E3, a 3D medical MLLM that integrates a dual 3D-2D encoder architecture. To aggregate 2D features effectively, we design a Text-Guided Inter-Slice (TG-IS) scoring module. Experiments on large-scale, open-source 3D medical multimodal datasets demonstrate that TG-IS exhibits task-specific attention distribution.
arXiv Detail & Related papers (2024-11-19T09:59:59Z) - When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models [130.40123493752816]
This survey provides a comprehensive overview of the methodologies enabling large language models to process, understand, and generate 3D data. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue.
arXiv Detail & Related papers (2024-05-16T16:59:58Z) - M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models [49.5030774873328]
Previous research has primarily focused on 2D medical images, leaving 3D images under-explored, despite their richer spatial information.
We present a large-scale 3D multi-modal medical dataset, M3D-Data, comprising 120K image-text pairs and 662K instruction-response pairs.
We also introduce a new 3D multi-modal medical benchmark, M3D-Bench, which facilitates automatic evaluation across eight tasks.
arXiv Detail & Related papers (2024-03-31T06:55:12Z) - Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z) - T3D: Advancing 3D Medical Vision-Language Pre-training by Learning Multi-View Visual Consistency [32.57915952175522]
3D medical vision-language pre-training remains underexplored due to the lack of a large-scale, publicly available 3D medical image-report dataset. To bridge this gap, we introduce **CT-3Dlots**, the first and largest **public** 3D volume-report dataset. We propose the **T3D** framework, which enhances 3D medical image understanding beyond naive CLIP-style alignment. Our results show that T3D consistently outperforms existing vSSL and multimodal methods, demonstrating superior zero-shot and fine-tuning capabilities.
arXiv Detail & Related papers (2023-12-03T23:03:22Z) - 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features [70.50665869806188]
3DiffTection is a state-of-the-art method for 3D object detection from single images.
We fine-tune a diffusion model to perform novel view synthesis conditioned on a single image.
We further train the model on target data with detection supervision.
arXiv Detail & Related papers (2023-11-07T23:46:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.