Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis
- URL: http://arxiv.org/abs/2503.20047v1
- Date: Tue, 25 Mar 2025 20:09:30 GMT
- Title: Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis
- Authors: Yu Xin, Gorkem Can Ates, Kuang Gong, Wei Shao
- Abstract summary: Vision-language models (VLMs) have shown promise in 2D medical image analysis, but extending them to 3D remains challenging. We present Med3DVLM, a 3D VLM designed to address these challenges through three key innovations. We evaluate our model on the M3D dataset, which includes radiology reports and VQA data for 120,084 3D medical images.
- Score: 6.464464511743737
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-language models (VLMs) have shown promise in 2D medical image analysis, but extending them to 3D remains challenging due to the high computational demands of volumetric data and the difficulty of aligning 3D spatial features with clinical text. We present Med3DVLM, a 3D VLM designed to address these challenges through three key innovations: (1) DCFormer, an efficient encoder that uses decomposed 3D convolutions to capture fine-grained spatial features at scale; (2) SigLIP, a contrastive learning strategy with pairwise sigmoid loss that improves image-text alignment without relying on large negative batches; and (3) a dual-stream MLP-Mixer projector that fuses low- and high-level image features with text embeddings for richer multi-modal representations. We evaluate our model on the M3D dataset, which includes radiology reports and VQA data for 120,084 3D medical images. Results show that Med3DVLM achieves superior performance across multiple benchmarks. For image-text retrieval, it reaches 61.00% R@1 on 2,000 samples, significantly outperforming the current state-of-the-art M3D model (19.10%). For report generation, it achieves a METEOR score of 36.42% (vs. 14.38%). In open-ended visual question answering (VQA), it scores 36.76% METEOR (vs. 33.58%), and in closed-ended VQA, it achieves 79.95% accuracy (vs. 75.78%). These results highlight Med3DVLM's ability to bridge the gap between 3D imaging and language, enabling scalable, multi-task reasoning across clinical applications. Our code is publicly available at https://github.com/mirthAI/Med3DVLM.
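Of the three innovations, the pairwise sigmoid objective is the easiest to make concrete. Below is a minimal sketch of a SigLIP-style pairwise sigmoid image-text loss in PyTorch; the function and tensor names, and passing the temperature and bias in as plain scalars, are illustrative assumptions rather than the authors' released code.

```python
# Minimal sketch of a SigLIP-style pairwise sigmoid image-text loss.
# Assumes image/text embeddings are already L2-normalized; names are illustrative.
import torch
import torch.nn.functional as F

def pairwise_sigmoid_loss(img_emb, txt_emb, t, b):
    """img_emb, txt_emb: (N, D) normalized embeddings; t, b: learnable scalars."""
    logits = img_emb @ txt_emb.T * t + b  # (N, N) similarity logits
    # +1 on the diagonal (matched pairs), -1 everywhere else.
    labels = 2 * torch.eye(len(img_emb), device=img_emb.device) - 1
    # Each pair contributes an independent sigmoid term, so no large negative batch is needed.
    return -F.logsigmoid(labels * logits).mean()
```

In the SigLIP formulation the temperature and bias are learned jointly with the encoders; here they are passed in only to keep the sketch short.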
Related papers
- DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions [4.173963073705872]
We introduce DCFormer, an efficient 3D medical image encoder that factorizes 3D convolutions into three parallel 1D convolutions along depth, height, and width. DCFormer achieves superior efficiency and accuracy, with DCFormer-Tiny reaching 62.0% accuracy and a 46.3% F1-score.
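As a rough illustration of the factorization described above, one way to decompose a k×k×k convolution into three parallel 1D convolutions is sketched below; the kernel size, padding, and combining the branches by summation are assumptions for the sketch, not the DCFormer implementation.

```python
# Sketch: factorize a k*k*k 3D convolution into three parallel 1D convolutions
# along depth, height, and width, combined by summation. Illustrative only.
import torch.nn as nn

class DecomposedConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        p = k // 2  # "same" padding along the convolved axis
        self.conv_d = nn.Conv3d(in_ch, out_ch, kernel_size=(k, 1, 1), padding=(p, 0, 0))
        self.conv_h = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, 1), padding=(0, p, 0))
        self.conv_w = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 1, k), padding=(0, 0, p))

    def forward(self, x):  # x: (B, C, D, H, W)
        return self.conv_d(x) + self.conv_h(x) + self.conv_w(x)
```

Compared with a dense k×k×k kernel, the three 1D branches use on the order of 3k weights per channel pair instead of k³, which is where the efficiency gain comes from.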
arXiv Detail & Related papers (2025-02-07T17:10:22Z) - Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model [16.93216342922561]
We propose Med-2E3, a novel MLLM for 3D medical image analysis that integrates 3D and 2D encoders.
To aggregate 2D features more effectively, we design a Text-Guided Inter-Slice (TG-IS) scoring module, which assigns an attention score to each 2D slice based on slice content and the task instruction (a sketch follows below).
Experiments on a large-scale, open-source 3D medical multimodal benchmark demonstrate that Med-2E3 exhibits task-specific attention distribution and significantly outperforms current state-of-the-art models.
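A rough sketch of what text-guided slice scoring of this kind could look like is given below; the concatenation-based scoring function, the softmax pooling, and all names are assumptions, not the Med-2E3 implementation.

```python
# Sketch: score each 2D slice of a 3D volume against a task-instruction embedding,
# then aggregate slice features with the resulting attention weights. Illustrative only.
import torch
import torch.nn as nn

class TextGuidedSliceScoring(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # scores one (slice, instruction) pair

    def forward(self, slice_feats, text_feat):
        # slice_feats: (S, D) per-slice features; text_feat: (D,) instruction embedding
        text = text_feat.expand(slice_feats.size(0), -1)             # (S, D)
        scores = self.score(torch.cat([slice_feats, text], dim=-1))  # (S, 1)
        weights = torch.softmax(scores, dim=0)                       # attention over slices
        return (weights * slice_feats).sum(dim=0)                    # (D,) aggregated feature
```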
arXiv Detail & Related papers (2024-11-19T09:59:59Z) - E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model [23.56751925900571]
The development of 3D medical vision-language models holds significant potential for disease diagnosis and patient treatment.
We utilize self-supervised learning to construct a 3D visual foundation model for extracting 3D visual features.
We apply 3D spatial convolutions to aggregate and project high-level image features, reducing computational complexity.
Our model demonstrates superior performance compared to existing methods in report generation, visual question answering, and disease diagnosis.
arXiv Detail & Related papers (2024-10-18T06:31:40Z) - CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates that it can identify organs and abnormalities in a zero-shot manner using natural language.
arXiv Detail & Related papers (2024-04-23T17:59:01Z) - M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models [49.5030774873328]
Previous research has primarily focused on 2D medical images, leaving 3D images under-explored, despite their richer spatial information.
We present a large-scale 3D multi-modal medical dataset, M3D-Data, comprising 120K image-text pairs and 662K instruction-response pairs.
We also introduce a new 3D multi-modal medical benchmark, M3D-Bench, which facilitates automatic evaluation across eight tasks.
arXiv Detail & Related papers (2024-03-31T06:55:12Z) - Generative Enhancement for 3D Medical Images [74.17066529847546]
We propose GEM-3D, a novel generative approach to the synthesis of 3D medical images.
Our method begins with a 2D slice, termed the informed slice, which serves as the patient prior, and propagates the generation process using a 3D segmentation mask.
By decomposing the 3D medical images into masks and patient prior information, GEM-3D offers a flexible yet effective solution for generating versatile 3D images.
arXiv Detail & Related papers (2024-03-19T15:57:04Z) - T3D: Advancing 3D Medical Vision-Language Pre-training by Learning Multi-View Visual Consistency [32.57915952175522]
3D medical vision-language pre-training remains underexplored due to the lack of a large-scale, publicly available 3D medical image-report dataset. To bridge this gap, we introduce CT-3Dlots, the first and largest public 3D volume-report dataset. We propose the T3D framework, which enhances 3D medical image understanding beyond naive CLIP-style alignment. Our results show that T3D consistently outperforms existing vSSL and multimodal methods, demonstrating superior zero-shot and fine-tuning capabilities.
arXiv Detail & Related papers (2023-12-03T23:03:22Z) - Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation [72.94143731623117]
Existing methods simply align 3D representations with single-view 2D images and coarse-grained parent category text.
This insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space.
We propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image.
arXiv Detail & Related papers (2023-08-06T01:11:40Z) - MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification [59.10015984688104]
MedMNIST v2 is a large-scale MNIST-like dataset collection of standardized biomedical images.
The resulting dataset consists of 708,069 2D images and 10,214 3D images in total.
arXiv Detail & Related papers (2021-10-27T22:02:04Z) - Automated Model Design and Benchmarking of 3D Deep Learning Models for COVID-19 Detection with Chest CT Scans [72.04652116817238]
We propose a differentiable neural architecture search (DNAS) framework to automatically search for 3D DL models for the classification of 3D chest CT scans.
We also apply the Class Activation Mapping (CAM) technique to our models to make the results interpretable.
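For reference, a minimal sketch of classic CAM with a global-average-pooling classification head, adapted to 3D feature maps, is shown below; this is a generic illustration of the technique, not the paper's code, and all names are assumed.

```python
# Sketch: classic Class Activation Mapping (CAM) for a 3D CNN with a
# global-average-pooling head. Generic illustration, not the paper's code.
import torch

def class_activation_map_3d(feature_maps, fc_weight, class_idx):
    """feature_maps: (C, D, H, W) from the last conv layer;
    fc_weight: (num_classes, C) weights of the final linear layer."""
    w = fc_weight[class_idx]                             # (C,) class-specific weights
    cam = torch.einsum('c,cdhw->dhw', w, feature_maps)   # weighted sum over channels
    cam = torch.relu(cam)
    return cam / (cam.max() + 1e-8)                      # normalize to [0, 1]
```

The resulting (D, H, W) map is typically upsampled to the input CT resolution and overlaid on the scan to show which regions drive the prediction.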
arXiv Detail & Related papers (2021-01-14T03:45:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.