Related papers: Med3DInsight: Enhancing 3D Medical Image Understanding with 2D Multi-Modal Large Language Models

Med3DInsight: Enhancing 3D Medical Image Understanding with 2D Multi-Modal Large Language Models

URL: http://arxiv.org/abs/2403.05141v1
Date: Fri, 8 Mar 2024 08:15:53 GMT
Title: Med3DInsight: Enhancing 3D Medical Image Understanding with 2D Multi-Modal Large Language Models
Authors: Qiuhui Chen, Huping Ye, Yi Hong
Abstract summary: Existing 3D convolution and transformer-based methods have limited semantic understanding of an image volume. We propose Med3DInsight, which marries existing 3D image encoders with 2D MLLMs and bridges them via a Plane-Slice-Aware Transformer (PSAT) module.
Score: 1.64647940449869
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding 3D medical image volumes is a critical task in the medical domain. However, existing 3D convolution and transformer-based methods have limited semantic understanding of an image volume and also need a large set of volumes for training. Recent advances in multi-modal large language models (MLLMs) provide a new and promising way to understand images with the help of text descriptions. However, most current MLLMs are designed for 2D natural images. To enhance the 3D medical image understanding with 2D MLLMs, we propose a novel pre-training framework called Med3DInsight, which marries existing 3D image encoders with 2D MLLMs and bridges them via a designed Plane-Slice-Aware Transformer (PSAT) module. Extensive experiments demonstrate our SOTA performance on two downstream segmentation and classification tasks, including three public datasets with CT and MRI modalities and comparison to more than ten baselines. Med3DInsight can be easily integrated into any current 3D medical image understanding network and improves its performance by a good margin.

Related papers

Unifying 2D and 3D Vision-Language Understanding [85.84054120018625]
We introduce UniVLG, a unified architecture for 2D and 3D vision-language learning. UniVLG bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems.
arXiv Detail & Related papers (2025-03-13T17:56:22Z)
DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions [6.464464511743737]
We introduce DCFormer, an efficient 3D image encoder that factorizes 3D convolutions into three parallel 1D convolutions along the depth, height, and width dimensions. In zero-shot and fine-tuned detection of 18 pathologies, DCFormer consistently outperforms state-of-the-art 3D vision encoders.
arXiv Detail & Related papers (2025-02-07T17:10:22Z)
3D-MoE: A Mixture-of-Experts Multi-modal LLM for 3D Vision and Pose Diffusion via Rectified Flow [69.94527569577295]
3D vision and spatial reasoning have long been recognized as preferable for accurately perceiving our three-dimensional world. Due to the difficulties in collecting high-quality 3D data, research in this area has only recently gained momentum. We propose converting existing densely activated LLMs into mixture-of-experts (MoE) models, which have proven effective for multi-modal data processing.
arXiv Detail & Related papers (2025-01-28T04:31:19Z)
Read Like a Radiologist: Efficient Vision-Language Model for 3D Medical Imaging Interpretation [40.73779035606757]
We introduce MS-VLM that mimic radiologists' workflow in 3D medical image interpretation. Specifically, radiologists analyze 3D medical images by examining individual slices sequentially and synthesizing information across slices and views. MS-VLM is capable of obtaining useful volumetric representations from 3D medical images with any slice length and from multiple images acquired from different planes and phases.
arXiv Detail & Related papers (2024-12-18T07:19:48Z)
Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model [16.93216342922561]
We propose Med-2E3, a novel MLLM for 3D medical image analysis that integrates 3D and 2D encoders. To aggregate 2D features more effectively, we design a Text-Guided Inter-Slice (TG-IS) scoring module, which scores the attention of each 2D slice based on slice contents and task instructions. Experiments on a large-scale, open-source 3D medical multimodal benchmark demonstrate that Med-2E3 exhibits task-specific attention distribution and significantly outperforms current state-of-the-art models.
arXiv Detail & Related papers (2024-11-19T09:59:59Z)
Cross-D Conv: Cross-Dimensional Transferable Knowledge Base via Fourier Shifting Operation [3.69758875412828]
Cross-D Conv operation bridges the dimensional gap by learning the phase shifting in the Fourier domain. Our method enables seamless weight transfer between 2D and 3D convolution operations, effectively facilitating cross-dimensional learning.
arXiv Detail & Related papers (2024-11-02T13:03:44Z)
M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models [49.5030774873328]
Previous research has primarily focused on 2D medical images, leaving 3D images under-explored, despite their richer spatial information. We present a large-scale 3D multi-modal medical dataset, M3D-Data, comprising 120K image-text pairs and 662K instruction-response pairs. We also introduce a new 3D multi-modal medical benchmark, M3D-Bench, which facilitates automatic evaluation across eight tasks.
arXiv Detail & Related papers (2024-03-31T06:55:12Z)
Generative Enhancement for 3D Medical Images [74.17066529847546]
We propose GEM-3D, a novel generative approach to the synthesis of 3D medical images. Our method begins with a 2D slice, noted as the informed slice to serve the patient prior, and propagates the generation process using a 3D segmentation mask. By decomposing the 3D medical images into masks and patient prior information, GEM-3D offers a flexible yet effective solution for generating versatile 3D images.
arXiv Detail & Related papers (2024-03-19T15:57:04Z)
TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding [28.112402580426174]
TriAdapter Multi-Modal Learning (TAMM) is a novel two-stage learning approach based on three synergistic adapters. TAMM consistently enhances 3D representations for a wide range of 3D encoder architectures, pre-training datasets, and downstream tasks.
arXiv Detail & Related papers (2024-02-28T17:18:38Z)
T3D: Towards 3D Medical Image Understanding through Vision-Language Pre-training [33.548818136506334]
We introduce T3D, the first framework designed for high-resolution 3D medical images. T3D incorporates two text-informed pretext tasks: (lowerromannumeral1) text-informed contrastive learning; (lowerromannumeral2) text-informed image restoration. T3D significantly outperforms current vSSL methods in tasks like organ and tumor segmentation, as well as disease classification.
arXiv Detail & Related papers (2023-12-03T23:03:22Z)
JM3D & JM3D-LLM: Elevating 3D Understanding with Joint Multi-modal Cues [68.76032126906743]
We introduce JM3D, a comprehensive approach integrating point cloud, text, and image. Key contributions include the Structured Multimodal Organizer (SMO), enriching vision-language representation with multiple views and hierarchical text. Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning.
arXiv Detail & Related papers (2023-10-14T06:13:20Z)
Spatiotemporal Modeling Encounters 3D Medical Image Analysis: Slice-Shift UNet with Multi-View Fusion [0.0]
We propose a new 2D-based model dubbed Slice SHift UNet which encodes three-dimensional features at 2D CNN's complexity. More precisely multi-view features are collaboratively learned by performing 2D convolutions along the three planes of a volume. The effectiveness of our approach is validated in Multi-Modality Abdominal Multi-Organ axis (AMOS) and Multi-Atlas Labeling Beyond the Cranial Vault (BTCV) datasets.
arXiv Detail & Related papers (2023-07-24T14:53:23Z)
Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore. We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding [96.95120198412395]
We introduce tri-modal pre-training framework that automatically generates holistic language descriptions for 3D shapes. It only needs 3D data as input, eliminating the need for any manual 3D annotations, and is therefore scalable to large datasets. We conduct experiments on two large-scale 3D datasets, NN and ShapeNet, and augment them with tri-modal datasets of 3D point clouds, captioning, and language for training. Experiments show that NN-2 demonstrates substantial benefits in three downstream tasks: zero-shot 3D classification, standard 3D classification with finetuning, and 3D (3D
arXiv Detail & Related papers (2023-05-14T23:14:09Z)
MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification [59.10015984688104]
MedMNIST v2 is a large-scale MNIST-like dataset collection of standardized biomedical images. The resulting dataset consists of 708,069 2D images and 10,214 3D images in total.
arXiv Detail & Related papers (2021-10-27T22:02:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.