M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models
- URL: http://arxiv.org/abs/2404.00578v1
- Date: Sun, 31 Mar 2024 06:55:12 GMT
- Title: M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models
- Authors: Fan Bai, Yuxin Du, Tiejun Huang, Max Q.-H. Meng, Bo Zhao
- Abstract summary: Previous research has primarily focused on 2D medical images, leaving 3D images under-explored, despite their richer spatial information.
We present a large-scale 3D multi-modal medical dataset, M3D-Data, comprising 120K image-text pairs and 662K instruction-response pairs.
We also introduce a new 3D multi-modal medical benchmark, M3D-Bench, which facilitates automatic evaluation across eight tasks.
- Score: 49.5030774873328
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical image analysis is essential to clinical diagnosis and treatment, which is increasingly supported by multi-modal large language models (MLLMs). However, previous research has primarily focused on 2D medical images, leaving 3D images under-explored, despite their richer spatial information. This paper aims to advance 3D medical image analysis with MLLMs. To this end, we present a large-scale 3D multi-modal medical dataset, M3D-Data, comprising 120K image-text pairs and 662K instruction-response pairs specifically tailored for various 3D medical tasks, such as image-text retrieval, report generation, visual question answering, positioning, and segmentation. Additionally, we propose M3D-LaMed, a versatile multi-modal large language model for 3D medical image analysis. Furthermore, we introduce a new 3D multi-modal medical benchmark, M3D-Bench, which facilitates automatic evaluation across eight tasks. Through comprehensive evaluation, our method proves to be a robust model for 3D medical image analysis, outperforming existing solutions. All code, data, and models are publicly available at: https://github.com/BAAI-DCAI/M3D.
Related papers
- Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model [16.93216342922561]
We propose Med-2E3, a novel MLLM for 3D medical image analysis that integrates 3D and 2D encoders.
To aggregate 2D features more effectively, we design a Text-Guided Inter-Slice (TG-IS) scoring module, which scores the attention of each 2D slice based on slice contents and task instructions.
Experiments on a large-scale, open-source 3D medical multimodal benchmark demonstrate that Med-2E3 exhibits task-specific attention distribution and significantly outperforms current state-of-the-art models.
arXiv Detail & Related papers (2024-11-19T09:59:59Z)
- E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model [23.56751925900571]
The development of 3D medical vision-language models holds significant potential for disease diagnosis and patient treatment.
We utilize self-supervised learning to construct a 3D visual foundation model for extracting 3D visual features.
We apply 3D spatial convolutions to aggregate and project high-level image features, reducing computational complexity.
Our model demonstrates superior performance compared to existing methods in report generation, visual question answering, and disease diagnosis.
arXiv Detail & Related papers (2024-10-18T06:31:40Z)
- 3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models [51.855377054763345]
This paper introduces 3D-CT-GPT, a Visual Question Answering (VQA)-based medical visual language model for generating radiology reports from 3D CT scans.
Experiments on both public and private datasets demonstrate that 3D-CT-GPT significantly outperforms existing methods in terms of report accuracy and quality.
arXiv Detail & Related papers (2024-09-28T12:31:07Z)
- Autoregressive Sequence Modeling for 3D Medical Image Representation [48.706230961589924]
We introduce a pioneering method for learning 3D medical image representations through an autoregressive sequence pre-training framework.
Our approach organizes various 3D medical images based on spatial, contrast, and semantic correlations, treating them as interconnected visual tokens within a token sequence.
arXiv Detail & Related papers (2024-09-13T10:19:10Z)
- Generative Enhancement for 3D Medical Images [74.17066529847546]
We propose GEM-3D, a novel generative approach to the synthesis of 3D medical images.
Our method begins with a 2D slice, noted as the informed slice to serve the patient prior, and propagates the generation process using a 3D segmentation mask.
By decomposing the 3D medical images into masks and patient prior information, GEM-3D offers a flexible yet effective solution for generating versatile 3D images.
arXiv Detail & Related papers (2024-03-19T15:57:04Z)
- Med3DInsight: Enhancing 3D Medical Image Understanding with 2D Multi-Modal Large Language Models [1.64647940449869]
Existing 3D convolution and transformer-based methods have limited semantic understanding of an image volume.
We propose Med3DInsight, which marries existing 3D image encoders with 2D MLLMs and bridges them via a Plane-Slice-Aware Transformer (PSAT) module.
arXiv Detail & Related papers (2024-03-08T08:15:53Z)
- 3D-MIR: A Benchmark and Empirical Study on 3D Medical Image Retrieval in Radiology [6.851500027718433]
The field of 3D medical image retrieval is still emerging, lacking established evaluation benchmarks, comprehensive datasets, and thorough studies.
This paper introduces a novel benchmark for 3D Medical Image Retrieval (3D-MIR) that encompasses four different anatomies imaged with computed tomography.
Using this benchmark, we explore a diverse set of search strategies that use aggregated 2D slices, 3D volumes, and multi-modal embeddings from popular multi-modal foundation models as queries.
arXiv Detail & Related papers (2023-11-23T00:57:35Z)
- JM3D & JM3D-LLM: Elevating 3D Understanding with Joint Multi-modal Cues [68.76032126906743]
We introduce JM3D, a comprehensive approach integrating point cloud, text, and image.
Key contributions include the Structured Multimodal Organizer (SMO), enriching vision-language representation with multiple views and hierarchical text.
Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning.
arXiv Detail & Related papers (2023-10-14T06:13:20Z)
- 3D Matting: A Soft Segmentation Method Applied in Computed Tomography [26.25446145993599]
Three-dimensional (3D) images, such as CT, MRI, and PET, are common in medical imaging applications and important in clinical diagnosis.
Semantic ambiguity is a typical feature of many medical image labels.
In 2D medical images, using soft masks instead of binary masks generated by image matting to characterize lesions can provide rich semantic information.
arXiv Detail & Related papers (2022-09-16T10:18:59Z)
- MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification [59.10015984688104]
MedMNIST v2 is a large-scale MNIST-like dataset collection of standardized biomedical images.
The resulting dataset consists of 708,069 2D images and 10,214 3D images in total.
arXiv Detail & Related papers (2021-10-27T22:02:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.