E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model
- URL: http://arxiv.org/abs/2410.14200v1
- Date: Fri, 18 Oct 2024 06:31:40 GMT
- Title: E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model
- Authors: Haoran Lai, Zihang Jiang, Qingsong Yao, Rongsheng Wang, Zhiyang He, Xiaodong Tao, Wei Wei, Weifu Lv, S. Kevin Zhou
- Abstract summary: The development of 3D medical vision-language models holds significant potential for disease diagnosis and patient treatment.
We utilize self-supervised learning to construct a 3D visual foundation model for extracting 3D visual features.
We apply 3D spatial convolutions to aggregate and project high-level image features, reducing computational complexity.
Our model demonstrates superior performance compared to existing methods in report generation, visual question answering, and disease diagnosis.
- Score: 23.56751925900571
- License:
- Abstract: The development of 3D medical vision-language models holds significant potential for disease diagnosis and patient treatment. However, compared to 2D medical images, 3D medical images, such as CT scans, face challenges related to limited training data and high dimension, which severely restrict the progress of 3D medical vision-language models. To address these issues, we collect a large amount of unlabeled 3D CT data and utilize self-supervised learning to construct a 3D visual foundation model for extracting 3D visual features. Then, we apply 3D spatial convolutions to aggregate and project high-level image features, reducing computational complexity while preserving spatial information. We also construct two instruction-tuning datasets based on BIMCV-R and CT-RATE to fine-tune the 3D vision-language model. Our model demonstrates superior performance compared to existing methods in report generation, visual question answering, and disease diagnosis. Code and data will be made publicly available soon.
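As a rough illustration of the projection step described in the abstract (3D spatial convolutions that aggregate high-level visual features and project them into the language model's embedding space), the PyTorch sketch below shows one plausible form; the module name `Projector3D`, the channel sizes, and the stride are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of the feature aggregation/projection described in the
# abstract: a strided 3D convolution pools high-level features from the visual
# foundation model, and a pointwise convolution projects them to the language
# model's embedding space. All shapes, names, and hyperparameters are assumed.
import torch
import torch.nn as nn

class Projector3D(nn.Module):
    def __init__(self, vis_dim=768, llm_dim=4096, stride=2):
        super().__init__()
        # Strided 3D convolution aggregates neighboring tokens along D/H/W,
        # reducing the number of visual tokens while keeping spatial layout.
        self.aggregate = nn.Conv3d(vis_dim, vis_dim, kernel_size=3,
                                   stride=stride, padding=1)
        # Pointwise projection into the language-model embedding dimension.
        self.project = nn.Conv3d(vis_dim, llm_dim, kernel_size=1)

    def forward(self, feats):
        # feats: (B, C, D, H, W) high-level 3D features from the vision encoder
        x = self.aggregate(feats)
        x = self.project(x)
        # Flatten the spatial grid into a token sequence for the LLM:
        # (B, llm_dim, D', H', W') -> (B, D'*H'*W', llm_dim)
        return x.flatten(2).transpose(1, 2)

# Example: a 768-channel 8x8x8 feature grid becomes 64 tokens of width 4096.
tokens = Projector3D()(torch.randn(1, 768, 8, 8, 8))
print(tokens.shape)  # torch.Size([1, 64, 4096])
```

Down-sampling with a strided 3D convolution keeps neighboring-voxel structure while cutting the number of visual tokens passed to the language model, which is consistent with the computational saving the abstract describes.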
Related papers
- 3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models [51.855377054763345]
This paper introduces 3D-CT-GPT, a Visual Question Answering (VQA)-based medical visual language model for generating radiology reports from 3D CT scans.
Experiments on both public and private datasets demonstrate that 3D-CT-GPT significantly outperforms existing methods in terms of report accuracy and quality.
arXiv Detail & Related papers (2024-09-28T12:31:07Z)
- DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z)
- M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models [49.5030774873328]
Previous research has primarily focused on 2D medical images, leaving 3D images under-explored, despite their richer spatial information.
We present a large-scale 3D multi-modal medical dataset, M3D-Data, comprising 120K image-text pairs and 662K instruction-response pairs.
We also introduce a new 3D multi-modal medical benchmark, M3D-Bench, which facilitates automatic evaluation across eight tasks.
arXiv Detail & Related papers (2024-03-31T06:55:12Z)
- Generative Enhancement for 3D Medical Images [74.17066529847546]
We propose GEM-3D, a novel generative approach to the synthesis of 3D medical images.
Our method begins with a 2D slice, termed the informed slice, which serves as the patient prior, and propagates the generation process using a 3D segmentation mask.
By decomposing the 3D medical images into masks and patient prior information, GEM-3D offers a flexible yet effective solution for generating versatile 3D images.
arXiv Detail & Related papers (2024-03-19T15:57:04Z)
- T3D: Towards 3D Medical Image Understanding through Vision-Language Pre-training [33.548818136506334]
We introduce T3D, the first framework designed for high-resolution 3D medical images.
T3D incorporates two text-informed pretext tasks: (i) text-informed contrastive learning; (ii) text-informed image restoration.
T3D significantly outperforms current vSSL methods in tasks like organ and tumor segmentation, as well as disease classification.
arXiv Detail & Related papers (2023-12-03T23:03:22Z)
- Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans [6.936271803454143]
We present a novel task for cross-dataset visual grounding in 3D scenes (Cross3DVG).
We created RIORefer, a large-scale 3D visual grounding dataset.
It includes more than 63k diverse descriptions of 3D objects within 1,380 indoor RGB-D scans from 3RScan.
arXiv Detail & Related papers (2023-05-23T09:52:49Z)
- Video Pretraining Advances 3D Deep Learning on Chest CT Tasks [63.879848037679224]
Pretraining on large natural image classification datasets has aided model development on data-scarce 2D medical tasks.
These 2D models have been surpassed by 3D models on 3D computer vision benchmarks.
We show that video pretraining for 3D models can enable higher performance on smaller datasets for 3D medical tasks.
arXiv Detail & Related papers (2023-04-02T14:46:58Z)
- Oral-3Dv2: 3D Oral Reconstruction from Panoramic X-Ray Imaging with Implicit Neural Representation [3.8215162658168524]
Oral-3Dv2 is a non-adversarial-learning-based model for 3D radiology reconstruction from a single panoramic X-ray image.
Our model learns to represent the 3D oral structure implicitly by mapping 2D coordinates into density values of voxels in 3D space (a rough sketch of this idea appears after this list).
To the best of our knowledge, this is the first non-adversarial-learning-based approach to 3D radiology reconstruction from a single panoramic X-ray image.
arXiv Detail & Related papers (2023-03-21T18:17:27Z)
- 3D Matting: A Soft Segmentation Method Applied in Computed Tomography [26.25446145993599]
Three-dimensional (3D) images, such as CT, MRI, and PET, are common in medical imaging applications and important in clinical diagnosis.
Semantic ambiguity is a typical feature of many medical image labels.
In 2D medical images, using soft masks generated by image matting, rather than binary masks, to characterize lesions can provide rich semantic information.
arXiv Detail & Related papers (2022-09-16T10:18:59Z)
- Automated Model Design and Benchmarking of 3D Deep Learning Models for COVID-19 Detection with Chest CT Scans [72.04652116817238]
We propose a differentiable neural architecture search (DNAS) framework to automatically search for 3D DL models for 3D chest CT scan classification.
We also apply the Class Activation Mapping (CAM) technique to our models to provide interpretability for the results.
arXiv Detail & Related papers (2021-01-14T03:45:01Z)
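As a rough illustration of the implicit representation idea mentioned in the Oral-3Dv2 entry above (a network that maps coordinates to voxel density values), the minimal coordinate-to-density MLP below shows the general pattern; the class name `DensityField`, the layer sizes, and the grid query are illustrative assumptions, not code from that paper.

```python
# Hypothetical sketch of an implicit volumetric representation: a small MLP
# maps a query coordinate to a scalar density, so the 3D volume is stored in
# the network weights rather than in an explicit voxel grid. All names and
# hyperparameters here are assumptions for illustration only.
import torch
import torch.nn as nn

class DensityField(nn.Module):
    def __init__(self, coord_dim=3, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(coord_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar density per query point
        )

    def forward(self, coords):
        # coords: (N, coord_dim) query positions in normalized volume space
        return self.mlp(coords).squeeze(-1)

# Querying the field on a regular grid reconstructs an explicit volume.
field = DensityField()
grid = torch.stack(torch.meshgrid(
    *(torch.linspace(-1, 1, 32),) * 3, indexing="ij"), dim=-1).reshape(-1, 3)
volume = field(grid).reshape(32, 32, 32)
```

Evaluating such a field on a dense grid of coordinates renders the implicit representation back into an explicit voxel volume.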
This list is automatically generated from the titles and abstracts of the papers on this site.