T3D: Towards 3D Medical Image Understanding through Vision-Language
Pre-training
- URL: http://arxiv.org/abs/2312.01529v2
- Date: Tue, 5 Dec 2023 09:01:07 GMT
- Title: T3D: Towards 3D Medical Image Understanding through Vision-Language
Pre-training
- Authors: Che Liu, Cheng Ouyang, Yinda Chen, César Quilodrán-Casas,
Lei Ma, Jie Fu, Yike Guo, Anand Shah, Wenjia Bai, Rossella Arcucci
- Abstract summary: We introduce T3D, the first VLP framework designed for high-resolution 3D medical images.
T3D incorporates two text-informed pretext tasks: (i) text-informed contrastive learning; (ii) text-informed image restoration.
T3D significantly outperforms current vSSL methods in tasks like organ and tumor segmentation, as well as disease classification.
- Score: 33.548818136506334
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Expert annotation of 3D medical image for downstream analysis is
resource-intensive, posing challenges in clinical applications. Visual
self-supervised learning (vSSL), though effective for learning visual
invariance, neglects the incorporation of domain knowledge from medicine. To
incorporate medical knowledge into visual representation learning,
vision-language pre-training (VLP) has shown promising results in 2D images.
However, existing VLP approaches are generally impractical when applied to
high-resolution 3D medical images due to GPU hardware constraints and the
potential loss of critical details caused by downsampling, which is the
intuitive solution to hardware constraints. To address the above limitations,
we introduce T3D, the first VLP framework designed for high-resolution 3D
medical images. T3D incorporates two text-informed pretext tasks:
(i) text-informed contrastive learning;
(ii) text-informed image restoration. These tasks focus on
learning 3D visual representations from high-resolution 3D medical images and
integrating clinical knowledge from radiology reports, without distorting
information through forced alignment of downsampled volumes with detailed
anatomical text. Trained on a newly curated large-scale dataset of 3D medical
images and radiology reports, T3D significantly outperforms current vSSL
methods in tasks like organ and tumor segmentation, as well as disease
classification. This underlines T3D's potential in representation learning for
3D medical image analysis. All data and code will be available upon acceptance.
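As a rough illustration of how the two text-informed pretext tasks could be set up, the Python sketch below pairs an InfoNCE-style alignment between 3D volume embeddings and radiology-report embeddings with a masked-voxel restoration loss. The encoders, loss forms, masking scheme, and hyperparameters are assumptions for illustration only; the abstract does not specify the authors' exact formulation.

  # Hypothetical sketch (PyTorch) of T3D-style text-informed objectives.
  # Everything below is an assumption for illustration, not the authors' code.
  import torch
  import torch.nn.functional as F

  def text_informed_contrastive_loss(vol_emb, txt_emb, temperature=0.07):
      # vol_emb, txt_emb: (batch, dim) features from an assumed 3D image
      # encoder and text encoder, aligned with a symmetric InfoNCE loss.
      vol_emb = F.normalize(vol_emb, dim=-1)
      txt_emb = F.normalize(txt_emb, dim=-1)
      logits = vol_emb @ txt_emb.t() / temperature  # (batch, batch) similarities
      targets = torch.arange(vol_emb.size(0), device=vol_emb.device)
      return 0.5 * (F.cross_entropy(logits, targets)
                    + F.cross_entropy(logits.t(), targets))

  def text_informed_restoration_loss(restored, original, mask):
      # restored, original: (batch, 1, D, H, W) volumes; mask is 1 where voxels
      # were corrupted. Only corrupted voxels contribute (an assumption).
      return ((restored - original) ** 2 * mask).sum() / mask.sum().clamp(min=1)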
Related papers
- E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model [23.56751925900571]
The development of 3D medical vision-language models holds significant potential for disease diagnosis and patient treatment.
We utilize self-supervised learning to construct a 3D visual foundation model for extracting 3D visual features.
We apply 3D spatial convolutions to aggregate and project high-level image features, reducing computational complexity.
Our model demonstrates superior performance compared to existing methods in report generation, visual question answering, and disease diagnosis.
arXiv Detail & Related papers (2024-10-18T06:31:40Z) - CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates that it can identify organs and abnormalities in a zero-shot manner using natural language.
arXiv Detail & Related papers (2024-04-23T17:59:01Z) - M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models [49.5030774873328]
Previous research has primarily focused on 2D medical images, leaving 3D images under-explored, despite their richer spatial information.
We present a large-scale 3D multi-modal medical dataset, M3D-Data, comprising 120K image-text pairs and 662K instruction-response pairs.
We also introduce a new 3D multi-modal medical benchmark, M3D-Bench, which facilitates automatic evaluation across eight tasks.
arXiv Detail & Related papers (2024-03-31T06:55:12Z) - Generative Enhancement for 3D Medical Images [74.17066529847546]
We propose GEM-3D, a novel generative approach to the synthesis of 3D medical images.
Our method begins with a 2D slice, referred to as the informed slice, which serves as the patient prior, and propagates the generation process using a 3D segmentation mask.
By decomposing 3D medical images into masks and patient prior information, GEM-3D offers a flexible yet effective solution for generating versatile 3D images.
arXiv Detail & Related papers (2024-03-19T15:57:04Z) - Multi-View Vertebra Localization and Identification from CT Images [57.56509107412658]
We propose a multi-view vertebra localization and identification method for CT images.
We convert the 3D problem into a 2D localization and identification task on different views.
Our method naturally learns multi-view global information.
arXiv Detail & Related papers (2023-07-24T14:43:07Z) - Generative Text-Guided 3D Vision-Language Pretraining for Unified
Medical Image Segmentation [37.93699188912036]
We present Generative Text-Guided 3D Vision-Language Pretraining for Unified Medical Image Segmentation (GTGM).
GTGM generates medical-style text from 3D medical images without relying on paired descriptions.
A negative-free contrastive learning objective is introduced to cultivate consistent visual representations between augmented 3D medical image patches.
arXiv Detail & Related papers (2023-06-07T22:20:51Z) - Multi-CLIP: Contrastive Vision-Language Pre-training for Question
Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D Vision-Language pre-training method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z) - CLIP-Guided Vision-Language Pre-training for Question Answering in 3D
Scenes [68.61199623705096]
We design a novel 3D Vision-Language pre-training method that helps a model learn semantically meaningful and transferable 3D scene point cloud representations.
We inject the representational power of the popular CLIP model into our 3D encoder by aligning the encoded 3D scene features with the corresponding 2D image and text embeddings.
We evaluate our model's 3D world reasoning capability on the downstream task of 3D Visual Question Answering.
arXiv Detail & Related papers (2023-04-12T16:52:29Z) - 3D Matting: A Benchmark Study on Soft Segmentation Method for Pulmonary
Nodules Applied in Computed Tomography [32.775884701366465]
We introduce image matting into 3D scenes and use an alpha matte, i.e., a soft mask, to describe lesions in 3D medical images.
We conduct a comprehensive study of 3D matting, including both traditional and deep-learning-based methods.
We propose the first end-to-end deep 3D matting network and implement a solid 3D medical image matting benchmark.
arXiv Detail & Related papers (2022-10-11T02:40:18Z) - 3D Matting: A Soft Segmentation Method Applied in Computed Tomography [26.25446145993599]
Three-dimensional (3D) images, such as CT, MRI, and PET, are common in medical imaging applications and important in clinical diagnosis.
Semantic ambiguity is a typical feature of many medical image labels.
In 2D medical images, using soft masks generated by image matting, instead of binary masks, to characterize lesions can provide rich semantic information.
arXiv Detail & Related papers (2022-09-16T10:18:59Z) - 3D Self-Supervised Methods for Medical Imaging [7.65168530693281]
We propose 3D versions of five different self-supervised methods, in the form of proxy tasks.
Our methods facilitate neural network feature learning from unlabeled 3D images, aiming to reduce the required cost for expert annotation.
The developed algorithms are 3D Contrastive Predictive Coding, 3D Rotation prediction, 3D Jigsaw puzzles, Relative 3D patch location, and 3D Exemplar networks.
arXiv Detail & Related papers (2020-06-06T09:56:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.