Generative Text-Guided 3D Vision-Language Pretraining for Unified
Medical Image Segmentation
- URL: http://arxiv.org/abs/2306.04811v1
- Date: Wed, 7 Jun 2023 22:20:51 GMT
- Title: Generative Text-Guided 3D Vision-Language Pretraining for Unified
Medical Image Segmentation
- Authors: Yinda Chen, Che Liu, Wei Huang, Sibo Cheng, Rossella Arcucci, Zhiwei
Xiong
- Abstract summary: We present Generative Text-Guided 3D Vision-Language Pretraining for Unified Medical Image Segmentation (GTGM).
GTGM generates medical-style text from 3D medical images without relying on paired descriptions.
A negative-free contrastive learning objective is introduced to cultivate consistent visual representations between augmented 3D medical image patches.
- Score: 37.93699188912036
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Pretraining (VLP) has demonstrated remarkable capabilities in
learning visual representations from textual descriptions of images without
annotations. Yet, effective VLP demands large-scale image-text pairs, a
resource that suffers scarcity in the medical domain. Moreover, conventional
VLP is limited to 2D images while medical images encompass diverse modalities,
often in 3D, making the learning process more challenging. To address these
challenges, we present Generative Text-Guided 3D Vision-Language Pretraining
for Unified Medical Image Segmentation (GTGM), a framework that extends VLP
to 3D medical images without relying on paired textual descriptions.
Specifically, GTGM utilizes large language models (LLMs) to generate
medical-style text from 3D medical images. This synthetic text is then used to
supervise 3D visual representation learning. Furthermore, a negative-free
contrastive learning objective is introduced to cultivate consistent
visual representations between augmented 3D medical image patches, which
effectively mitigates the biases associated with strict positive-negative
sample pairings. We evaluate GTGM on three imaging modalities - Computed
Tomography (CT), Magnetic Resonance Imaging (MRI), and electron microscopy (EM)
over 13 datasets. GTGM's superior performance across various medical image
segmentation tasks underscores its effectiveness and versatility: it extends
VLP to 3D medical imagery while bypassing the need for paired text.
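The abstract does not spell out the exact form of the negative-free objective, but such losses are commonly formulated in a SimSiam-style way: two augmentations of the same 3D patch are pulled together via a stop-gradient target, with no explicit negative pairs. The sketch below is a minimal PyTorch illustration under that assumption; the module names (encoder_3d, projector, predictor) and dimensions are placeholders, not GTGM's published architecture.

```python
# Minimal sketch of a negative-free contrastive objective between two augmented
# views of a 3D medical image patch (SimSiam-style stop-gradient formulation).
# Assumption: this mirrors the general idea described in the abstract, not the
# paper's exact loss; encoder_3d, projector, predictor are illustrative names.
import torch
import torch.nn.functional as F
from torch import nn


class NegativeFreePatchLoss(nn.Module):
    def __init__(self, encoder_3d: nn.Module, feat_dim: int = 512, proj_dim: int = 256):
        super().__init__()
        self.encoder_3d = encoder_3d  # any 3D backbone producing (B, feat_dim) features
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, proj_dim), nn.ReLU(inplace=True), nn.Linear(proj_dim, proj_dim)
        )
        self.predictor = nn.Sequential(
            nn.Linear(proj_dim, proj_dim), nn.ReLU(inplace=True), nn.Linear(proj_dim, proj_dim)
        )

    def forward(self, view1: torch.Tensor, view2: torch.Tensor) -> torch.Tensor:
        """view1/view2: two augmentations of the same 3D patch, shape (B, C, D, H, W)."""
        z1 = self.projector(self.encoder_3d(view1))
        z2 = self.projector(self.encoder_3d(view2))
        p1, p2 = self.predictor(z1), self.predictor(z2)
        # Symmetric negative cosine similarity; the stop-gradient on the target
        # branch removes the need for explicit negative pairs.
        return -0.5 * (
            F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
            + F.cosine_similarity(p2, z1.detach(), dim=-1).mean()
        )
```

In GTGM this patch-consistency term would sit alongside the image-text supervision driven by the LLM-generated captions; only the patch branch is sketched here.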
Related papers
- ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue [25.398370966763597]
In our online medical consultation scenario, a doctor responds to the texts and images provided by a patient in multiple rounds to diagnose her/his health condition.
Unlike high-quality images captured by professional equipment in traditional medical visual question answering (Med-VQA), the images in our case are taken by patients' mobile phones.
We propose ZALM3, a Zero-shot strategy to improve vision-language alignment in Multi-turn Multimodal Medical dialogue.
arXiv Detail & Related papers (2024-09-26T07:55:57Z)
- Autoregressive Sequence Modeling for 3D Medical Image Representation [48.706230961589924]
We introduce a pioneering method for learning 3D medical image representations through an autoregressive sequence pre-training framework.
Our approach sequences various 3D medical images based on spatial, contrast, and semantic correlations, treating them as interconnected visual tokens within a token sequence.
arXiv Detail & Related papers (2024-09-13T10:19:10Z)
- CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates that it can identify organs and abnormalities in a zero-shot manner using natural language.
arXiv Detail & Related papers (2024-04-23T17:59:01Z)
- M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models [49.5030774873328]
Previous research has primarily focused on 2D medical images, leaving 3D images under-explored, despite their richer spatial information.
We present a large-scale 3D multi-modal medical dataset, M3D-Data, comprising 120K image-text pairs and 662K instruction-response pairs.
We also introduce a new 3D multi-modal medical benchmark, M3D-Bench, which facilitates automatic evaluation across eight tasks.
arXiv Detail & Related papers (2024-03-31T06:55:12Z)
- Generative Enhancement for 3D Medical Images [74.17066529847546]
We propose GEM-3D, a novel generative approach to the synthesis of 3D medical images.
Our method begins with a 2D slice, referred to as the informed slice, which serves as the patient prior, and propagates the generation process using a 3D segmentation mask.
By decomposing the 3D medical images into masks and patient prior information, GEM-3D offers a flexible yet effective solution for generating versatile 3D images.
arXiv Detail & Related papers (2024-03-19T15:57:04Z)
- T3D: Towards 3D Medical Image Understanding through Vision-Language Pre-training [33.548818136506334]
We introduce T3D, the first framework designed for high-resolution 3D medical images.
T3D incorporates two text-informed pretext tasks: (i) text-informed contrastive learning; (ii) text-informed image restoration.
T3D significantly outperforms current vSSL methods in tasks like organ and tumor segmentation, as well as disease classification.
arXiv Detail & Related papers (2023-12-03T23:03:22Z)
- Unified Medical Image Pre-training in Language-Guided Common Semantic Space [39.61770813855078]
We propose a Unified Medical Image Pre-training framework, namely UniMedI.
UniMedI uses diagnostic reports as a common semantic space to create unified representations for diverse modalities of medical images.
We evaluate its performance on both 2D and 3D images across 10 different datasets.
arXiv Detail & Related papers (2023-11-24T22:01:12Z)
- Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training [62.215025958347105]
We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
arXiv Detail & Related papers (2022-09-15T07:26:43Z)
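As a rough illustration of the cross-modal masked-reconstruction idea in the last entry above, the snippet below computes reconstruction losses only over masked image patches and masked text tokens. It is a generic sketch, not the authors' implementation; the tensor shapes and the simple sum of the two terms are assumptions.

```python
# Generic masked multi-modal reconstruction loss: MSE on masked image patches,
# cross-entropy on masked text tokens. Shapes and weighting are assumptions,
# not the paper's published configuration.
import torch.nn.functional as F


def masked_multimodal_loss(pred_pixels, target_pixels, pixel_mask,
                           pred_token_logits, target_tokens, token_mask):
    """Compute reconstruction losses only on masked positions.

    pred_pixels / target_pixels: (B, N_patches, patch_dim)
    pixel_mask:                  (B, N_patches) bool, True where a patch was masked
    pred_token_logits:           (B, L, vocab_size)
    target_tokens:               (B, L) token ids
    token_mask:                  (B, L) bool, True where a text token was masked
    """
    # Image branch: mean squared error over masked patches only.
    img_loss = F.mse_loss(pred_pixels[pixel_mask], target_pixels[pixel_mask])
    # Text branch: cross-entropy over masked tokens only.
    txt_loss = F.cross_entropy(pred_token_logits[token_mask], target_tokens[token_mask])
    return img_loss + txt_loss
```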