Self-supervised Image-text Pre-training With Mixed Data In Chest X-rays
- URL: http://arxiv.org/abs/2103.16022v1
- Date: Tue, 30 Mar 2021 01:48:46 GMT
- Title: Self-supervised Image-text Pre-training With Mixed Data In Chest X-rays
- Authors: Xiaosong Wang and Ziyue Xu and Leo Tam and Dong Yang and Daguang Xu
- Abstract summary: We introduce an image-text pre-training framework that can learn from mixed data inputs.
We demonstrate the feasibility of pre-training across mixed data inputs.
We also illustrate the benefits of adopting such pre-trained models in 3 chest X-ray applications.
- Score: 10.398175542736285
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Pre-trained models, e.g., from ImageNet, have proven to be effective in
boosting the performance of many downstream applications. Acquiring large-scale
annotations to build such models for medical imaging, however, is prohibitively
demanding. Meanwhile, large amounts of clinical data (in the form of images and text
reports) are stored in hospital information systems. Paired image-text data from the
same patient study could be utilized for pre-training in a weakly supervised manner.
However, the integrity, accessibility, and amount of such raw data vary across
institutes, e.g., paired vs. unpaired (image-only or text-only). In this work, we
introduce an image-text pre-training framework that can learn from such raw data
with mixed inputs, i.e., paired image-text data or a mixture of paired and unpaired
data.
The unpaired data can be sourced from one or multiple institutes (e.g., images
from one institute coupled with texts from another). Specifically, we propose a
transformer-based training framework for jointly learning the representation of
both the image and text data. In addition to the existing masked language
modeling, multi-scale masked vision modeling is introduced as a self-supervised
training task for image patch regeneration. We not only demonstrate the
feasibility of pre-training across mixed data inputs but also illustrate the
benefits of adopting such pre-trained models in 3 chest X-ray applications,
i.e., classification, retrieval, and image regeneration. Superior results are
reported in comparison to prior art using MIMIC-CXR, NIH14-CXR, and OpenI-CXR
datasets.
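The core objective can be pictured with a short sketch. The following is a minimal, single-scale simplification and not the authors' released code: the patch size, hidden size, masking strategy, and the use of a single shared transformer encoder are illustrative assumptions. It jointly optimizes masked language modeling on report tokens and masked vision modeling that regresses the pixel values of masked image patches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointMaskedPretrainer(nn.Module):
    """Toy joint MLM + masked vision modeling (single-scale simplification)."""
    def __init__(self, vocab_size=30522, d_model=256, patch=16, img_size=224, n_layers=4):
        super().__init__()
        self.patch = patch
        n_patches = (img_size // patch) ** 2
        patch_dim = patch * patch                      # single-channel X-ray patches
        self.patch_embed = nn.Linear(patch_dim, d_model)
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_img = nn.Parameter(torch.zeros(1, n_patches, d_model))
        self.pos_txt = nn.Parameter(torch.zeros(1, 128, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mlm_head = nn.Linear(d_model, vocab_size)  # predict masked report tokens
        self.mvm_head = nn.Linear(d_model, patch_dim)   # regenerate masked image patches

    def patchify(self, img):                            # img: (B, 1, H, W)
        B, _, H, W = img.shape
        p = self.patch
        x = img.unfold(2, p, p).unfold(3, p, p)         # (B, 1, H/p, W/p, p, p)
        return x.reshape(B, -1, p * p)                  # (B, N, p*p)

    def forward(self, img, tokens, img_mask, txt_mask):
        patches = self.patchify(img)
        v = self.patch_embed(patches) + self.pos_img
        v = torch.where(img_mask.unsqueeze(-1), torch.zeros_like(v), v)  # drop masked patches
        t = self.tok_embed(tokens) + self.pos_txt[:, : tokens.size(1)]
        h = self.encoder(torch.cat([v, t], dim=1))      # joint image-text encoding
        hv, ht = h[:, : v.size(1)], h[:, v.size(1):]
        mvm_loss = ((self.mvm_head(hv) - patches) ** 2)[img_mask].mean()
        mlm_loss = F.cross_entropy(self.mlm_head(ht)[txt_mask], tokens[txt_mask])
        return mvm_loss + mlm_loss

# Illustrative usage with random inputs and random masks:
model = JointMaskedPretrainer()
img = torch.rand(2, 1, 224, 224)
tokens = torch.randint(0, 30522, (2, 64))
img_mask = torch.rand(2, (224 // 16) ** 2) < 0.4        # mask ~40% of image patches
txt_mask = torch.rand(2, 64) < 0.15                     # mask ~15% of report tokens
loss = model(img, tokens, img_mask, txt_mask)
```

The multi-scale aspect of the vision objective (masking and regressing patches at more than one resolution) is omitted here for brevity.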
Related papers
- Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning [78.19528555505961]
We propose a novel vision model pre-training method called Latent Compression Learning (LCL) for interleaved image-text data.
The training objective can be decomposed into two basic tasks: 1) contrastive learning between visual representation and preceding context, and 2) generating subsequent text based on visual representation.
Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets, but can also leverage interleaved pre-training data.
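Read literally, the stated decomposition suggests two loss terms; the snippet below is an illustrative rendering (shapes and function names are mine, not the LCL implementation) of contrasting each image's latent with its preceding text context while generating the subsequent text.

```python
import torch
import torch.nn.functional as F

def lcl_style_losses(img_latent, prev_ctx_emb, next_logits, next_tokens, temperature=0.07):
    """img_latent:   (B, D) visual representations
       prev_ctx_emb: (B, D) pooled embeddings of the text preceding each image
       next_logits:  (B, L, V) decoder predictions for the text following each image
       next_tokens:  (B, L) ground-truth subsequent tokens"""
    z_i = F.normalize(img_latent, dim=-1)
    z_c = F.normalize(prev_ctx_emb, dim=-1)
    sims = z_i @ z_c.t() / temperature                       # (B, B) similarity matrix
    targets = torch.arange(z_i.size(0), device=z_i.device)
    contrastive = (F.cross_entropy(sims, targets) + F.cross_entropy(sims.t(), targets)) / 2
    generative = F.cross_entropy(next_logits.reshape(-1, next_logits.size(-1)),
                                 next_tokens.reshape(-1))
    return contrastive + generative
```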
arXiv Detail & Related papers (2024-06-11T17:59:35Z)
- CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training [6.292642131180376]
In this paper, we tackle the lack of image-text data in chest X-ray by expanding image-label pairs into image-text pairs via general prompts.
We also design two contrastive losses, named ICL and TCL, for learning study-level characteristics of medical images and reports.
Our model outperforms the state-of-the-art models trained under the same conditions.
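For the prompt-based expansion of image-label pairs into image-text pairs, a hypothetical template function might look like the following (the actual CXR-CLIP templates and the ICL/TCL losses are not spelled out in this summary, so the wording here is an assumption):

```python
import random

# Hypothetical prompt templates; not the ones used in CXR-CLIP.
TEMPLATES = [
    "A chest X-ray showing {}.",
    "Findings consistent with {}.",
    "There is radiographic evidence of {}.",
]

def label_to_text(labels):
    """labels: list of class names for one image, e.g. ['cardiomegaly', 'pleural effusion']."""
    if not labels:
        return "No acute cardiopulmonary abnormality."
    return random.choice(TEMPLATES).format(", ".join(labels))

print(label_to_text(["cardiomegaly", "pleural effusion"]))
```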
arXiv Detail & Related papers (2023-10-20T05:44:55Z)
- Utilizing Synthetic Data for Medical Vision-Language Pre-training: Bypassing the Need for Real Images [9.86468773903613]
Medical Vision-Language Pre-training learns representations jointly from medical images and paired radiology reports.
We replace real medical images with their synthetic equivalents, generated from authentic medical reports.
Our empirical evaluation reveals that the performance achieved through synthetic data is on par with or even exceeds that obtained with real images.
arXiv Detail & Related papers (2023-10-10T21:29:41Z)
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic captions.
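A simplified sketch of the bi-path supervision, assuming a standard InfoNCE term per path; the adaptive per-sample weighting described by ALIP is reduced here to a fixed mixing coefficient, so this is an illustration rather than the method itself.

```python
import torch
import torch.nn.functional as F

def info_nce(img_emb, txt_emb, temperature=0.07):
    img_emb, txt_emb = F.normalize(img_emb, dim=-1), F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def bi_path_loss(img_emb, raw_text_emb, synth_caption_emb, alpha=0.5):
    # Supervision from the raw web text and the synthetic caption, blended by alpha.
    return alpha * info_nce(img_emb, raw_text_emb) + (1 - alpha) * info_nce(img_emb, synth_caption_emb)
```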
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
- Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
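A rough sketch of the disruption step, assuming Gaussian noise as the low-level perturbation and a regular 3D patch grid for local masking (both illustrative choices; the paper's exact perturbations are not given in this summary):

```python
import torch

def disrupt(volume, patch=16, mask_ratio=0.5, noise_std=0.1):
    """volume: (B, C, D, H, W) 3D radiology image with D, H, W divisible by `patch`."""
    B, C, D, H, W = volume.shape
    noisy = volume + noise_std * torch.randn_like(volume)       # low-level perturbation
    grid = (D // patch, H // patch, W // patch)                 # coarse patch grid
    keep = torch.rand(B, 1, *grid, device=volume.device) > mask_ratio
    mask = (keep.float()
            .repeat_interleave(patch, 2)
            .repeat_interleave(patch, 3)
            .repeat_interleave(patch, 4))                       # upsample to voxel resolution
    return noisy * mask, mask                                   # locally masked + perturbed input

def reconstruction_loss(decoded, clean_volume):
    # Any encoder-decoder can then be trained to recover the clean volume.
    return torch.mean((decoded - clean_volume) ** 2)
```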
arXiv Detail & Related papers (2023-07-31T17:59:42Z)
- Image Captions are Natural Prompts for Text-to-Image Models [70.30915140413383]
We analyze the relationship between the training effect of synthetic data and the synthetic data distribution induced by prompts.
We propose a simple yet effective method that prompts text-to-image generative models to synthesize more informative and diverse training data.
Our method significantly improves the performance of models trained on synthetic training data.
arXiv Detail & Related papers (2023-07-17T14:38:11Z)
- Unified Medical Image-Text-Label Contrastive Learning With Continuous Prompt [3.218449686637963]
We propose a unified Image-Text-Label contrastive learning framework based on continuous prompts.
We demonstrate through extensive experiments that the Unified Medical Contrastive Learning framework exhibits excellent performance on several downstream tasks.
arXiv Detail & Related papers (2023-07-12T05:19:10Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
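A minimal sketch of such a text-to-image retrieval evaluation in the shared embedding space, reporting Recall@K (function and variable names are mine):

```python
import torch
import torch.nn.functional as F

def recall_at_k(text_emb, image_emb, k=5):
    """text_emb, image_emb: (N, D); row i of each comes from the same study."""
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    sims = t @ v.t()                                     # (N, N) text-to-image similarities
    topk = sims.topk(k, dim=-1).indices                  # k nearest images per report
    targets = torch.arange(t.size(0), device=t.device).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()
```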
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
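As an illustration only (not the released BioViL-T architecture), a CNN-Transformer hybrid multi-image encoder could extract patch tokens from the current and, when available, the prior image with a shared CNN, tag them with a temporal embedding, and fuse them with a transformer:

```python
import torch
import torch.nn as nn

class HybridMultiImageEncoder(nn.Module):
    def __init__(self, d_model=128):
        super().__init__()
        self.cnn = nn.Sequential(                             # tiny stand-in backbone
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.time_embed = nn.Embedding(2, d_model)            # 0 = current, 1 = prior
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fuser = nn.TransformerEncoder(layer, num_layers=2)

    def _tokens(self, img, time_idx):
        feat = self.cnn(img)                                  # (B, D, H', W')
        toks = feat.flatten(2).transpose(1, 2)                # (B, H'*W', D)
        idx = torch.full((1,), time_idx, dtype=torch.long, device=img.device)
        return toks + self.time_embed(idx)                    # mark current vs prior tokens

    def forward(self, current, prior=None):
        toks = self._tokens(current, 0)
        if prior is not None:                                 # prior study may be missing
            toks = torch.cat([toks, self._tokens(prior, 1)], dim=1)
        return self.fuser(toks).mean(dim=1)                   # pooled study representation
```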
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Self Pre-training with Masked Autoencoders for Medical Image Classification and Segmentation [37.25161294917211]
Masked Autoencoder (MAE) has been shown to be effective in pre-training Vision Transformers (ViT) for natural image analysis.
We investigate a self pre-training paradigm with MAE for medical image analysis tasks.
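A compact sketch of the MAE-style masking underlying this self pre-training recipe (mask ratio and tensor shapes are illustrative):

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """patches: (B, N, D) patch embeddings or raw patch pixels."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    ids_shuffle = torch.rand(B, N, device=patches.device).argsort(dim=1)
    ids_keep = ids_shuffle[:, :n_keep]                     # random subset kept visible
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=patches.device)
    mask.scatter_(1, ids_keep, 0.0)                        # 1 marks a dropped (masked) patch
    return visible, mask, ids_keep

def mae_loss(pred, target, mask):
    """pred, target: (B, N, D) reconstructions and patch pixels; mask: (B, N)."""
    per_patch = ((pred - target) ** 2).mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum()           # penalize only masked patches
```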
arXiv Detail & Related papers (2022-03-10T16:22:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.