Vision-Language Modelling For Radiological Imaging and Reports In The
Low Data Regime
- URL: http://arxiv.org/abs/2303.17644v1
- Date: Thu, 30 Mar 2023 18:20:00 GMT
- Title: Vision-Language Modelling For Radiological Imaging and Reports In The
Low Data Regime
- Authors: Rhydian Windsor, Amir Jamaludin, Timor Kadir, Andrew Zisserman
- Abstract summary: This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
- Score: 70.04389979779195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper explores training medical vision-language models (VLMs) -- where
the visual and language inputs are embedded into a common space -- with a
particular focus on scenarios where training data is limited, as is often the
case in clinical datasets. We explore several candidate methods to improve
low-data performance, including: (i) adapting generic pre-trained models to
novel image and text domains (i.e. medical imaging and reports) via unimodal
self-supervision; (ii) using local (e.g. GLoRIA) & global (e.g. InfoNCE)
contrastive loss functions as well as a combination of the two; (iii) extra
supervision during VLM training, via: (a) image- and text-only
self-supervision, and (b) creating additional positive image-text pairs for
training through augmentation and nearest-neighbour search.
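For intuition, the global contrastive objective named in (ii) is the standard symmetric InfoNCE loss over a batch of paired embeddings. A minimal PyTorch sketch; the function name and temperature value are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(image_emb: torch.Tensor,
                  text_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim); row i of each is a matched pair.
    """
    # Cosine similarity = dot product of L2-normalised embeddings.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch)

    # Matched pairs sit on the diagonal; everything else is a negative.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)
```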
Using text-to-image retrieval as a benchmark, we evaluate the performance of
these methods with variable sized training datasets of paired chest X-rays and
radiological reports. Combined, they significantly improve retrieval compared
to fine-tuning CLIP, roughly equivalent to training with substantially more
data. A similar pattern is found in the downstream task of classifying
CXR-related conditions, with our method outperforming CLIP and also BioViL, a
strong CXR VLM benchmark, in the zero-shot and linear probing settings. We
conclude with a set
of recommendations for researchers aiming to train vision-language models on
other medical imaging modalities when training data is scarce. To facilitate
further research, we will make our code and models publicly available.
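As a rough illustration of the text-to-image retrieval benchmark: rank all candidate images by similarity to each query report and check whether the paired image lands in the top K. A sketch under assumed names and shapes:

```python
import torch
import torch.nn.functional as F

def recall_at_k(text_emb: torch.Tensor,
                image_emb: torch.Tensor,
                k: int = 10) -> float:
    """Fraction of report queries whose paired image ranks in the top k.

    text_emb, image_emb: (n, dim); row i of each belongs to the same study.
    """
    sims = F.normalize(text_emb, dim=-1) @ F.normalize(image_emb, dim=-1).t()
    topk = sims.topk(k, dim=-1).indices                # (n, k) image indices
    targets = torch.arange(sims.size(0)).unsqueeze(1)  # true index per query
    return (topk == targets).any(dim=-1).float().mean().item()
```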
Related papers
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLaVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
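As a loose, generic illustration only (not LoGra-Med's actual objective), a triplet-style alignment across three modalities might pull matched image/conversation/caption embeddings together and push mismatched cases apart:

```python
import torch
import torch.nn.functional as F

def tri_modal_alignment(img, conv, cap, margin: float = 0.2):
    """Generic hardest-negative margin loss over three aligned modalities.

    img, conv, cap: (batch, dim) embeddings; row i of each describes the
    same case. An illustrative stand-in, not LoGra-Med's loss.
    """
    loss = 0.0
    for a, b in [(img, conv), (img, cap), (conv, cap)]:
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        pos = (a * b).sum(-1)                     # matched-case similarity
        sims = a @ b.t()
        neg = sims.fill_diagonal_(float("-inf")).max(-1).values
        loss = loss + F.relu(margin + neg - pos).mean()
    return loss / 3
```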
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
- Freeze the backbones: A Parameter-Efficient Contrastive Approach to Robust Medical Vision-Language Pre-training [15.790435273150083]
We introduce the backbone-agnostic Adaptor framework, which preserves medical knowledge in pre-trained image and text encoders by keeping them frozen.
Our framework delivers competitive performance while cutting trainable parameters by over 90% compared to current pre-training approaches.
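The parameter saving comes from freezing both backbones and training only small heads on top. A minimal sketch of that pattern, with module names and sizes assumed rather than taken from the paper:

```python
import torch
import torch.nn as nn

class Adaptor(nn.Module):
    """Small trainable head on top of a frozen pre-trained encoder."""

    def __init__(self, encoder: nn.Module, enc_dim: int, proj_dim: int = 512):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():  # freeze the backbone
            p.requires_grad = False
        self.proj = nn.Sequential(           # only these weights train
            nn.Linear(enc_dim, proj_dim),
            nn.GELU(),
            nn.Linear(proj_dim, proj_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                # backbone stays fixed
            feats = self.encoder(x)
        return self.proj(feats)
```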
arXiv Detail & Related papers (2024-01-02T12:14:41Z)
- Radiology Report Generation Using Transformers Conditioned with Non-imaging Data [55.17268696112258]
This paper proposes a novel multi-modal transformer network that integrates chest x-ray (CXR) images and associated patient demographic information.
The proposed network uses a convolutional neural network to extract visual features from CXRs and a transformer-based encoder-decoder network that combines the visual features with semantic text embeddings of patient demographic information.
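A schematic of that kind of fusion, with all dimensions and module choices assumed for illustration: CNN feature maps are flattened into visual tokens, concatenated with embedded demographic tokens, and decoded into report text by a standard transformer:

```python
import torch
import torch.nn as nn

class CXRReportGenerator(nn.Module):
    """Illustrative fusion of CXR features with demographic embeddings."""

    def __init__(self, cnn: nn.Module, d_model: int = 512, vocab: int = 10000):
        super().__init__()
        self.cnn = cnn  # trunk assumed to emit d_model feature channels
        self.embed = nn.Embedding(vocab, d_model)  # shared for brevity
        self.transformer = nn.Transformer(d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, image, demo_tokens, report_tokens):
        # (B, d_model, H, W) feature map -> (B, H*W, d_model) visual tokens
        feats = self.cnn(image).flatten(2).transpose(1, 2)
        demo = self.embed(demo_tokens)         # (B, L_demo, d_model)
        src = torch.cat([feats, demo], dim=1)  # joint source sequence
        tgt = self.embed(report_tokens)        # causal mask omitted here
        return self.out(self.transformer(src, tgt))
```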
arXiv Detail & Related papers (2023-11-18T14:52:26Z)
- CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training [6.292642131180376]
In this paper, we tackle the lack of image-text data in chest X-ray by expanding image-label pairs into image-text pairs via general prompts.
We also design two contrastive losses, named ICL and TCL, for learning study-level characteristics of medical images and reports.
Our model outperforms the state-of-the-art models trained under the same conditions.
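The image-label-to-image-text expansion can be pictured as filling class labels into prompt templates, so that labelled-only datasets become usable for image-text contrastive training. The templates below are invented examples, not CXR-CLIP's actual prompts:

```python
import random

# Invented prompt templates; CXR-CLIP's actual prompts may differ.
TEMPLATES = [
    "chest x-ray showing {}",
    "radiograph with findings of {}",
    "there is evidence of {}",
]

def label_to_text(labels: list[str]) -> str:
    """Turn an image's class labels into a synthetic report sentence."""
    if not labels:
        return "no acute cardiopulmonary abnormality"
    return ". ".join(random.choice(TEMPLATES).format(l) for l in labels)

# An (image, ["cardiomegaly"]) pair becomes an image-text pair such as
# (image, "chest x-ray showing cardiomegaly").
print(label_to_text(["cardiomegaly", "pleural effusion"]))
```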
arXiv Detail & Related papers (2023-10-20T05:44:55Z)
- Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models [0.8878802873945023]
This paper presents the first systematic study of transferring vision-language segmentation models (VLSMs) to 2D medical images.
Although VLSMs show competitive performance compared to image-only models for segmentation, not all VLSMs utilize the additional information from language prompts.
arXiv Detail & Related papers (2023-08-15T11:28:21Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- MedFMC: A Real-world Dataset and Benchmark For Foundation Model Adaptation in Medical Image Classification [41.16626194300303]
Foundation models, often pre-trained with large-scale data, have achieved remarkable success in jump-starting various vision and language applications.
Recent advances further enable adapting foundation models in downstream tasks efficiently using only a few training samples.
Yet, the application of such learning paradigms in medical image analysis remains scarce due to the shortage of publicly accessible data and benchmarks.
arXiv Detail & Related papers (2023-06-16T01:46:07Z)
- An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians.
Recent studies have achieved promising results in automatic impression generation using large-scale medical text data.
These models often require substantial amounts of medical text data and have poor generalization performance.
arXiv Detail & Related papers (2023-04-17T17:13:42Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
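Conceptually, a multi-image encoder of this kind can be sketched as a shared CNN trunk whose token maps for the current and prior scans are fused by a transformer. This is a generic stand-in, not BioViL-T's actual architecture:

```python
import torch
import torch.nn as nn

class TemporalImageEncoder(nn.Module):
    """Generic fusion of a current scan with an optional prior scan."""

    def __init__(self, cnn: nn.Module, d_model: int = 512, layers: int = 2):
        super().__init__()
        self.cnn = cnn  # shared trunk assumed to emit d_model channels
        block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fuse = nn.TransformerEncoder(block, num_layers=layers)

    def tokens(self, image: torch.Tensor) -> torch.Tensor:
        # (B, d_model, H, W) -> (B, H*W, d_model)
        return self.cnn(image).flatten(2).transpose(1, 2)

    def forward(self, current, prior=None):
        toks = self.tokens(current)
        if prior is not None:  # prior study is optional
            toks = torch.cat([toks, self.tokens(prior)], dim=1)
        return self.fuse(toks).mean(dim=1)  # pooled study representation
```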
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Self-supervised Image-text Pre-training With Mixed Data In Chest X-rays [10.398175542736285]
We introduce an image-text pre-training framework that can learn from mixed data inputs.
We demonstrate the feasibility of pre-training across mixed data inputs.
We also illustrate the benefits of adopting such pre-trained models in 3 chest X-ray applications.
arXiv Detail & Related papers (2021-03-30T01:48:46Z)