Vision-Language Modelling For Radiological Imaging and Reports In The
Low Data Regime
- URL: http://arxiv.org/abs/2303.17644v1
- Date: Thu, 30 Mar 2023 18:20:00 GMT
- Title: Vision-Language Modelling For Radiological Imaging and Reports In The
Low Data Regime
- Authors: Rhydian Windsor, Amir Jamaludin, Timor Kadir, Andrew Zisserman
- Abstract summary: This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
- Score: 70.04389979779195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper explores training medical vision-language models (VLMs) -- where
the visual and language inputs are embedded into a common space -- with a
particular focus on scenarios where training data is limited, as is often the
case in clinical datasets. We explore several candidate methods to improve
low-data performance, including: (i) adapting generic pre-trained models to
novel image and text domains (i.e. medical imaging and reports) via unimodal
self-supervision; (ii) using local (e.g. GLoRIA) & global (e.g. InfoNCE)
contrastive loss functions as well as a combination of the two; (iii) extra
supervision during VLM training, via: (a) image- and text-only
self-supervision, and (b) creating additional positive image-text pairs for
training through augmentation and nearest-neighbour search.
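For intuition, the global contrastive objective named in (ii) is the standard symmetric InfoNCE loss over a batch of paired embeddings. A minimal PyTorch sketch; the function name and temperature value are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(image_emb: torch.Tensor,
                  text_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim); row i of each is a matched pair.
    """
    # Cosine similarity = dot product of L2-normalised embeddings.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch)

    # Matched pairs sit on the diagonal; everything else is a negative.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)
```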
Using text-to-image retrieval as a benchmark, we evaluate the performance of
these methods with variable sized training datasets of paired chest X-rays and
radiological reports. Combined, they significantly improve retrieval compared
to fine-tuning CLIP, roughly equivalent to training with substantially more
data. A similar pattern is found in the downstream task of classifying
CXR-related conditions, with our method outperforming CLIP and also BioViL, a
strong CXR VLM benchmark, in the zero-shot and linear probing settings. We
conclude with a set
of recommendations for researchers aiming to train vision-language models on
other medical imaging modalities when training data is scarce. To facilitate
further research, we will make our code and models publicly available.
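As a rough illustration of the text-to-image retrieval benchmark: rank all candidate images by similarity to each query report and check whether the paired image lands in the top K. A sketch under assumed names and shapes:

```python
import torch
import torch.nn.functional as F

def recall_at_k(text_emb: torch.Tensor,
                image_emb: torch.Tensor,
                k: int = 10) -> float:
    """Fraction of report queries whose paired image ranks in the top k.

    text_emb, image_emb: (n, dim); row i of each belongs to the same study.
    """
    sims = F.normalize(text_emb, dim=-1) @ F.normalize(image_emb, dim=-1).t()
    topk = sims.topk(k, dim=-1).indices                # (n, k) image indices
    targets = torch.arange(sims.size(0)).unsqueeze(1)  # true index per query
    return (topk == targets).any(dim=-1).float().mean().item()
```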
Related papers
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLaVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
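As a loose, generic illustration only (not LoGra-Med's actual objective), a triplet-style alignment across three modalities might pull matched image/conversation/caption embeddings together and push mismatched cases apart:

```python
import torch
import torch.nn.functional as F

def tri_modal_alignment(img, conv, cap, margin: float = 0.2):
    """Generic hardest-negative margin loss over three aligned modalities.

    img, conv, cap: (batch, dim) embeddings; row i of each describes the
    same case. An illustrative stand-in, not LoGra-Med's loss.
    """
    loss = 0.0
    for a, b in [(img, conv), (img, cap), (conv, cap)]:
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        pos = (a * b).sum(-1)                     # matched-case similarity
        sims = a @ b.t()
        neg = sims.fill_diagonal_(float("-inf")).max(-1).values
        loss = loss + F.relu(margin + neg - pos).mean()
    return loss / 3
```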
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
- Freeze the backbones: A Parameter-Efficient Contrastive Approach to Robust Medical Vision-Language Pre-training [15.790435273150083]
We introduce the backbone-agnostic Adaptor framework, which preserves medical knowledge in pre-trained image and text encoders by keeping them frozen.
Our framework delivers competitive performance while cutting trainable parameters by over 90% compared to current pre-training approaches.
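The parameter saving comes from freezing both backbones and training only small heads on top. A minimal sketch of that pattern, with module names and sizes assumed rather than taken from the paper:

```python
import torch
import torch.nn as nn

class Adaptor(nn.Module):
    """Small trainable head on top of a frozen pre-trained encoder."""

    def __init__(self, encoder: nn.Module, enc_dim: int, proj_dim: int = 512):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():  # freeze the backbone
            p.requires_grad = False
        self.proj = nn.Sequential(           # only these weights train
            nn.Linear(enc_dim, proj_dim),
            nn.GELU(),
            nn.Linear(proj_dim, proj_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                # backbone stays fixed
            feats = self.encoder(x)
        return self.proj(feats)
```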
arXiv Detail & Related papers (2024-01-02T12:14:41Z)
- Radiology Report Generation Using Transformers Conditioned with Non-imaging Data [55.17268696112258]
This paper proposes a novel multi-modal transformer network that integrates chest x-ray (CXR) images and associated patient demographic information.
The proposed network uses a convolutional neural network to extract visual features from CXRs and a transformer-based encoder-decoder network that combines the visual features with semantic text embeddings of patient demographic information.
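A schematic of that kind of fusion, with all dimensions and module choices assumed for illustration: CNN feature maps are flattened into visual tokens, concatenated with embedded demographic tokens, and decoded into report text by a standard transformer:

```python
import torch
import torch.nn as nn

class CXRReportGenerator(nn.Module):
    """Illustrative fusion of CXR features with demographic embeddings."""

    def __init__(self, cnn: nn.Module, d_model: int = 512, vocab: int = 10000):
        super().__init__()
        self.cnn = cnn  # trunk assumed to emit d_model feature channels
        self.embed = nn.Embedding(vocab, d_model)  # shared for brevity
        self.transformer = nn.Transformer(d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, image, demo_tokens, report_tokens):
        # (B, d_model, H, W) feature map -> (B, H*W, d_model) visual tokens
        feats = self.cnn(image).flatten(2).transpose(1, 2)
        demo = self.embed(demo_tokens)         # (B, L_demo, d_model)
        src = torch.cat([feats, demo], dim=1)  # joint source sequence
        tgt = self.embed(report_tokens)        # causal mask omitted here
        return self.out(self.transformer(src, tgt))
```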
arXiv Detail & Related papers (2023-11-18T14:52:26Z)
- CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training [6.292642131180376]
In this paper, we tackle the lack of image-text data in chest X-ray by expanding image-label pairs into image-text pairs via general prompts.
We also design two contrastive losses, named ICL and TCL, for learning study-level characteristics of medical images and reports.
Our model outperforms the state-of-the-art models trained under the same conditions.
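The image-label-to-image-text expansion can be pictured as filling class labels into prompt templates, so that labelled-only datasets become usable for image-text contrastive training. The templates below are invented examples, not CXR-CLIP's actual prompts:

```python
import random

# Invented prompt templates; CXR-CLIP's actual prompts may differ.
TEMPLATES = [
    "chest x-ray showing {}",
    "radiograph with findings of {}",
    "there is evidence of {}",
]

def label_to_text(labels: list[str]) -> str:
    """Turn an image's class labels into a synthetic report sentence."""
    if not labels:
        return "no acute cardiopulmonary abnormality"
    return ". ".join(random.choice(TEMPLATES).format(l) for l in labels)

# An (image, ["cardiomegaly"]) pair becomes an image-text pair such as
# (image, "chest x-ray showing cardiomegaly").
print(label_to_text(["cardiomegaly", "pleural effusion"]))
```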
arXiv Detail & Related papers (2023-10-20T05:44:55Z)
- Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models [0.8878802873945023]
This paper presents the first systematic study of transferring vision-language segmentation models (VLSMs) to 2D medical images.
Although VLSMs show competitive performance compared to image-only models for segmentation, not all VLSMs utilize the additional information from language prompts.
arXiv Detail & Related papers (2023-08-15T11:28:21Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- MedFMC: A Real-world Dataset and Benchmark For Foundation Model Adaptation in Medical Image Classification [41.16626194300303]
Foundation models, often pre-trained with large-scale data, have achieved remarkable success in jump-starting various vision and language applications.
Recent advances further enable adapting foundation models in downstream tasks efficiently using only a few training samples.
Yet, the application of such learning paradigms in medical image analysis remains scarce due to the shortage of publicly accessible data and benchmarks.
arXiv Detail & Related papers (2023-06-16T01:46:07Z)
- An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians.
Recent studies have achieved promising results in automatic impression generation using large-scale medical text data.
These models often require substantial amounts of medical text data and have poor generalization performance.
arXiv Detail & Related papers (2023-04-17T17:13:42Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
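Conceptually, a multi-image encoder of this kind can be sketched as a shared CNN trunk whose token maps for the current and prior scans are fused by a transformer. This is a generic stand-in, not BioViL-T's actual architecture:

```python
import torch
import torch.nn as nn

class TemporalImageEncoder(nn.Module):
    """Generic fusion of a current scan with an optional prior scan."""

    def __init__(self, cnn: nn.Module, d_model: int = 512, layers: int = 2):
        super().__init__()
        self.cnn = cnn  # shared trunk assumed to emit d_model channels
        block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fuse = nn.TransformerEncoder(block, num_layers=layers)

    def tokens(self, image: torch.Tensor) -> torch.Tensor:
        # (B, d_model, H, W) -> (B, H*W, d_model)
        return self.cnn(image).flatten(2).transpose(1, 2)

    def forward(self, current, prior=None):
        toks = self.tokens(current)
        if prior is not None:  # prior study is optional
            toks = torch.cat([toks, self.tokens(prior)], dim=1)
        return self.fuse(toks).mean(dim=1)  # pooled study representation
```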
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Self-supervised Image-text Pre-training With Mixed Data In Chest X-rays [10.398175542736285]
We introduce an image-text pre-training framework that can learn from mixed data inputs.
We demonstrate the feasibility of pre-training across mixed data inputs.
We also illustrate the benefits of adopting such pre-trained models in 3 chest X-ray applications.
arXiv Detail & Related papers (2021-03-30T01:48:46Z)