A Reality Check of Vision-Language Pre-training in Radiology: Have We Progressed Using Text?
- URL: http://arxiv.org/abs/2504.05227v1
- Date: Mon, 07 Apr 2025 16:13:26 GMT
- Title: A Reality Check of Vision-Language Pre-training in Radiology: Have We Progressed Using Text?
- Authors: Julio Silva-Rodríguez, Jose Dolz, Ismail Ben Ayed
- Abstract summary: Vision-language pre-training has recently gained popularity as it allows learning rich feature representations using large-scale data sources. This paper revisits supervised, unimodal pre-training, using fine-grained labels instead. We conduct an extensive comparison demonstrating that unimodal pre-training is highly competitive and better suited to integrating heterogeneous data sources.
- Score: 20.94974284175104
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision-language pre-training has recently gained popularity as it allows learning rich feature representations using large-scale data sources. This paradigm has quickly made its way into the medical image analysis community. In particular, there is an impressive amount of recent literature developing vision-language models for radiology. However, the available medical datasets with image-text supervision are scarce, and medical concepts are fine-grained, involving expert knowledge that existing vision-language models struggle to encode. In this paper, we propose to take a prudent step back from the literature and revisit supervised, unimodal pre-training, using fine-grained labels instead. We conduct an extensive comparison demonstrating that unimodal pre-training is highly competitive and better suited to integrating heterogeneous data sources. Our results also question the potential of recent vision-language models for open-vocabulary generalization, which have been evaluated using optimistic experimental settings. Finally, we study novel alternatives to better integrate fine-grained labels and noisy text supervision.
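To make the compared paradigms concrete, the sketch below contrasts a CLIP-style image-text contrastive objective with the supervised, unimodal alternative trained on fine-grained finding labels. This is a minimal illustration, not the authors' code: the encoder, embedding size, number of findings, and toy batch are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Stand-in vision backbone producing one embedding per image (illustrative only)."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224, dim))

    def forward(self, x):
        return self.net(x)

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired image/report embeddings (vision-language pre-training)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def supervised_finegrained_loss(img_emb, labels, classifier):
    """Unimodal alternative: multi-label BCE on fine-grained finding labels."""
    return F.binary_cross_entropy_with_logits(classifier(img_emb), labels)

# Toy batch: 8 grayscale 224x224 "radiographs", paired report embeddings,
# and 14 binary finding labels; all numbers are arbitrary placeholders.
images = torch.randn(8, 1, 224, 224)
report_embeddings = torch.randn(8, 512)   # e.g. from any frozen text encoder
finding_labels = torch.randint(0, 2, (8, 14)).float()

encoder = ImageEncoder()
classifier = nn.Linear(512, 14)
features = encoder(images)

print("contrastive (vision-language) loss:",
      clip_style_loss(features, report_embeddings).item())
print("supervised (unimodal, fine-grained) loss:",
      supervised_finegrained_loss(features, finding_labels, classifier).item())
```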
Related papers
- CXR-Agent: Vision-language models for chest X-ray interpretation with uncertainty aware radiology reporting [0.0]
We evaluate publicly available, state-of-the-art foundation vision-language models for chest X-ray interpretation.
We find that vision-language models often hallucinate with confident language, which slows down clinical interpretation.
We develop an agent-based vision-language approach for report generation using CheXagent's linear probes and BioViL-T's phrase grounding tools.
arXiv Detail & Related papers (2024-07-11T18:39:19Z) - Self-supervised vision-langage alignment of deep learning representations for bone X-rays analysis [53.809054774037214]
This paper proposes leveraging vision-language pretraining on bone X-rays paired with French reports.
It is the first study to integrate French reports to shape an embedding space devoted to bone X-ray representations.
arXiv Detail & Related papers (2024-05-14T19:53:20Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - Knowledge Boosting: Rethinking Medical Contrastive Vision-Language Pre-Training [6.582001681307021]
We propose the Knowledge-Boosting Contrastive Vision-Language Pre-training framework (KoBo).
KoBo integrates clinical knowledge into the learning of vision-language semantic consistency.
Experiments validate the effect of our framework on eight tasks including classification, segmentation, retrieval, and semantic relatedness.
arXiv Detail & Related papers (2023-07-14T09:38:22Z) - XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model.
It can analyze and answer open-ended questions about chest radiographs.
We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z) - Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z) - Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge [68.90835997085557]
We propose a systematic and effective approach to enhance medical vision-and-language pre-training with structured medical knowledge from three perspectives.
First, we align the representations of the vision encoder and the language encoder through knowledge.
Second, we inject knowledge into the multi-modal fusion model, enabling it to perform reasoning with knowledge as a supplement to the input image and text.
Third, we guide the model to put emphasis on the most critical information in images and texts by designing knowledge-induced pretext tasks.
arXiv Detail & Related papers (2022-09-15T08:00:01Z) - Self-supervised Multi-modal Training from Uncurated Image and Reports Enables Zero-shot Oversight Artificial Intelligence in Radiology [31.045221580446963]
We present a model dubbed Medical Cross-attention Vision-Language model (Medical X-VL).
Our model enables various zero-shot tasks for oversight AI, ranging from zero-shot classification (a generic sketch of this setup follows this list) to zero-shot error correction.
Our method was especially successful in the data-limited setting, suggesting its potential for widespread applicability in the medical domain.
arXiv Detail & Related papers (2022-08-10T04:35:58Z) - Making the Most of Text Semantics to Improve Biomedical Vision--Language Processing [17.96645738679543]
We show that textual semantic modelling can substantially improve contrastive learning in self-supervised vision--language processing.
We propose a self-supervised joint vision--language approach with a focus on better text modelling.
arXiv Detail & Related papers (2022-04-21T00:04:35Z) - A Simple Long-Tailed Recognition Baseline via Vision-Language Model [92.2866546058082]
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems.
Recent advances in contrastive visual-language pretraining shed light on a new pathway for visual recognition.
We propose BALLAD to leverage contrastive vision-language models for long-tailed recognition.
arXiv Detail & Related papers (2021-11-29T17:49:24Z)
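As a complement to the entries above, here is a generic sketch of the CLIP-style zero-shot (open-vocabulary) classification protocol referenced in the Medical X-VL entry and scrutinized in the main paper's evaluation. It assumes precomputed image and class-prompt embeddings; shapes, prompts, and values are illustrative and not taken from any of the listed papers.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_features, class_prompt_features):
    """Assign each image to the class whose prompt embedding is most similar."""
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(class_prompt_features, dim=-1)
    similarity = img @ txt.t()              # (n_images, n_classes)
    return similarity.argmax(dim=-1)

# Toy example: 4 image embeddings scored against 3 finding prompts,
# e.g. "a chest x-ray showing {cardiomegaly | pleural effusion | no finding}".
image_features = torch.randn(4, 512)
class_prompt_features = torch.randn(3, 512)  # from a frozen text encoder
print(zero_shot_classify(image_features, class_prompt_features))
```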
This list is automatically generated from the titles and abstracts of the papers on this site.