RAD-DINO: Exploring Scalable Medical Image Encoders Beyond Text Supervision
- URL: http://arxiv.org/abs/2401.10815v1
- Date: Fri, 19 Jan 2024 17:02:17 GMT
- Title: RAD-DINO: Exploring Scalable Medical Image Encoders Beyond Text Supervision
- Authors: Fernando Pérez-García, Harshita Sharma, Sam Bond-Taylor, Kenza Bouzid, Valentina Salvatelli, Maximilian Ilse, Shruthi Bannur, Daniel C. Castro, Anton Schwaighofer, Matthew P. Lungren, Maria Wetscherek, Noel Codella, Stephanie L. Hyland, Javier Alvarez-Valle, Ozan Oktay
- Abstract summary: Language-supervised pre-training has proven to be a valuable method for extracting semantically meaningful features from images.
We introduce RAD-DINO, a biomedical image encoder pre-trained solely on unimodal biomedical imaging data.
- Score: 44.00149519249467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language-supervised pre-training has proven to be a valuable method for
extracting semantically meaningful features from images, serving as a
foundational element in multimodal systems within the computer vision and
medical imaging domains. However, resulting features are limited by the
information contained within the text. This is particularly problematic in
medical imaging, where radiologists' written findings focus on specific
observations; a challenge compounded by the scarcity of paired imaging-text
data due to concerns over leakage of personal health information. In this work,
we fundamentally challenge the prevailing reliance on language supervision for
learning general-purpose biomedical imaging encoders. We introduce RAD-DINO, a
biomedical image encoder pre-trained solely on unimodal biomedical imaging data
that matches or exceeds the performance of state-of-the-art biomedical
language-supervised models across a diverse range of benchmarks. Specifically, the
quality of learned representations is evaluated on standard imaging tasks
(classification and semantic segmentation), and a vision-language alignment
task (text report generation from images). To further demonstrate the drawback
of language supervision, we show that features from RAD-DINO correlate better
with other medical record data (e.g., sex or age), which are generally not
mentioned in radiology reports, than features from language-supervised models.
Finally, we conduct a series of ablations to determine the factors behind
RAD-DINO's performance; notably, we observe that RAD-DINO's downstream performance scales
well with the quantity and diversity of training data, demonstrating that
image-only supervision is a scalable approach for training a foundational
biomedical image encoder.
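To make the evaluation protocol above concrete, the sketch below illustrates the generic frozen-encoder setup the abstract refers to: image features are extracted with a pre-trained encoder whose weights stay fixed, and only a lightweight linear probe is trained for classification. This is a minimal, hypothetical example, not the authors' released code; the encoder, data loaders, and logistic-regression probe are placeholder choices.

```python
# Minimal sketch of a frozen-encoder linear-probe evaluation.
# Assumption: `encoder` is any pre-trained backbone (e.g., a DINOv2-style ViT)
# that maps a batch of images (B, C, H, W) to pooled embeddings (B, D).

import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def extract_features(encoder, loader, device="cpu"):
    """Run the frozen encoder over a DataLoader of (image, label) batches."""
    encoder.eval().to(device)
    feats, labels = [], []
    for images, targets in loader:
        out = encoder(images.to(device))  # assumed to return (B, D) embeddings
        feats.append(out.cpu().numpy())
        labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

def linear_probe_auroc(encoder, train_loader, test_loader, device="cpu"):
    """Fit a logistic-regression probe on frozen features (binary finding)."""
    x_tr, y_tr = extract_features(encoder, train_loader, device)
    x_te, y_te = extract_features(encoder, test_loader, device)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(x_te)[:, 1])
```

Because only the probe's parameters are learned, downstream accuracy in this setting reflects the quality of the frozen representations, which is what makes it a common benchmark for comparing image-only and language-supervised encoders.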
Related papers
- Expert-level vision-language foundation model for real-world radiology and comprehensive evaluation [27.05259342502574]
We present RadFound, a vision-language foundation model tailored for radiology.
It is trained on an extensive dataset of over 8.1 million images and 250,000 image-text pairs.
To establish expert-level multimodal perception and generation capabilities, RadFound introduces an enhanced vision encoder.
arXiv Detail & Related papers (2024-09-24T15:31:49Z) - Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
The Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features.
We conduct downstream tasks of image classification and image-text retrieval on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z) - VALD-MD: Visual Attribution via Latent Diffusion for Medical Diagnostics [0.0]
Visual attribution in medical imaging seeks to make evident the diagnostically relevant components of a medical image.
We here present a novel generative visual attribution technique, one that leverages latent diffusion models in combination with domain-specific large language models.
The resulting system also exhibits a range of latent capabilities including zero-shot localized disease induction.
arXiv Detail & Related papers (2024-01-02T19:51:49Z) - Unified Medical Image Pre-training in Language-Guided Common Semantic Space [39.61770813855078]
We propose a Unified Medical Image Pre-training framework, named UniMedI.
UniMedI uses diagnostic reports as a common semantic space to create unified representations for diverse modalities of medical images.
We evaluate its performance on both 2D and 3D images across 10 different datasets.
arXiv Detail & Related papers (2023-11-24T22:01:12Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z) - Representative Image Feature Extraction via Contrastive Learning Pretraining for Chest X-ray Report Generation [19.69560434388278]
The goal of medical report generation is to accurately capture and describe the image findings.
Previous works pretrain their visual encoding neural networks with large datasets in different domains.
We propose a framework that uses a contrastive learning approach to pretrain the visual encoder and requires no additional meta information.
arXiv Detail & Related papers (2022-09-04T12:07:19Z) - Self-supervised Multi-modal Training from Uncurated Image and Reports Enables Zero-shot Oversight Artificial Intelligence in Radiology [31.045221580446963]
We present a model dubbed Medical Cross-attention Vision-Language model (Medical X-VL).
Our model enables various zero-shot tasks for oversight AI, ranging from the zero-shot classification to zero-shot error correction.
Our method was especially successful in the data-limited setting, suggesting potential for widespread applicability in the medical domain.
arXiv Detail & Related papers (2022-08-10T04:35:58Z) - Auxiliary Signal-Guided Knowledge Encoder-Decoder for Medical Report Generation [107.3538598876467]
We propose an Auxiliary Signal-Guided Knowledge Encoder-Decoder (ASGK) to mimic radiologists' working patterns.
ASGK integrates internal visual feature fusion and external medical linguistic information to guide medical knowledge transfer and learning.
arXiv Detail & Related papers (2020-06-06T01:00:15Z)