Few-shot Adaptation of Medical Vision-Language Models
- URL: http://arxiv.org/abs/2409.03868v1
- Date: Thu, 5 Sep 2024 19:10:29 GMT
- Title: Few-shot Adaptation of Medical Vision-Language Models
- Authors: Fereshteh Shakeri, Yunshi Huang, Julio Silva-Rodríguez, Houda Bahig, An Tang, Jose Dolz, Ismail Ben Ayed
- Abstract summary: We introduce the first structured benchmark for adapting medical vision-language models (VLMs) in a strict few-shot regime.
We evaluate a simple generalization of the linear-probe adaptation baseline, which seeks an optimal blending of the visual prototypes and text embeddings.
Surprisingly, such a text-informed linear probe yields competitive performance compared to convoluted prompt-learning and adapter-based strategies.
- Score: 17.11090825001394
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Integrating image and text data through multi-modal learning has emerged as a new approach in medical imaging research, following its successful deployment in computer vision. While considerable efforts have been dedicated to establishing medical foundation models and their zero-shot transfer to downstream tasks, the popular few-shot setting remains relatively unexplored. Following the strong recent emergence of this setting in computer vision, we introduce the first structured benchmark for adapting medical vision-language models (VLMs) in a strict few-shot regime and investigate various adaptation strategies commonly used in the context of natural images. Furthermore, we evaluate a simple generalization of the linear-probe adaptation baseline, which seeks an optimal blending of the visual prototypes and text embeddings via learnable class-wise multipliers. Surprisingly, such a text-informed linear probe yields competitive performance compared to convoluted prompt-learning and adapter-based strategies, while running considerably faster and accommodating the black-box setting. Our extensive experiments span three different medical modalities and specialized foundation models, nine downstream tasks, and several state-of-the-art few-shot adaptation methods. We make our benchmark and code publicly available to trigger further developments in this emergent subject: https://github.com/FereshteShakeri/few-shot-MedVLMs.
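The text-informed linear probe at the core of the benchmark's findings is simple enough to sketch. Below is a minimal PyTorch sketch of the idea as described in the abstract: class weights blend few-shot visual prototypes with the VLM's text embeddings through learnable class-wise multipliers. The exact parameterization, initialization, and training details are assumptions here and may differ from the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextInformedLinearProbe(nn.Module):
    """Linear probe whose class weights blend visual prototypes and text
    embeddings via learnable class-wise multipliers (a sketch of the idea
    in the abstract, not the authors' exact parameterization)."""

    def __init__(self, visual_prototypes: torch.Tensor, text_embeddings: torch.Tensor):
        # Both tensors: (num_classes, dim). Prototypes could be the mean
        # image-encoder feature of the few support shots per class; text
        # embeddings come from encoding class prompts with the VLM.
        super().__init__()
        self.visual = nn.Parameter(visual_prototypes.clone())
        self.register_buffer("text", F.normalize(text_embeddings, dim=-1))
        # One learnable blending multiplier per class.
        self.alpha = nn.Parameter(torch.ones(text_embeddings.shape[0]))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, dim) features from the frozen VLM image encoder.
        weights = self.visual + self.alpha[:, None] * self.text
        return F.normalize(feats, dim=-1) @ F.normalize(weights, dim=-1).T
```

Only the probe's parameters are trained (e.g. with cross-entropy on the few labeled shots), which is consistent with the abstract's claims of speed and black-box compatibility: the method needs only pre-extracted features, not gradients through the encoder.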
Related papers
- BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models [2.2585213273821716]
We propose a novel prompt learning framework for accurate and generalizable few-shot biomedical image classification.
Our approach achieves effective prompt-context learning by leveraging semantic consistency with averaged prompt ensembles from Large Language Models (LLMs) and knowledge distillation with a statistics-based prompt selection strategy.
We conducted comprehensive validation of our proposed framework on 11 medical datasets across 9 modalities and 10 organs against existing state-of-the-art methods, demonstrating significant improvements in both accuracy and generalizability.
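The averaged-prompt-ensemble idea lends itself to a short sketch. The snippet below is a hedged reading of the summary, not BiomedCoOp's actual API: each class is represented by the mean text embedding of several LLM-generated prompts, with `encode_text` a hypothetical stand-in for the VLM's text encoder.

```python
import torch
import torch.nn.functional as F

def class_anchor(llm_prompts: list[str], encode_text) -> torch.Tensor:
    # Encode each LLM-generated description of the class and average the
    # normalized embeddings into a single class anchor.
    embs = torch.stack([F.normalize(encode_text(p), dim=-1) for p in llm_prompts])
    return F.normalize(embs.mean(dim=0), dim=-1)
```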
arXiv Detail & Related papers (2024-11-21T19:13:04Z)
- TransMed: Large Language Models Enhance Vision Transformer for Biomedical Image Classification [11.202967500669402]
Few-shot learning has been studied to adapt models to tasks with very few samples.
We propose a novel approach that contextualizes labels via large language models (LLMs).
Our findings reveal that the context generated by LLMs significantly enhances the discrimination of semantic embeddings for similar categories.
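As a hedged illustration of label contextualization (the names `llm_describe` and `encode_text` below are hypothetical stand-ins, not TransMed's API), one can encode an LLM-generated description of a class instead of its bare name, giving visually similar categories more separable text embeddings:

```python
def contextualized_label_embedding(label: str, llm_describe, encode_text):
    # e.g. llm_describe("basal cell carcinoma") might return a sentence
    # describing its typical appearance; the enriched string replaces the
    # bare label before text encoding.
    return encode_text(f"{label}: {llm_describe(label)}")
```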
arXiv Detail & Related papers (2023-12-12T09:58:07Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- MedFMC: A Real-world Dataset and Benchmark For Foundation Model Adaptation in Medical Image Classification [41.16626194300303]
Foundation models, often pre-trained on large-scale data, have achieved remarkable success in jump-starting various vision and language applications.
Recent advances further enable adapting foundation models in downstream tasks efficiently using only a few training samples.
Yet, the application of such learning paradigms in medical image analysis remains scarce due to the shortage of publicly accessible data and benchmarks.
arXiv Detail & Related papers (2023-06-16T01:46:07Z)
- Domain Generalization for Mammographic Image Analysis with Contrastive Learning [62.25104935889111]
Training an efficacious deep learning model requires large amounts of data with diverse styles and quality levels.
A novel contrastive learning scheme is developed to equip deep learning models with better style-generalization capability.
The proposed method has been evaluated extensively and rigorously with mammograms from various vendor style domains and several public datasets.
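The summary does not detail the loss, but a generic contrastive objective over style-augmented views, shown below as a plausible sketch rather than the paper's exact scheme, looks like this in PyTorch:

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    # z1, z2: (batch, dim) embeddings of two style-augmented views of the
    # same mammograms; row i of z1 is the positive for row i of z2, and
    # every other row in the batch serves as a negative.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau
    targets = torch.arange(z1.shape[0], device=z1.device)
    return F.cross_entropy(logits, targets)
```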
arXiv Detail & Related papers (2023-04-20T11:40:21Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variably sized training datasets of paired chest X-rays and radiological reports.
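Text-to-image retrieval in a shared embedding space reduces to ranking by cosine similarity. The sketch below shows a standard Recall@k evaluation under an assumed convention (row i of the report embeddings pairs with row i of the image embeddings), not necessarily the paper's exact protocol:

```python
import torch
import torch.nn.functional as F

def recall_at_k(text_emb: torch.Tensor, img_emb: torch.Tensor, k: int = 5) -> float:
    # Rank all images by cosine similarity to each report and check whether
    # the paired image appears in the top k.
    sims = F.normalize(text_emb, dim=-1) @ F.normalize(img_emb, dim=-1).T
    topk = sims.topk(k, dim=1).indices                      # (n_reports, k)
    paired = torch.arange(sims.shape[0], device=sims.device)[:, None]
    return (topk == paired).any(dim=1).float().mean().item()
```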
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Understanding the Tricks of Deep Learning in Medical Image Segmentation: Challenges and Future Directions [66.40971096248946]
In this paper, we collect a series of MedISeg tricks for different model implementation phases.
We experimentally explore the effectiveness of these tricks on consistent baselines.
We also open-source a strong MedISeg repository in which each component is plug-and-play.
arXiv Detail & Related papers (2022-09-21T12:30:05Z)
- Domain Generalization for Mammography Detection via Multi-style and Multi-view Contrastive Learning [47.30824944649112]
A new contrastive learning scheme is developed to augment the generalization capability of deep learning models to various vendors with limited resources.
The backbone network is trained with a multi-style and multi-view unsupervised self-learning scheme to embed features invariant to various vendor styles.
The experimental results suggest that our approach can effectively improve detection performance on both seen and unseen domains.
arXiv Detail & Related papers (2021-11-21T14:29:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.