Medical Image Understanding with Pretrained Vision Language Models: A
Comprehensive Study
- URL: http://arxiv.org/abs/2209.15517v1
- Date: Fri, 30 Sep 2022 15:06:13 GMT
- Title: Medical Image Understanding with Pretrained Vision Language Models: A
Comprehensive Study
- Authors: Ziyuan Qin, Huahui Yi, Qicheng Lao, Kang Li
- Abstract summary: We show that well-designed medical prompts are the key to eliciting knowledge from pre-trained vision-language models (VLMs).
We develop three approaches for automatic generation of medical prompts, which can inject expert-level medical knowledge and image-specific information into the prompts for fine-grained grounding.
- Score: 8.547751745702156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The large-scale pre-trained vision language models (VLM) have shown
remarkable domain transfer capability on natural images. However, it remains
unknown whether this capability can also apply to the medical image domain.
This paper thoroughly studies the knowledge transferability of pre-trained VLMs
to the medical domain, where we show that well-designed medical prompts are the
key to eliciting knowledge from pre-trained VLMs. We demonstrate that by prompting
with expressive attributes that are shared between domains, the VLM can carry
the knowledge across domains and improve its generalization. This mechanism
empowers VLMs to recognize novel objects with few or even no image samples.
Furthermore, to avoid the laborious manual designing process, we develop three
approaches for automatic generation of medical prompts, which can inject
expert-level medical knowledge and image-specific information into the prompts
for fine-grained grounding. We conduct extensive experiments on thirteen
different medical datasets across various modalities, showing that our
well-designed prompts greatly improve the zero-shot performance compared to the
default prompts, and our fine-tuned models surpass the supervised models by a
significant margin.
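To make the prompting mechanism concrete, here is a minimal zero-shot sketch using the Hugging Face transformers CLIP interface; the checkpoint, class names, and attribute wording are illustrative assumptions, not the paper's exact models or prompts.
```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Minimal zero-shot sketch in the spirit of the paper: compare a default
# template against attribute-rich medical prompts. Checkpoint, class names,
# and prompt wording are illustrative assumptions, not the authors' setup.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

default_prompts = ["a photo of a polyp", "a photo of normal tissue"]
attribute_prompts = [
    "a small pink bump-shaped polyp with a smooth surface",
    "flat normal mucosa with regular texture and no protrusion",
]

image = Image.open("endoscopy_frame.png")  # hypothetical input image
for prompts in (default_prompts, attribute_prompts):
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)
    print(probs)  # class probabilities under each prompt set
```
Only the text changes between the two runs; in this zero-shot setting any gain comes from what the prompt describes, not from any weight update.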
Related papers
- VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge [33.25976241152384]
Generalist vision language models (VLMs) have made significant strides in computer vision, but they fall short in specialized fields like healthcare.
In traditional computer vision tasks, creative or approximate answers may be acceptable, but in healthcare, precision is paramount.
This paper introduces a new framework, VILA-M3, for medical VLMs that utilizes domain knowledge via expert models.
arXiv Detail & Related papers (2024-11-19T22:59:14Z)
- Autoregressive Sequence Modeling for 3D Medical Image Representation [48.706230961589924]
We introduce a pioneering method for learning 3D medical image representations through an autoregressive sequence pre-training framework.
Our approach sequences various 3D medical images based on spatial, contrast, and semantic correlations, treating them as interconnected visual tokens within a token sequence.
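As a toy illustration of the autoregressive objective over visual tokens (the 3D tokenizer, vocabulary size, and model width below are assumptions, not the paper's configuration):
```python
import torch
import torch.nn.functional as F

# Toy next-token prediction over visual tokens: each token is predicted from
# the ones before it via a causal mask. The tokenizer that turns 3D volumes
# into discrete tokens is assumed and replaced by random integers here.
vocab, seq_len = 1024, 128
tokens = torch.randint(0, vocab, (4, seq_len))       # stand-in visual tokens
emb = torch.nn.Embedding(vocab, 64)
layer = torch.nn.TransformerEncoderLayer(64, 4, batch_first=True)
head = torch.nn.Linear(64, vocab)

causal = torch.nn.Transformer.generate_square_subsequent_mask(seq_len - 1)
h = layer(emb(tokens[:, :-1]), src_mask=causal)      # causal self-attention
loss = F.cross_entropy(head(h).reshape(-1, vocab), tokens[:, 1:].reshape(-1))
```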
arXiv Detail & Related papers (2024-09-13T10:19:10Z)
- MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training with Masked Autoencoder [26.830574964308962]
We introduce MedFLIP, a Fast Language-Image Pre-training method for Medical analysis.
We explore masked autoencoders (MAEs) for zero-shot learning across domains, which enhances the model's ability to learn from limited data.
Lastly, we validate that using language improves zero-shot performance for medical image analysis.
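A toy sketch of the masked-autoencoder step MedFLIP builds on; the 75% mask ratio and ViT-style patch shapes follow common MAE practice and are assumptions here, not MedFLIP's reported configuration.
```python
import torch

# Hide a large fraction of image patches; the encoder sees only the rest,
# and a decoder is later trained to reconstruct the hidden ones.
def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (batch, num_patches, dim) -> visible patches + permutation."""
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n)                   # one random score per patch
    ids_shuffle = noise.argsort(dim=1)         # random permutation of patches
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patches, 1,
                           ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, ids_shuffle                # encoder sees only `visible`

x = torch.randn(2, 196, 768)                   # e.g. 14x14 ViT patches
visible, _ = random_masking(x)
print(visible.shape)                           # torch.Size([2, 49, 768])
```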
arXiv Detail & Related papers (2024-03-07T16:11:43Z)
- MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning [48.97640824497327]
We propose a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning.
Our model includes global contrastive learning with our designed divergence encoder, local token-knowledge-patch alignment contrastive learning, and knowledge-guided category-level contrastive learning with expert knowledge.
Notably, MLIP surpasses state-of-the-art methods even with limited annotated data, highlighting the potential of multimodal pre-training in advancing medical representation learning.
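For reference, the global image-text contrastive term that frameworks like MLIP build on can be sketched as a symmetric InfoNCE loss; the divergence encoder and knowledge-guided terms are omitted, and the temperature is a common default rather than MLIP's reported setting.
```python
import torch
import torch.nn.functional as F

# CLIP-style image-text contrastive loss: matched pairs sit on the diagonal
# of the similarity matrix and are pulled together; mismatches are pushed apart.
def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0))        # matched pairs on diagonal
    # Symmetric cross-entropy: image->text and text->image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```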
arXiv Detail & Related papers (2024-02-03T05:48:50Z)
- A Foundation LAnguage-Image model of the Retina (FLAIR): Encoding expert knowledge in text supervision [17.583536041845402]
We present FLAIR, a pre-trained vision-language model for universal retinal fundus image understanding.
We compiled 37 open-access, mostly categorical fundus imaging datasets from various sources.
We integrate the expert's domain knowledge in the form of descriptive textual prompts during both pre-training and zero-shot inference.
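A minimal sketch of this prompt-based knowledge encoding; the descriptions and the stub text encoder below are hypothetical placeholders, not FLAIR's released prompts or weights.
```python
import torch
import torch.nn.functional as F

# Each label expands to clinician-style descriptions whose text embeddings
# are averaged into a single class prototype for zero-shot matching.
EXPERT_DESCRIPTIONS = {
    "diabetic retinopathy": [
        "fundus photograph with microaneurysms and dot hemorrhages",
        "retina showing hard exudates near the macula",
    ],
    "normal": ["healthy retina with a clear optic disc and intact vessels"],
}

def class_prototype(label, encode_text):
    # Average the embeddings of all expert descriptions for this label.
    embs = torch.stack([encode_text(d) for d in EXPERT_DESCRIPTIONS[label]])
    return F.normalize(embs.mean(dim=0), dim=-1)

# Stub encoder for demonstration; in practice this is the VLM's text tower.
proto = class_prototype("normal", lambda s: torch.randn(512))
```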
arXiv Detail & Related papers (2023-08-15T17:39:52Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- Customizing General-Purpose Foundation Models for Medical Report Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs), in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z)
- Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge [68.90835997085557]
We propose a systematic and effective approach that enhances medical vision-and-language pre-training with structured medical knowledge from three perspectives.
First, we align the representations of the vision encoder and the language encoder through knowledge.
Second, we inject knowledge into the multi-modal fusion model to enable the model to perform reasoning using knowledge as the supplementation of the input image and text.
Third, we guide the model to put emphasis on the most critical information in images and texts by designing knowledge-induced pretext tasks.
arXiv Detail & Related papers (2022-09-15T08:00:01Z)
- Domain Generalization on Medical Imaging Classification using Episodic Training with Task Augmentation [62.49837463676111]
We propose a novel scheme of episodic training with task augmentation on medical imaging classification.
Motivated by the limited number of source domains in real-world medical deployment, we consider the unique problem of task-level overfitting.
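A schematic of the episodic loop on synthetic data; the paper's task augmentation step is reduced to a comment, and the tiny linear model and data are purely illustrative.
```python
import random
import torch
import torch.nn.functional as F

# Episodic training for domain generalization: each episode holds one source
# domain out as a simulated unseen domain, trains on the rest, and checks the
# held-out one. Task augmentation (extra label-level tasks) is elided here.
torch.manual_seed(0)
domains = [(torch.randn(32, 16), torch.randint(0, 2, (32,)))
           for _ in range(4)]                  # four synthetic source domains
model = torch.nn.Linear(16, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for episode in range(100):
    held_out = random.randrange(len(domains))
    for i, (x, y) in enumerate(domains):
        if i == held_out:
            continue                           # meta-test domain stays unseen
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
    with torch.no_grad():
        xt, yt = domains[held_out]
        acc = (model(xt).argmax(dim=1) == yt).float().mean()
```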
arXiv Detail & Related papers (2021-06-13T03:56:59Z)