Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge
- URL: http://arxiv.org/abs/2209.07118v1
- Date: Thu, 15 Sep 2022 08:00:01 GMT
- Title: Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge
- Authors: Zhihong Chen, Guanbin Li, Xiang Wan
- Abstract summary: We propose a systematic and effective approach to enhance Med-VLP with structured medical knowledge from three perspectives.
First, we align the representations of the vision encoder and the language encoder through knowledge.
Second, we inject knowledge into the multi-modal fusion model to enable the model to perform reasoning, using knowledge as a supplement to the input image and text.
Third, we guide the model to put emphasis on the most critical information in images and texts by designing knowledge-induced pretext tasks.
- Score: 68.90835997085557
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical vision-and-language pre-training (Med-VLP) has received considerable
attention owing to its applicability to extracting generic vision-and-language
representations from medical images and texts. Most existing methods mainly
contain three elements: uni-modal encoders (i.e., a vision encoder and a
language encoder), a multi-modal fusion module, and pretext tasks, with few
studies considering the importance of medical domain expert knowledge and
explicitly exploiting such knowledge to facilitate Med-VLP. Although there
exist knowledge-enhanced vision-and-language pre-training (VLP) methods in the
general domain, most require off-the-shelf toolkits (e.g., object detectors and
scene graph parsers), which are unavailable in the medical domain. In this
paper, we propose a systematic and effective approach to enhance Med-VLP with
structured medical knowledge from three perspectives. First, since knowledge
can be regarded as an intermediate medium between vision and language, we align
the representations of the vision encoder and the language encoder through
knowledge. Second, we inject knowledge into the multi-modal fusion model to
enable the model to perform reasoning, using knowledge as a supplement to the
input image and text. Third, we guide the model to put emphasis on the most
critical information in images and texts by designing knowledge-induced pretext
tasks. To perform a comprehensive evaluation and
facilitate further research, we construct a medical vision-and-language
benchmark including three tasks. Experimental results illustrate the
effectiveness of our approach, where state-of-the-art performance is achieved
on all downstream tasks. Further analyses explore the effects of different
components of our approach and various settings of pre-training.
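As a concrete illustration of the first perspective (alignment through knowledge), the following is a minimal, hypothetical sketch in which entity embeddings from a medical knowledge graph act as shared anchors that both the vision encoder and the language encoder are pulled toward. The module names, dimensions, and the one-linked-entity-per-pair assumption are illustrative choices, not the authors' released implementation.

```python
# Hedged sketch: knowledge as an intermediate medium for aligning a vision
# encoder and a language encoder. Entity ids (e.g., UMLS-style concepts) index
# a learnable table; both modalities are pulled toward the entity linked to
# each image-report pair. Illustrative only, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeGuidedAlignment(nn.Module):
    def __init__(self, img_dim=768, txt_dim=768, kg_vocab=1000, dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, dim)        # projects pooled vision features
        self.txt_proj = nn.Linear(txt_dim, dim)        # projects pooled language features
        self.entity_emb = nn.Embedding(kg_vocab, dim)  # learnable knowledge-graph entity table
        self.temperature = 0.07

    def forward(self, img_feat, txt_feat, entity_ids):
        # img_feat: (B, img_dim) pooled image features from the vision encoder
        # txt_feat: (B, txt_dim) pooled text features from the language encoder
        # entity_ids: (B,) id of a knowledge-graph entity linked to each pair
        v = F.normalize(self.img_proj(img_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        k = F.normalize(self.entity_emb(entity_ids), dim=-1)
        # InfoNCE-style losses with the entity as the shared anchor: the matching
        # image/text and entity share the same row index within the batch.
        targets = torch.arange(v.size(0), device=v.device)
        loss_vk = F.cross_entropy(v @ k.t() / self.temperature, targets)
        loss_tk = F.cross_entropy(t @ k.t() / self.temperature, targets)
        return 0.5 * (loss_vk + loss_tk)

# Usage on random tensors (batch of 4):
model = KnowledgeGuidedAlignment()
loss = model(torch.randn(4, 768), torch.randn(4, 768), torch.randint(0, 1000, (4,)))
loss.backward()
```

In the full approach, this knowledge-anchored alignment would sit alongside the knowledge-injected fusion module and the knowledge-induced pretext tasks described above; the sketch isolates only the alignment idea.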
Related papers
- MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning [48.97640824497327]
We propose a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning.
Our model includes global contrastive learning with our designed divergence encoder, local token-knowledge-patch alignment contrastive learning, and knowledge-guided category-level contrastive learning with expert knowledge.
Notably, MLIP surpasses state-of-the-art methods even with limited annotated data, highlighting the potential of multimodal pre-training in advancing medical representation learning.
arXiv Detail & Related papers (2024-02-03T05:48:50Z)
- Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping computational requirements low (a minimal adapter sketch follows this list).
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
- Knowledge Boosting: Rethinking Medical Contrastive Vision-Language Pre-Training [6.582001681307021]
We propose the Knowledge-Boosting Contrastive Vision-Language Pre-training framework (KoBo), which integrates clinical knowledge into the learning of vision-language semantic consistency.
Experiments validate the effect of our framework on eight tasks including classification, segmentation, retrieval, and semantic relatedness.
arXiv Detail & Related papers (2023-07-14T09:38:22Z)
- LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [85.19963303642427]
We propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images.
The model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics.
This enables us to train a Large Language and Vision Assistant for BioMedicine in less than 15 hours (with eight A100s).
arXiv Detail & Related papers (2023-06-01T16:50:07Z)
- Medical Image Understanding with Pretrained Vision Language Models: A Comprehensive Study [8.547751745702156]
We show that well-designed medical prompts are the key to eliciting knowledge from pre-trained vision-language models (VLMs).
We develop three approaches for automatic generation of medical prompts, which can inject expert-level medical knowledge and image-specific information into the prompts for fine-grained grounding.
arXiv Detail & Related papers (2022-09-30T15:06:13Z)
- Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training [62.215025958347105]
We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
arXiv Detail & Related papers (2022-09-15T07:26:43Z)
- Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer [61.34424171458634]
We study whether integrating visual knowledge into a language model can fill the gap.
Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings.
arXiv Detail & Related papers (2022-03-14T22:02:40Z)
- Vision-Language Intelligence: Tasks, Representation Learning, and Large Models [32.142076223602906]
This paper presents a comprehensive survey of vision-language intelligence from the perspective of time.
We summarize the development in this field into three time periods, namely task-specific methods, vision-language pre-training methods, and larger models empowered by large-scale weakly-labeled data.
arXiv Detail & Related papers (2022-03-03T18:54:59Z)
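For the adapter-based knowledge injection in the "Diversifying Knowledge Enhancement" entry above, a minimal sketch could take the form of a bottleneck adapter that mixes pooled knowledge-graph entity features into the output of a frozen pre-trained language-model layer. The class name, dimensions, and pooled-entity input are assumptions for illustration; they do not reproduce that paper's released code.

```python
# Hedged sketch of a bottleneck adapter injecting knowledge-graph features into
# a frozen transformer layer. Names and dimensions are assumptions.
import torch
import torch.nn as nn

class KnowledgeAdapter(nn.Module):
    """Down-project, add a KG signal, apply a non-linearity, up-project, add residual."""
    def __init__(self, hidden=768, bottleneck=64, kg_dim=128):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.kg_in = nn.Linear(kg_dim, bottleneck)  # maps pooled KG entity features in
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.GELU()

    def forward(self, hidden_states, kg_feat):
        # hidden_states: (B, T, hidden) output of a frozen transformer layer
        # kg_feat: (B, kg_dim) pooled embedding of linked KG entities (e.g., UMLS concepts)
        h = self.down(hidden_states) + self.kg_in(kg_feat).unsqueeze(1)
        return hidden_states + self.up(self.act(h))  # residual keeps the frozen model's signal

adapter = KnowledgeAdapter()
out = adapter(torch.randn(2, 16, 768), torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 16, 768])
```

The residual form means the frozen PLM's behaviour is recovered when the adapter contribution is small, which is one reason adapter-based knowledge injection keeps the added parameter count and compute low.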