Multi-Modal Masked Autoencoders for Medical Vision-and-Language
Pre-Training
- URL: http://arxiv.org/abs/2209.07098v1
- Date: Thu, 15 Sep 2022 07:26:43 GMT
- Title: Multi-Modal Masked Autoencoders for Medical Vision-and-Language
Pre-Training
- Authors: Zhihong Chen, Yuhao Du, Jinpeng Hu, Yang Liu, Guanbin Li, Xiang Wan,
Tsung-Hui Chang
- Abstract summary: We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
- Score: 62.215025958347105
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical vision-and-language pre-training provides a feasible solution to
extract effective vision-and-language representations from medical images and
texts. However, few studies have been dedicated to this field to facilitate
medical vision-and-language understanding. In this paper, we propose a
self-supervised learning paradigm with multi-modal masked autoencoders
(M$^3$AE), which learn cross-modal domain knowledge by reconstructing missing
pixels and tokens from randomly masked images and texts. There are three key
designs to make this simple approach work. First, considering the different
information densities of vision and language, we adopt different masking ratios
for the input image and text, where a considerably larger masking ratio is used
for images. Second, we use visual and textual features from different layers to
perform the reconstruction to deal with different levels of abstraction in
vision and language. Third, we develop different designs for vision and
language decoders (i.e., a Transformer for vision and a multi-layer perceptron
for language). To perform a comprehensive evaluation and facilitate further
research, we construct a medical vision-and-language benchmark including three
tasks. Experimental results demonstrate the effectiveness of our approach,
where state-of-the-art results are achieved on all downstream tasks. In
addition, we conduct further analysis to verify the effectiveness of different
components of our approach and various settings of pre-training. The source
code is available at https://github.com/zhjohnchan/M3AE.
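The three design choices above map onto a compact pre-training module. The following PyTorch sketch is illustrative only: the masking ratios (75% of image patches, 15% of text tokens), dimensions, and layer counts are assumptions rather than the authors' settings, and the multi-level-feature reconstruction is omitted for brevity; the released code at the repository above is the authoritative reference.

# Minimal sketch of an M^3AE-style multi-modal masked autoencoder.
# Ratios, dimensions, and layer counts below are illustrative assumptions,
# not the values used in the paper or its released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiModalMAE(nn.Module):
    def __init__(self, dim=256, vocab_size=30522, n_patches=196,
                 patch_dim=16 * 16 * 3, img_mask_ratio=0.75, txt_mask_ratio=0.15):
        super().__init__()
        # Design 1: a much larger masking ratio for images than for text.
        self.img_mask_ratio = img_mask_ratio
        self.txt_mask_ratio = txt_mask_ratio
        self.patch_embed = nn.Linear(patch_dim, dim)        # flattened 16x16 RGB patches
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.pos_img = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=6)
        # Design 3a: a (small) Transformer decoder reconstructs masked pixels.
        self.img_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.img_head = nn.Linear(dim, patch_dim)
        # Design 3b: a lightweight MLP predicts masked token ids.
        self.txt_decoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                         nn.Linear(dim, vocab_size))

    def random_mask(self, x, ratio):
        """Replace a random fraction `ratio` of positions with the mask token."""
        b, n, _ = x.shape
        masked = torch.rand(b, n, device=x.device) < ratio    # True = masked out
        x = torch.where(masked.unsqueeze(-1), self.mask_token.expand(b, n, -1), x)
        return x, masked

    def forward(self, patches, token_ids):
        # patches: (B, n_patches, patch_dim) flattened pixels; token_ids: (B, L)
        img = self.patch_embed(patches) + self.pos_img
        txt = self.token_embed(token_ids)
        img, img_masked = self.random_mask(img, self.img_mask_ratio)
        txt, txt_masked = self.random_mask(txt, self.txt_mask_ratio)
        fused = self.encoder(torch.cat([img, txt], dim=1))    # joint cross-modal encoding
        h_img, h_txt = fused[:, :img.size(1)], fused[:, img.size(1):]
        pixel_pred = self.img_head(self.img_decoder(h_img))
        token_pred = self.txt_decoder(h_txt)
        # Reconstruction losses are computed on masked positions only.
        img_loss = ((pixel_pred - patches) ** 2)[img_masked].mean()
        txt_loss = F.cross_entropy(token_pred[txt_masked], token_ids[txt_masked])
        return img_loss + txt_loss

Note how the heavier machinery (a small Transformer) sits on the vision side, while a lightweight MLP head suffices for predicting masked token ids, mirroring the third design point; the asymmetric ratios reflect the first one, since images are far more spatially redundant than report text.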
Related papers
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task by soft-masking regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge [68.90835997085557]
We propose a systematic and effective approach to enhance medical vision-and-language pre-training with structured medical knowledge from three perspectives.
First, we align the representations of the vision encoder and the language encoder through knowledge.
Second, we inject knowledge into the multi-modal fusion model so that the model can reason with knowledge as a supplement to the input image and text.
Third, we guide the model to put emphasis on the most critical information in images and texts by designing knowledge-induced pretext tasks.
arXiv Detail & Related papers (2022-09-15T08:00:01Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model (a rough sketch of this recipe appears after this list).
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training [5.119201893752376]
We propose Medical Vision Language Learner (MedViLL) which adopts a Transformer-based architecture combined with a novel multimodal attention masking scheme.
We empirically demonstrate the superior downstream task performance of MedViLL against various baselines including task-specific architectures.
arXiv Detail & Related papers (2021-05-24T15:14:09Z)
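Several entries above revolve around how the two modalities are tied together before fusion. As a rough illustration of the align-before-fuse recipe summarized in the ALBEF entry, the sketch below computes a symmetric image-text contrastive loss on pre-fusion embeddings and softens its targets with pseudo-targets from momentum (EMA) copies of the encoders; the encoder objects, dimensions, and hyperparameters are placeholders, not the ALBEF codebase.

# Rough sketch of ALBEF-style contrastive alignment with momentum distillation.
# The encoders passed in are placeholders (any modules mapping a batch to
# (B, D) embeddings); values like temperature=0.07 are illustrative only.
import torch
import torch.nn.functional as F


def align_before_fuse_loss(img_enc, txt_enc, img_enc_m, txt_enc_m,
                           images, texts, temperature=0.07, alpha=0.4):
    """Image-text contrastive loss on pre-fusion embeddings, with soft
    pseudo-targets from momentum encoders mixed in with weight `alpha`."""
    v = F.normalize(img_enc(images), dim=-1)           # (B, D) image embeddings
    t = F.normalize(txt_enc(texts), dim=-1)            # (B, D) text embeddings
    with torch.no_grad():                               # momentum branch: no gradients
        v_m = F.normalize(img_enc_m(images), dim=-1)
        t_m = F.normalize(txt_enc_m(texts), dim=-1)
        soft_i2t = F.softmax(v_m @ t_m.T / temperature, dim=-1)
        soft_t2i = F.softmax(t_m @ v_m.T / temperature, dim=-1)
    hard = torch.eye(v.size(0), device=v.device)        # one-hot matched-pair targets
    tgt_i2t = alpha * soft_i2t + (1 - alpha) * hard
    tgt_t2i = alpha * soft_t2i + (1 - alpha) * hard
    loss_i2t = -(tgt_i2t * F.log_softmax(v @ t.T / temperature, dim=-1)).sum(1).mean()
    loss_t2i = -(tgt_t2i * F.log_softmax(t @ v.T / temperature, dim=-1)).sum(1).mean()
    return 0.5 * (loss_i2t + loss_t2i)


@torch.no_grad()
def momentum_update(online, momentum, m=0.995):
    """EMA step that keeps the momentum encoder trailing the online encoder."""
    for p, p_m in zip(online.parameters(), momentum.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1 - m)

In such a setup the momentum encoders would typically be initialized as deep copies of the online encoders and refreshed with momentum_update once per training step; a fusion encoder and its matching/masked-prediction heads then operate on the already-aligned features.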