Align before Fuse: Vision and Language Representation Learning with
Momentum Distillation
- URL: http://arxiv.org/abs/2107.07651v1
- Date: Fri, 16 Jul 2021 00:19:22 GMT
- Title: Align before Fuse: Vision and Language Representation Learning with
Momentum Distillation
- Authors: Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq
Joty, Caiming Xiong, Steven Hoi
- Abstract summary: We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
- Score: 52.40490994871753
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale vision and language representation learning has shown promising
improvements on various vision-language tasks. Most existing methods employ a
transformer-based multimodal encoder to jointly model visual tokens
(region-based image features) and word tokens. Because the visual tokens and
word tokens are unaligned, it is challenging for the multimodal encoder to
learn image-text interactions. In this paper, we introduce a contrastive loss
to ALign the image and text representations BEfore Fusing (ALBEF) them through
cross-modal attention, which enables more grounded vision and language
representation learning. Unlike most existing methods, our method does not
require bounding box annotations nor high-resolution images. In order to
improve learning from noisy web data, we propose momentum distillation, a
self-training method which learns from pseudo-targets produced by a momentum
model. We provide a theoretical analysis of ALBEF from a mutual information
maximization perspective, showing that different training tasks can be
interpreted as different ways to generate views for an image-text pair. ALBEF
achieves state-of-the-art performance on multiple downstream vision-language
tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained
on orders of magnitude larger datasets. On VQA and NLVR$^2$, ALBEF achieves
absolute improvements of 2.37% and 3.84% compared to the state-of-the-art,
while enjoying faster inference speed. Code and pre-trained models are
available at https://github.com/salesforce/ALBEF/.
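The two ingredients the abstract names, contrastive alignment of the unimodal representations before fusion and momentum distillation from pseudo-targets, can be illustrated with a short sketch. The code below is a minimal, assumption-laden illustration: `img_enc`/`txt_enc` stand in for unimodal encoders that return one projected embedding per sample, and the temperature, distillation weight `alpha`, and EMA momentum are placeholder values. It does not reproduce the released implementation.
```python
# Minimal sketch: image-text contrastive (ITC) loss with momentum distillation.
# Encoder interfaces and hyper-parameter values are illustrative assumptions.
import torch
import torch.nn.functional as F


def itc_loss_with_momentum(img_enc, txt_enc, img_enc_m, txt_enc_m,
                           images, texts, temp=0.07, alpha=0.4):
    """Align unimodal embeddings before any cross-modal fusion, using soft
    pseudo-targets produced by momentum (EMA) copies of the encoders."""
    # Unimodal embeddings, L2-normalized into a shared space: (B, D).
    img_feat = F.normalize(img_enc(images), dim=-1)
    txt_feat = F.normalize(txt_enc(texts), dim=-1)
    batch_size = img_feat.size(0)

    # Momentum encoders produce the pseudo-targets; no gradients flow here.
    with torch.no_grad():
        img_feat_m = F.normalize(img_enc_m(images), dim=-1)
        txt_feat_m = F.normalize(txt_enc_m(texts), dim=-1)
        sim_i2t_m = img_feat_m @ txt_feat_m.t() / temp
        sim_t2i_m = txt_feat_m @ img_feat_m.t() / temp
        onehot = torch.eye(batch_size, device=img_feat.device)
        # Soft targets: mixture of ground-truth pairs and momentum predictions.
        tgt_i2t = alpha * F.softmax(sim_i2t_m, dim=1) + (1 - alpha) * onehot
        tgt_t2i = alpha * F.softmax(sim_t2i_m, dim=1) + (1 - alpha) * onehot

    # In-batch similarities from the online encoders.
    sim_i2t = img_feat @ txt_feat.t() / temp
    sim_t2i = txt_feat @ img_feat.t() / temp
    loss_i2t = -(F.log_softmax(sim_i2t, dim=1) * tgt_i2t).sum(dim=1).mean()
    loss_t2i = -(F.log_softmax(sim_t2i, dim=1) * tgt_t2i).sum(dim=1).mean()
    return (loss_i2t + loss_t2i) / 2


@torch.no_grad()
def ema_update(model, model_m, momentum=0.995):
    """Exponential-moving-average update of the momentum copy."""
    for p, p_m in zip(model.parameters(), model_m.parameters()):
        p_m.data.mul_(momentum).add_(p.data, alpha=1 - momentum)
```
In training, `ema_update` would be called on each encoder (and its projection head) after every optimizer step, so the momentum copies drift slowly and provide stable targets for noisy web image-text pairs.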
Related papers
- ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation [35.05755930636518]
We propose ViLTA, comprising two components that further help the model learn fine-grained representations of image-text pairs.
For Masked Language Modeling (MLM), we propose a cross-distillation method to generate soft labels that enhance the robustness of the model.
For Image-Text Matching (ITM), we leverage the current language encoder to synthesize hard negatives based on the context of the language input.
arXiv Detail & Related papers (2023-08-31T12:46:36Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task by soft-masking regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with the multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning the two modalities without the use of aligned text-image pairs; a sketch of this quantization step appears after this list.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training.
We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training).
The proposed EPIC method is easily combined with existing pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z)
- Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training [62.215025958347105]
We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
arXiv Detail & Related papers (2022-09-15T07:26:43Z)
- VLMAE: Vision-Language Masked Autoencoder [21.97700040013084]
We propose a vision-language masked autoencoder framework (VLMAE) for vision-language pre-training.
VLMAE employs visual generative learning, helping the model acquire fine-grained and unbiased features.
arXiv Detail & Related papers (2022-08-19T14:39:18Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
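As a pointer back to the Language Quantized AutoEncoders entry above: one plausible form of its quantization step is to snap image-patch features onto the nearest entries of a frozen text-token embedding table, so that an image becomes a sequence of discrete text tokens. The sketch below is an assumption-laden illustration (generic visual encoder output, generic embedding table, straight-through gradient trick), not the paper's exact architecture.
```python
# Sketch: quantize image-patch features to the nearest frozen text-token
# embeddings, turning an image into a sequence of text tokens.
import torch


def quantize_to_text_tokens(patch_feats, token_embeddings):
    """patch_feats: (B, N, D) features from any visual encoder (assumed).
    token_embeddings: (V, D) frozen embedding table of a pretrained LM.
    Returns token ids (B, N) and straight-through quantized features."""
    B, N, D = patch_feats.shape
    flat = patch_feats.reshape(-1, D)                       # (B*N, D)
    # Squared L2 distance from every patch to every vocabulary embedding.
    dists = (flat.pow(2).sum(dim=1, keepdim=True)
             - 2 * flat @ token_embeddings.t()
             + token_embeddings.pow(2).sum(dim=1))          # (B*N, V)
    token_ids = dists.argmin(dim=1)                         # nearest text token
    quantized = token_embeddings[token_ids].reshape(B, N, D)
    # Straight-through estimator: gradients bypass the discrete argmin.
    quantized = patch_feats + (quantized - patch_feats).detach()
    return token_ids.reshape(B, N), quantized
```
A frozen language model could then operate on the resulting token sequence (for example, reconstructing masked tokens), which is one way to obtain the unsupervised alignment signal described in that entry.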
This list is automatically generated from the titles and abstracts of the papers on this site.