Probing Inter-modality: Visual Parsing with Self-Attention for
Vision-Language Pre-training
- URL: http://arxiv.org/abs/2106.13488v2
- Date: Mon, 28 Jun 2021 04:42:48 GMT
- Title: Probing Inter-modality: Visual Parsing with Self-Attention for
Vision-Language Pre-training
- Authors: Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang
Li, Jiebo Luo
- Abstract summary: Vision-Language Pre-training aims to learn multi-modal representations from image-text pairs.
CNNs have limitations in visual relation learning because their local receptive fields are weak at modeling long-range dependencies.
- Score: 139.4566371416662
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Pre-training (VLP) aims to learn multi-modal
representations from image-text pairs and serves downstream vision-language
tasks in a fine-tuning fashion. The dominant VLP models adopt a
CNN-Transformer architecture, which embeds images with a CNN and then aligns
images and text with a Transformer. Visual relationships between visual
contents play an important role in image understanding and are the basis for
inter-modal alignment learning. However, CNNs have limitations in visual
relation learning because their local receptive fields are weak at modeling
long-range dependencies. As a result, the two objectives of learning visual
relations and inter-modal alignment are both left to the same Transformer
network. Such a design might restrict inter-modal alignment learning in the
Transformer by ignoring the specialized characteristics of each objective. To
tackle this, we propose a fully Transformer-based visual embedding for VLP to
better learn visual relations and further promote inter-modal alignment.
Specifically, we propose a metric named Inter-Modality Flow (IMF) to measure
the interaction between the vision and language modalities (i.e.,
inter-modality). We also design a novel masking optimization mechanism named
Masked Feature Regression (MFR) in the Transformer to further promote
inter-modality learning. To the best of our knowledge, this is the first
study to explore the benefit of Transformers for visual feature learning in
VLP. We verify our method on a wide range of vision-language tasks, including
Image-Text Retrieval, Visual Question Answering (VQA), Visual Entailment, and
Visual Reasoning. Our approach not only surpasses state-of-the-art VLP
performance but also shows benefits on the IMF metric.
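The abstract names IMF and MFR without giving their formulas, so the sketch below is only illustrative and rests on assumptions: cross-modal interaction is scored as the share of attention mass that flows between text and vision tokens in a fused Transformer (an IMF-like quantity, not the paper's exact definition), and masked feature regression is implemented as a generic "mask visual tokens, re-encode, regress the originals" loss. All function names, shapes, and the 15% mask ratio are illustrative choices rather than the paper's specification.

```python
# Hedged sketch (not the paper's exact IMF/MFR formulations): a cross-modal
# attention ratio over a joint [text; vision] token sequence, and a
# masked-feature-regression style auxiliary loss on visual tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


def cross_modal_attention_ratio(attn: torch.Tensor, num_text_tokens: int) -> torch.Tensor:
    """attn: (batch, heads, seq, seq), row-softmaxed attention over the joint
    sequence. Returns the average fraction of attention mass that crosses
    modalities (text-to-vision plus vision-to-text)."""
    t = num_text_tokens
    cross = attn[:, :, :t, t:].sum(dim=(-1, -2)) + attn[:, :, t:, :t].sum(dim=(-1, -2))
    total = attn.sum(dim=(-1, -2))  # each row sums to 1, so this equals seq length
    return (cross / total).mean()


def masked_feature_regression_loss(visual_tokens: torch.Tensor,
                                   encoder: nn.Module,
                                   head: nn.Module,
                                   mask_ratio: float = 0.15) -> torch.Tensor:
    """Zero out a random subset of visual tokens, re-encode the corrupted
    sequence, and regress the original features at the masked positions."""
    b, n, _ = visual_tokens.shape
    mask = torch.rand(b, n, device=visual_tokens.device) < mask_ratio
    corrupted = visual_tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    predicted = head(encoder(corrupted))
    return F.mse_loss(predicted[mask], visual_tokens[mask])


if __name__ == "__main__":
    # Toy usage with placeholder modules and hypothetical sizes.
    b, heads, t_len, v_len, d = 2, 8, 12, 36, 256
    attn = torch.softmax(torch.randn(b, heads, t_len + v_len, t_len + v_len), dim=-1)
    print("cross-modal attention ratio:", cross_modal_attention_ratio(attn, t_len).item())

    layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=2)
    head = nn.Linear(d, d)
    visual_tokens = torch.randn(b, v_len, d)
    print("MFR-style loss:", masked_feature_regression_loss(visual_tokens, encoder, head).item())
```

A higher attention ratio would indicate stronger vision-language interaction in the fused layers, which is the kind of behavior the IMF metric is meant to capture; the regression loss illustrates how masking visual tokens forces the model to reconstruct them from the remaining visual and textual context.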
Related papers
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes across 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning [19.73126931526359]
Video paragraph captioning aims to generate a multi-sentence description of an untrimmed video containing several temporal event locations, told as a coherent story.
We first propose a visual-linguistic (VL) feature, in which the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; and (iii) linguistic scene elements.
We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video.
arXiv Detail & Related papers (2022-11-28T07:39:20Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- VLMAE: Vision-Language Masked Autoencoder [21.97700040013084]
We propose a vision-language masked autoencoder framework (VLMAE) for vision-language pre-training.
VLMAE employs visual generative learning, helping the model acquire fine-grained and unbiased features.
arXiv Detail & Related papers (2022-08-19T14:39:18Z)
- VL-BEiT: Generative Vision-Language Pretraining [107.25298505511184]
We introduce a vision-language foundation model called VL-BEiT, which is a bidirectional multimodal Transformer learned by generative pretraining.
Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images.
arXiv Detail & Related papers (2022-06-02T16:14:19Z)
- VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers [47.581265194864585]
Internal mechanisms of vision and multimodal transformers remain largely opaque.
With the success of these transformers, it is increasingly critical to understand their inner workings.
We propose VL-InterpreT, which provides novel interactive visualizations for interpreting the attentions and hidden representations in multimodal transformers.
arXiv Detail & Related papers (2022-03-30T05:25:35Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation [42.01427946204401]
Self-supervised vision-and-language pretraining aims to learn transferable multi-modal representations from large-scale image-text data.
We propose an object-aware end-to-end VLP framework, which directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly.
To achieve that, we design two novel pretext tasks by taking object features and their semantic labels from external detectors as supervision.
arXiv Detail & Related papers (2021-09-22T03:38:05Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to align the image and text representations before fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model (a rough sketch of both ideas follows this list).
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
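ALBEF's exact training objective is defined in its paper; the snippet below is only a minimal sketch, under assumptions, of the two ideas summarized in the entry above: an image-text contrastive (InfoNCE-style) loss computed on unimodal embeddings before fusion, and momentum distillation in which soft pseudo-targets come from an exponentially averaged copy of the encoders. The temperature, mixing weight alpha, and function names are illustrative, not the paper's values.

```python
# Hedged sketch of an ALBEF-style pre-fusion contrastive objective with
# momentum-distilled soft pseudo-targets (illustrative, not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F


def contrastive_with_momentum_targets(img_emb, txt_emb, img_emb_m, txt_emb_m,
                                      temperature=0.07, alpha=0.4):
    """img_emb/txt_emb: (batch, dim) online-encoder embeddings;
    img_emb_m/txt_emb_m: momentum-encoder embeddings used as pseudo-targets."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    img_m = F.normalize(img_emb_m, dim=-1)
    txt_m = F.normalize(txt_emb_m, dim=-1)

    logits_i2t = img @ txt.t() / temperature          # image-to-text similarities
    logits_t2i = txt @ img.t() / temperature

    hard = torch.eye(img.size(0), device=img.device)  # one-hot matched-pair targets
    soft_i2t = F.softmax(img_m @ txt_m.t() / temperature, dim=-1)
    soft_t2i = F.softmax(txt_m @ img_m.t() / temperature, dim=-1)
    tgt_i2t = alpha * soft_i2t + (1 - alpha) * hard   # distilled (softened) targets
    tgt_t2i = alpha * soft_t2i + (1 - alpha) * hard

    loss_i2t = -(F.log_softmax(logits_i2t, dim=-1) * tgt_i2t).sum(dim=-1).mean()
    loss_t2i = -(F.log_softmax(logits_t2i, dim=-1) * tgt_t2i).sum(dim=-1).mean()
    return 0.5 * (loss_i2t + loss_t2i)


@torch.no_grad()
def momentum_update(online: nn.Module, momentum: nn.Module, m: float = 0.995):
    """Exponential moving average update of the momentum encoder."""
    for p, p_m in zip(online.parameters(), momentum.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1 - m)
```

In this kind of setup the momentum encoders receive no gradients; `momentum_update` is typically called after each optimizer step so the pseudo-targets evolve slowly and remain more stable than the online predictions.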
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.