Large-Scale Adversarial Training for Vision-and-Language Representation Learning
- URL: http://arxiv.org/abs/2006.06195v2
- Date: Thu, 22 Oct 2020 18:12:53 GMT
- Title: Large-Scale Adversarial Training for Vision-and-Language Representation Learning
- Authors: Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, Jingjing Liu
- Abstract summary: We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning.
- Score: 81.76089876263175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present VILLA, the first known effort on large-scale adversarial training
for vision-and-language (V+L) representation learning. VILLA consists of two
training stages: (i) task-agnostic adversarial pre-training; followed by (ii)
task-specific adversarial finetuning. Instead of adding adversarial
perturbations on image pixels and textual tokens, we propose to perform
adversarial training in the embedding space of each modality. To enable
large-scale training, we adopt the "free" adversarial training strategy, and
combine it with KL-divergence-based regularization to promote higher invariance
in the embedding space. We apply VILLA to current best-performing V+L models,
and achieve new state of the art on a wide range of tasks, including Visual
Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval,
Referring Expression Comprehension, Visual Entailment, and NLVR2.
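
As a concrete illustration of the training recipe described in the abstract, the snippet below sketches a single adversarial step: a perturbation is added to the embeddings of one modality, updated with one gradient-ascent step, and a KL-divergence term ties the perturbed predictions to the clean ones. This is a minimal sketch, not the authors' released code: the `model`, `img_emb`, `txt_emb`, and `labels` names, the hyperparameter values, and the use of a single inner step (rather than the multi-step "free" gradient reuse the paper adopts) are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def villa_style_step(model, img_emb, txt_emb, labels, eps=1e-3, alpha=1e-3):
    """One simplified adversarial finetuning step: perturb the image
    embeddings, then combine the task losses on clean and perturbed
    inputs with a KL term that promotes invariance of the predictions."""
    # Clean forward pass and standard task loss.
    clean_logits = model(img_emb, txt_emb)
    task_loss = F.cross_entropy(clean_logits, labels)

    # Small random perturbation on the image embeddings (one modality shown;
    # the text embeddings can be perturbed in the same way).
    delta = torch.zeros_like(img_emb).uniform_(-eps, eps).requires_grad_(True)
    adv_loss = F.cross_entropy(model(img_emb + delta, txt_emb), labels)
    (grad,) = torch.autograd.grad(adv_loss, delta)

    # One gradient-ascent step on the perturbation (the paper's "free"
    # strategy reuses gradients over several inner steps; this sketch
    # keeps a single step for brevity).
    delta = (delta + alpha * grad.sign()).detach()
    adv_logits = model(img_emb + delta, txt_emb)

    # KL-divergence regularization between perturbed and clean predictions.
    kl = F.kl_div(F.log_softmax(adv_logits, dim=-1),
                  F.softmax(clean_logits.detach(), dim=-1),
                  reduction="batchmean")

    return task_loss + F.cross_entropy(adv_logits, labels) + kl
```

In a training loop the returned loss is backpropagated as usual. Per the abstract, the same idea is applied in both the task-agnostic pre-training stage and the task-specific finetuning stage, with each stage's own objective in place of the classification loss shown here.
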
Related papers
- COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training [49.2684130383925]
We propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training.
It integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework.
It consistently outperforms previous strong baselines on various zero-shot downstream tasks.
arXiv Detail & Related papers (2024-12-02T18:56:06Z) - TIPS: Text-Image Pretraining with Spatial Awareness [13.38247732379754]
Self-supervised image-only pretraining is still the go-to method for many vision applications.
We propose a novel general-purpose image-text model, which can be effectively used off-the-shelf for dense and global vision tasks.
arXiv Detail & Related papers (2024-10-21T21:05:04Z) - CAVL: Learning Contrastive and Adaptive Representations of Vision and
Language [10.57079240576682]
Visual and linguistic pre-training aims to learn vision and language representations together.
Current pre-trained models tend to require substantial computational resources for fine-tuning when transferred to downstream tasks.
We present a simple but effective approach for learning Contrastive and Adaptive representations of Vision and Language, namely CAVL.
arXiv Detail & Related papers (2023-04-10T05:54:03Z) - DiMBERT: Learning Vision-Language Grounded Representations with
Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z) - MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning (see the contrastive-loss sketch after this list).
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual
Learning [31.622393984150314]
We propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation.
We build a unified Transformer framework to jointly learn visual representations and semantic alignments between image and text.
arXiv Detail & Related papers (2021-06-03T12:50:26Z) - Towards Learning a Generic Agent for Vision-and-Language Navigation via
Pre-training [150.35927365127176]
We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks.
By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions.
It learns more effectively in new tasks and generalizes better in a previously unseen environment.
arXiv Detail & Related papers (2020-02-25T03:08:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.