Large-Scale Adversarial Training for Vision-and-Language Representation Learning
- URL: http://arxiv.org/abs/2006.06195v2
- Date: Thu, 22 Oct 2020 18:12:53 GMT
- Title: Large-Scale Adversarial Training for Vision-and-Language Representation Learning
- Authors: Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, Jingjing Liu
- Abstract summary: We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning.
- Score: 81.76089876263175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present VILLA, the first known effort on large-scale adversarial training
for vision-and-language (V+L) representation learning. VILLA consists of two
training stages: (i) task-agnostic adversarial pre-training; followed by (ii)
task-specific adversarial finetuning. Instead of adding adversarial
perturbations on image pixels and textual tokens, we propose to perform
adversarial training in the embedding space of each modality. To enable
large-scale training, we adopt the "free" adversarial training strategy, and
combine it with KL-divergence-based regularization to promote higher invariance
in the embedding space. We apply VILLA to current best-performing V+L models,
and achieve new state of the art on a wide range of tasks, including Visual
Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval,
Referring Expression Comprehension, Visual Entailment, and NLVR2.
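The abstract describes the training recipe only in prose, so the following minimal PyTorch sketch illustrates the general idea: adversarial perturbations are applied in each modality's embedding space rather than to pixels or tokens, the inner ascent steps reuse gradients in the spirit of "free" adversarial training, and a KL-divergence term keeps predictions on perturbed embeddings close to those on clean embeddings. The `model` interface, variable names (`img_emb`, `txt_emb`), and hyperparameters are illustrative assumptions, not the released VILLA implementation.

```python
# Hypothetical sketch of embedding-space adversarial training with KL regularization.
# `model`, `img_emb`, `txt_emb`, and all hyperparameters are placeholders,
# not the actual VILLA code or API.
import torch
import torch.nn.functional as F

def adversarial_training_step(model, img_emb, txt_emb, labels, optimizer,
                              adv_steps=3, adv_lr=1e-3, eps=1e-2, kl_weight=1.0):
    """One 'free'-style adversarial step: perturbations are added to the
    multimodal embeddings (not raw pixels/tokens), and parameter gradients
    are accumulated across the inner ascent steps."""
    optimizer.zero_grad()

    # Clean forward pass, kept fixed as the reference distribution for the KL term.
    with torch.no_grad():
        clean_logits = model(img_emb, txt_emb)

    # Initialize small random perturbations in each modality's embedding space.
    delta_img = torch.zeros_like(img_emb).uniform_(-eps, eps).requires_grad_(True)
    delta_txt = torch.zeros_like(txt_emb).uniform_(-eps, eps).requires_grad_(True)

    for _ in range(adv_steps):
        logits = model(img_emb + delta_img, txt_emb + delta_txt)
        task_loss = F.cross_entropy(logits, labels)
        # The KL term encourages the perturbed prediction to stay close to the
        # clean one, promoting invariance in the embedding space.
        kl_loss = F.kl_div(F.log_softmax(logits, dim=-1),
                           F.softmax(clean_logits, dim=-1),
                           reduction="batchmean")
        loss = (task_loss + kl_weight * kl_loss) / adv_steps

        # Backward accumulates gradients w.r.t. both parameters and perturbations.
        loss.backward()

        # Ascent step on the perturbations (projected back into the eps-ball),
        # reusing the gradients just computed -- the "free" trick.
        with torch.no_grad():
            delta_img += adv_lr * delta_img.grad.sign()
            delta_txt += adv_lr * delta_txt.grad.sign()
            delta_img.clamp_(-eps, eps)
            delta_txt.clamp_(-eps, eps)
        delta_img.grad.zero_()
        delta_txt.grad.zero_()

    # Single parameter update using the accumulated gradients.
    optimizer.step()
```

In the task-agnostic pre-training stage, the same loop would presumably wrap the pre-training objectives (e.g., masked language modeling) rather than a task-specific cross-entropy loss; during task-specific adversarial finetuning it wraps the downstream loss as shown.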
Related papers
- TIPS: Text-Image Pretraining with Spatial Awareness [13.38247732379754]
Self-supervised image-only pretraining is still the go-to method for many vision applications.
We propose a novel general-purpose image-text model, which can be effectively used off-the-shelf for dense and global vision tasks.
arXiv Detail & Related papers (2024-10-21T21:05:04Z)
- Advancing Prompt Learning through an External Layer [24.77977865016954]
We propose a paradigm called EnPrompt with a novel External Layer (EnLa).
The learnable external layer is built upon valid embeddings of pre-trained CLIP.
Four experiments demonstrate that our method outperforms existing prompt learning methods.
arXiv Detail & Related papers (2024-07-29T03:30:09Z)
- EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE [66.48689706116808]
Efficient Vision-languagE (EVE) is a unified multimodal Transformer pre-trained solely with one unified pre-training task.
EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (MoE); a minimal routing sketch is given after this list.
EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.
arXiv Detail & Related papers (2023-08-23T07:36:30Z)
- CAVL: Learning Contrastive and Adaptive Representations of Vision and Language [10.57079240576682]
Visual and linguistic pre-training aims to learn vision and language representations together.
Current pre-trained models tend to require substantial computational resources for fine-tuning when transferred to downstream tasks.
We present a simple but effective approach for learning Contrastive and Adaptive representations of Vision and Language, namely CAVL.
arXiv Detail & Related papers (2023-04-10T05:54:03Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separate attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning [31.622393984150314]
We propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation.
We build a unified Transformer framework to jointly learn visual representations and semantic alignments between image and text.
arXiv Detail & Related papers (2021-06-03T12:50:26Z)
- Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training [150.35927365127176]
We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks.
By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions.
It learns more effectively in new tasks and generalizes better in a previously unseen environment.
arXiv Detail & Related papers (2020-02-25T03:08:12Z)
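As referenced in the EVE entry above, the sketch below shows one way a modality-aware sparse Mixture-of-Experts feed-forward layer can be wired: every token passes through the shared Transformer, but its feed-forward computation is routed to an expert drawn from the pool assigned to its modality. The class name, layer sizes, shared router, and top-1 routing are assumptions for illustration, not EVE's actual implementation.

```python
# Hypothetical sketch of a modality-aware sparse MoE feed-forward layer.
# Sizes, routing scheme, and names are illustrative assumptions.
import torch
import torch.nn as nn

class ModalityAwareMoE(nn.Module):
    """Routes each token to one expert chosen from the expert pool assigned to
    its modality (vision or text), so both modalities share the Transformer
    backbone while keeping modality-specific feed-forward capacity."""
    def __init__(self, dim=768, hidden=3072, experts_per_modality=2):
        super().__init__()
        # Expert pool laid out as [vision experts | text experts].
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(2 * experts_per_modality)
        ])
        # A single router scores experts within a modality (per-modality routers
        # would be an equally plausible choice).
        self.router = nn.Linear(dim, experts_per_modality)
        self.experts_per_modality = experts_per_modality

    def forward(self, x, modality_ids):
        # x: (num_tokens, dim); modality_ids: (num_tokens,) with 0 = vision, 1 = text.
        scores = self.router(x)                                  # (num_tokens, E)
        top1 = scores.argmax(dim=-1)                             # expert index within the modality
        gates = scores.softmax(dim=-1).gather(1, top1.unsqueeze(-1))  # routing weight, keeps the router trainable
        expert_idx = modality_ids * self.experts_per_modality + top1  # global expert index
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = gates[mask] * expert(x[mask])
        return out
```

A production sparse-MoE layer would typically also add a load-balancing auxiliary loss and per-expert capacity limits; both are omitted here for brevity.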
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.