VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
- URL: http://arxiv.org/abs/2206.08919v1
- Date: Fri, 17 Jun 2022 17:56:47 GMT
- Title: VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
- Authors: Teng Wang, Wenhao Jiang, Zhichao Lu, Feng Zheng, Ran Cheng, Chengguo
Yin, Ping Luo
- Abstract summary: This paper proposes a data augmentation method, namely cross-modal CutMix.
CMC transforms natural sentences from the textual view into a multi-modal view.
By attaching cross-modal noise to uni-modal data, it guides models to learn token-level interactions across modalities for better denoising.
- Score: 59.25846149124199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing vision-language pre-training (VLP) methods primarily rely on paired
image-text datasets, which either require enormous human labor to annotate or
are crawled from the internet and then curated with elaborate data cleaning techniques. To
reduce the dependency on well-aligned image-text pairs, it is promising to
directly leverage the large-scale text-only and image-only corpora. This paper
proposes a data augmentation method, namely cross-modal CutMix (CMC), for
implicit cross-modal alignment learning in unpaired VLP. Specifically, CMC
transforms natural sentences from the textual view into a multi-modal view,
where visually-grounded words in a sentence are randomly replaced by diverse
image patches with similar semantics. The proposed CMC has several appealing
properties. First, it enhances data diversity while keeping the semantic
meaning intact, which helps when aligned data are scarce. Second, by attaching
cross-modal noise to uni-modal data, it guides models to learn token-level
interactions across modalities for better denoising. Furthermore, we present a
new unpaired VLP method, dubbed VLMixer, which integrates CMC with contrastive
learning to pull together the uni-modal and multi-modal views for better
instance-level alignment across modalities. Extensive experiments on five
downstream tasks show that
VLMixer could surpass previous state-of-the-art unpaired VLP methods.
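To make the CMC idea in the abstract concrete, here is a minimal, hypothetical sketch of the augmentation step: visually-grounded tokens in a text-only sentence are randomly swapped for embeddings of semantically similar image patches gathered from image-only data. All names (`patch_gallery`, `grounded_vocab`, `replace_prob`) and the embedding interfaces are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of cross-modal CutMix (CMC): visually-grounded words in a
# text-only sentence are randomly replaced by image-patch embeddings with
# similar semantics, producing a multi-modal view of the same sentence.
# Module names and hyper-parameters are illustrative assumptions.
import random
import torch

def cross_modal_cutmix(token_ids, token_embed, patch_gallery, grounded_vocab,
                       replace_prob=0.3):
    """token_ids: list of int token ids for one sentence.
    token_embed: nn.Embedding mapping token ids to D-dim text embeddings.
    patch_gallery: dict[token_id -> Tensor[K, D]] of candidate patch embeddings
        pre-extracted from image-only data (similar semantics, diverse patches).
    grounded_vocab: set of token ids treated as visually grounded."""
    mixed, is_patch = [], []
    for tid in token_ids:
        if tid in grounded_vocab and tid in patch_gallery and random.random() < replace_prob:
            candidates = patch_gallery[tid]              # [K, D] diverse patches
            idx = random.randrange(candidates.size(0))
            mixed.append(candidates[idx])                # cross-modal "noise"
            is_patch.append(True)
        else:
            mixed.append(token_embed(torch.tensor(tid)))
            is_patch.append(False)
    # The resulting multi-modal sequence is fed to the transformer with a
    # denoising objective, encouraging token-level cross-modal interactions.
    return torch.stack(mixed), is_patch
```

As the abstract states, VLMixer then treats the text-only sequence and the CMC-mixed sequence as two views of the same instance and pulls them together with contrastive learning for instance-level alignment; the exact loss and retrieval of matching patches are details beyond this sketch.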
Related papers
- Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG).
arXiv Detail & Related papers (2024-11-21T16:33:30Z) - Multimodal Unlearnable Examples: Protecting Data against Multimodal Contrastive Learning [53.766434746801366]
Multimodal contrastive learning (MCL) has shown remarkable advances in zero-shot classification by learning from millions of image-caption pairs crawled from the Internet.
Hackers may exploit image-text data for model training without authorization, and such data can include personal and privacy-sensitive information.
Recent works propose generating unlearnable examples by adding imperceptible perturbations to training images to build shortcuts for protection.
We propose Multi-step Error Minimization (MEM), a novel optimization process for generating multimodal unlearnable examples.
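The entry above builds on error-minimizing ("unlearnable") perturbations; below is a generic, hedged sketch of that underlying idea, in which noise is optimized to make the training loss small so the protected data teaches the model little. MEM's actual multi-step, multimodal procedure also optimizes text triggers and differs in detail; the `model` and `loss_fn` interfaces here are assumptions.

```python
# Generic sketch of error-minimizing noise (not MEM's actual algorithm):
# the perturbation delta is optimized to *minimize* the training loss,
# creating shortcuts so the model learns little from the protected images.
import torch

def error_minimizing_noise(model, loss_fn, images, captions,
                           steps=10, eps=8 / 255, lr=1 / 255):
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(images + delta, captions))  # assumed interface
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()   # descend: make the loss small
            delta.clamp_(-eps, eps)           # keep the noise imperceptible
        delta.grad.zero_()
    return delta.detach()
```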
arXiv Detail & Related papers (2024-07-23T09:00:52Z) - Sieve: Multimodal Dataset Pruning Using Image Captioning Models [11.362835828985494]
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets.
We argue that pruning such data with CLIP similarity scores suffers from multiple limitations, including false positives and negatives due to CLIP's pretraining on noisy labels.
We propose a pruning signal, Sieve, that employs synthetic captions generated by image-captioning models pretrained on small, diverse, and well-aligned image-text pairs.
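A rough sketch of the kind of pruning signal described above: a captioning model pretrained on well-aligned pairs produces a synthetic caption per image, and its agreement with the noisy web caption ranks samples for pruning. The text-similarity backbone and function names are assumptions, not Sieve's exact pipeline.

```python
# Illustrative sketch (not the authors' code): score web image-text pairs by
# how well a synthetic caption from a well-aligned captioner agrees with the
# crawled caption, then keep only the highest-scoring fraction.
# `captioner` and `text_encoder` are assumed interfaces, not a specific API.
import torch
import torch.nn.functional as F

def sieve_scores(images, web_captions, captioner, text_encoder):
    synthetic = [captioner(img) for img in images]   # list of caption strings
    z_syn = text_encoder(synthetic)                  # [N, D] sentence embeddings
    z_web = text_encoder(web_captions)               # [N, D]
    return F.cosine_similarity(z_syn, z_web, dim=-1) # higher = better aligned

def prune(dataset, scores, keep_ratio=0.5):
    k = int(len(dataset) * keep_ratio)
    keep = torch.topk(scores, k).indices.tolist()
    return [dataset[i] for i in keep]
```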
arXiv Detail & Related papers (2023-10-03T14:53:53Z) - MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z) - ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text
Pre-training [40.05046655477684]
ERNIE-ViL 2.0 is a Multi-View Contrastive learning framework that builds intra-modal and inter-modal correlations between diverse views simultaneously.
We construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs.
ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval.
arXiv Detail & Related papers (2022-09-30T07:20:07Z) - Vision-Language Pre-Training with Triple Contrastive Learning [45.80365827890119]
We propose triple contrastive learning (TCL) for vision-language pre-training by leveraging both cross-modal and intra-modal self-supervision.
Ours is the first work that takes into account local structure information for multi-modality representation learning.
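As a generic illustration of combining cross-modal and intra-modal self-supervision, the sketch below sums one image-text InfoNCE term with two intra-modal terms over augmented views. It is reconstructed from the summary only and omits TCL's local-structure components.

```python
# Generic sketch of a triple contrastive objective: one cross-modal term
# (image <-> text) plus two intra-modal terms over augmented views.
# This follows the summary above, not TCL's exact losses.
import torch
import torch.nn.functional as F

def info_nce(q, k, temperature=0.07):
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    logits = q @ k.t() / temperature                    # positives on the diagonal
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

def triple_contrastive_loss(img, img_aug, txt, txt_aug):
    cross = info_nce(img, txt)          # cross-modal alignment
    intra_i = info_nce(img, img_aug)    # intra-modal (vision)
    intra_t = info_nce(txt, txt_aug)    # intra-modal (language)
    return cross + intra_i + intra_t
```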
arXiv Detail & Related papers (2022-02-21T17:54:57Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple
Levels [35.57369098866317]
Vision-language pre-training on large-scale image-text pairs has witnessed rapid progress for learning cross-modal representations.
We propose a new pre-training method which jointly aligns both the low-level and high-level semantics between image and text representations.
arXiv Detail & Related papers (2021-03-14T02:39:14Z) - WenLan: Bridging Vision and Language by Large-Scale Multi-Modal
Pre-Training [71.37731379031487]
We propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework.
Unlike OpenAI's CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting MoCo to the cross-modal scenario.
By building a large queue-based dictionary, our BriVL can incorporate more negative samples with limited GPU resources.
arXiv Detail & Related papers (2021-03-11T09:39:49Z)