Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
- URL: http://arxiv.org/abs/2211.02077v1
- Date: Thu, 3 Nov 2022 18:12:32 GMT
- Title: Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
- Authors: Junru Wu, Yi Liang, Feng Han, Hassan Akbari, Zhangyang Wang, Cong Yu
- Abstract summary: Self-supervised pre-training has recently demonstrated success on large-scale multimodal data.
Cross-modality alignment (CMA) provides only weak and noisy supervision.
CMA might cause conflicts and biases among modalities.
- Score: 68.49738668084693
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised pre-training has recently demonstrated success on
large-scale multimodal data, and state-of-the-art contrastive learning methods
often enforce feature consistency across cross-modality inputs, such as
video/audio or video/text pairs. Despite being convenient to formulate and
leverage in practice, such cross-modality alignment (CMA) provides only weak
and noisy supervision, since two modalities can be semantically misaligned
even when they are temporally aligned. For example, even in the commonly
adopted instructional videos, a speaker can sometimes refer to something that
is not visually present in the current frame; and the semantic misalignment is
only more unpredictable for raw videos from the internet. We conjecture that
this misalignment might cause conflicts and biases among modalities, and may
hence prevent CMA from scaling up to training with larger and more
heterogeneous data. This paper first verifies our conjecture by observing
that, even in the latest VATT pre-training using only instructional videos,
there exist strong gradient conflicts between different CMA losses within the
same (video, audio, text) triplet, indicating them as a noisy source of
supervision. We then propose to harmonize such gradients via two techniques:
(i) cross-modality gradient realignment: modifying the different CMA loss
gradients for each sample triplet so that their gradient directions are more
aligned; and (ii) gradient-based curriculum learning: leveraging the gradient
conflict information as an indicator of sample noisiness to develop a
curriculum learning strategy that prioritizes training on less noisy sample
triplets. Applying these techniques to pre-training VATT on the HowTo100M
dataset, we consistently improve its performance on different downstream
tasks. Moreover, we are able to scale VATT pre-training to the more
complicated, non-narrative YouTube-8M dataset, further improving the state of
the art.
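The two techniques lend themselves to a compact sketch. The snippet below is a
minimal, illustrative PyTorch-style example, not the authors' released
implementation: it assumes a PCGrad-style projection for the realignment in
(i) and a cosine-similarity-based soft weight as the noisiness indicator for
the curriculum in (ii); the toy tensors stand in for per-triplet gradients of
the video/audio and video/text CMA losses.

```python
# Minimal sketch of the two gradient-harmonization ideas above (assumptions,
# not the paper's released code): (i) PCGrad-style realignment of conflicting
# CMA gradients and (ii) cosine-similarity-based curriculum weighting.
import torch
import torch.nn.functional as F


def realign(g_a: torch.Tensor, g_b: torch.Tensor, eps: float = 1e-12):
    """(i) Cross-modality gradient realignment.

    If the two flattened per-triplet gradients conflict (negative inner
    product), project each onto the normal plane of the other so that their
    directions become more aligned (PCGrad-style rule, assumed here).
    """
    dot = torch.dot(g_a, g_b)
    if dot < 0:
        g_a_new = g_a - dot / (g_b.norm().pow(2) + eps) * g_b
        g_b_new = g_b - dot / (g_a.norm().pow(2) + eps) * g_a
        return g_a_new, g_b_new
    return g_a, g_b


def curriculum_weight(g_a: torch.Tensor, g_b: torch.Tensor, tau: float = 0.1):
    """(ii) Gradient-based curriculum learning.

    Treat the cosine similarity between the two CMA gradients as an indicator
    of sample noisiness and down-weight triplets whose gradients conflict
    (the soft sigmoid weighting is an illustrative choice).
    """
    cos = F.cosine_similarity(g_a, g_b, dim=0)
    return torch.sigmoid(cos / tau)


# Toy usage: stand-ins for per-triplet gradients of the video/audio and
# video/text CMA losses w.r.t. the shared backbone parameters (flattened).
g_va = torch.tensor([1.0, -0.5, 0.2])
g_vt = torch.tensor([-0.8, 0.4, 0.3])

g_va_h, g_vt_h = realign(g_va, g_vt)      # harmonized gradient directions
weight = curriculum_weight(g_va, g_vt)    # small weight for noisy triplets
update = weight * (g_va_h + g_vt_h)       # combined, weighted update step
```

In practice the per-triplet gradients would come from separate backward passes
of the two CMA losses through the shared backbone; the projection and
weighting rules above are placeholders for whatever harmonization rule the
full method uses.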
Related papers
- Classifier-guided Gradient Modulation for Enhanced Multimodal Learning [50.7008456698935]
Classifier-Guided Gradient Modulation (CGGM) is a novel method to balance multimodal learning with gradients.
We conduct extensive experiments on four multimodal datasets: UPMC-Food 101, CMU-MOSI, IEMOCAP and BraTS.
CGGM outperforms all the baselines and other state-of-the-art methods consistently.
arXiv Detail & Related papers (2024-11-03T02:38:43Z)
- PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation [51.509573838103854]
We propose a semi-supervised learning framework, termed Progressive Mean Teachers (PMT), for medical image segmentation.
Our PMT generates high-fidelity pseudo labels by learning robust and diverse features in the training process.
Experimental results on two datasets with different modalities, i.e., CT and MRI, demonstrate that our method outperforms the state-of-the-art medical image segmentation approaches.
arXiv Detail & Related papers (2024-09-08T15:02:25Z)
- Two-Stage Triplet Loss Training with Curriculum Augmentation for Audio-Visual Retrieval [3.164991885881342]
Cross-modal retrieval models learn robust embedding spaces.
We introduce a novel approach rooted in curriculum learning to address this problem.
We propose a two-stage training paradigm that guides the model's learning process from semi-hard to hard triplets.
arXiv Detail & Related papers (2023-10-20T12:35:54Z)
- Cross-head mutual Mean-Teaching for semi-supervised medical image segmentation [6.738522094694818]
Semi-supervised medical image segmentation (SSMIS) has witnessed substantial advancements by leveraging limited labeled data and abundant unlabeled data.
Existing state-of-the-art (SOTA) methods encounter challenges in accurately predicting labels for the unlabeled data.
We propose a novel Cross-head mutual mean-teaching Network (CMMT-Net) incorporating strong-weak data augmentation.
arXiv Detail & Related papers (2023-10-08T09:13:04Z)
- Few-Shot Classification with Contrastive Learning [10.236150550121163]
We propose a novel contrastive learning-based framework that seamlessly integrates contrastive learning into both stages.
In the meta-training stage, we propose a cross-view episodic training mechanism to perform the nearest centroid classification on two different views of the same episode.
These two strategies force the model to overcome the bias between views and promote the transferability of representations.
arXiv Detail & Related papers (2022-09-17T02:39:09Z)
- PA-Seg: Learning from Point Annotations for 3D Medical Image Segmentation using Contextual Regularization and Cross Knowledge Distillation [14.412073730567137]
We propose to annotate a segmentation target with only seven points in 3D medical images, and design a two-stage weakly supervised learning framework PA-Seg.
In the first stage, we employ geodesic distance transform to expand the seed points to provide more supervision signal.
In the second stage, we use predictions obtained by the model pre-trained in the first stage as pseudo labels.
arXiv Detail & Related papers (2022-08-11T07:00:33Z)
- VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix [59.25846149124199]
This paper proposes a data augmentation method, namely cross-modal CutMix (CMC).
CMC transforms natural sentences from the textual view into a multi-modal view.
By attaching cross-modal noise on uni-modal data, it guides models to learn token-level interactions across modalities for better denoising.
arXiv Detail & Related papers (2022-06-17T17:56:47Z)
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)