Fine-grained Multi-Modal Self-Supervised Learning
- URL: http://arxiv.org/abs/2112.12182v1
- Date: Wed, 22 Dec 2021 19:17:45 GMT
- Title: Fine-grained Multi-Modal Self-Supervised Learning
- Authors: Duo Wang, Salah Karout
- Abstract summary: Multi-Modal Self-Supervised Learning from videos has been shown to improve models' performance on various downstream tasks.
Such pre-training requires large batch sizes and large amounts of computational resources due to the noise present in uncurated data.
We propose a fine-grained multi-modal self-supervised training scheme that computes the similarity between embeddings at a finer scale.
- Score: 4.850800439026724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-Modal Self-Supervised Learning from videos has been shown to improve
models' performance on various downstream tasks. However, such Self-Supervised
pre-training requires large batch sizes and large amounts of computational
resources due to the noise present in uncurated data. This is partly because
the prevalent training scheme operates in a coarse-grained setting, in which
vectors representing whole video clips or natural-language sentences are used
to compute similarity. Such a scheme makes training noisy, since parts of a
video clip can be entirely uncorrelated with the other-modality input, such as
the text description. In this paper, we propose a fine-grained multi-modal
self-supervised training scheme that computes similarity between embeddings at
a finer scale (such as individual feature-map embeddings and phrase embeddings)
and uses attention mechanisms to reduce the weight of noisy pairs in the loss
function. We show that with the proposed pre-training scheme, we can train
smaller models, with smaller batch sizes and far less computation, and still
achieve downstream performance comparable to the state of the art on tasks
including action recognition and text-image retrieval.
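To make the fine-grained scheme concrete, below is a minimal, hypothetical PyTorch sketch of attention-weighted similarity between per-clip feature-map embeddings and per-caption phrase embeddings. The function names, tensor shapes, and the exact way attention down-weights noisy token pairs are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fine_grained_similarity(video_tokens, text_tokens, temperature=0.07):
    """Attention-weighted similarity between fine-grained embeddings.

    video_tokens: (B, Nv, D) per-feature-map embeddings of each clip
    text_tokens:  (B, Nt, D) per-phrase embeddings of each caption
    Returns a (B, B) clip-to-caption similarity matrix (scaled logits).
    """
    v = F.normalize(video_tokens, dim=-1)
    t = F.normalize(text_tokens, dim=-1)

    # Token-level similarities for every clip-caption pair: (B, B, Nv, Nt).
    token_sim = torch.einsum('bvd,ctd->bcvt', v, t)

    # Attention over text tokens for each video token: tokens that match
    # nothing in the caption receive low weight and contribute little.
    attn = token_sim.softmax(dim=-1)
    per_video_token = (attn * token_sim).sum(dim=-1)    # (B, B, Nv)

    # Aggregate over video tokens into one score per clip-caption pair.
    return per_video_token.mean(dim=-1) / temperature   # (B, B)

def fine_grained_contrastive_loss(video_tokens, text_tokens):
    """Symmetric InfoNCE over the fine-grained similarity matrix."""
    logits = fine_grained_similarity(video_tokens, text_tokens)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example: 4 clips with 16 feature-map tokens and captions with 12 phrase
# tokens, all projected into a shared 256-d space.
loss = fine_grained_contrastive_loss(torch.randn(4, 16, 256),
                                     torch.randn(4, 12, 256))
```

Unlike coarse-grained training, where a single clip vector is compared against a single sentence vector, each token pair here contributes according to its attention weight, so uncorrelated segments add little to the loss.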
Related papers
- Multimodal Unlearnable Examples: Protecting Data against Multimodal Contrastive Learning [53.766434746801366]
Multimodal contrastive learning (MCL) has shown remarkable advances in zero-shot classification by learning from millions of image-caption pairs crawled from the Internet.
Hackers may exploit image-text data for model training without authorization, potentially including personal and privacy-sensitive information.
Recent works propose generating unlearnable examples by adding imperceptible perturbations to training images to build shortcuts for protection.
We propose Multi-step Error Minimization (MEM), a novel optimization process for generating multimodal unlearnable examples.
arXiv Detail & Related papers (2024-07-23T09:00:52Z)
- Robust Multimodal Learning via Representation Decoupling [6.7678581401558295]
Multimodal learning has attracted increasing attention due to its practicality.
Existing methods tend to address it by learning a common subspace representation for different modality combinations.
We propose a novel Decoupled Multimodal Representation Network (DMRNet) to assist robust multimodal learning.
arXiv Detail & Related papers (2024-07-05T12:09:33Z)
- Learning from One Continuous Video Stream [70.30084026960819]
We introduce a framework for online learning from a single continuous video stream.
This poses great challenges given the high correlation between consecutive video frames.
We employ pixel-to-pixel modelling as a practical and flexible way to switch between pre-training and single-stream evaluation.
arXiv Detail & Related papers (2023-12-01T14:03:30Z)
- BIM: Block-Wise Self-Supervised Learning with Masked Image Modeling [18.861945284506028]
Masked image modeling (MIM) aims to extract valuable insights from image patches to enhance the feature extraction capabilities of the underlying deep neural network (DNN).
arXiv Detail & Related papers (2023-11-28T20:42:30Z)
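For readers unfamiliar with the objective named in the BIM entry above, here is a rough, generic sketch of masked image modeling (mask random patches, then reconstruct their pixels), not BIM's block-wise variant; the module and dimension choices are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMIM(nn.Module):
    """Toy masked-image-modeling head: embed patches, replace a random
    subset with a learned mask token, and reconstruct the masked pixels."""
    def __init__(self, patch_dim=768, embed_dim=256):
        super().__init__()
        self.embed = nn.Linear(patch_dim, embed_dim)
        self.mask_token = nn.Parameter(torch.zeros(embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(embed_dim, patch_dim)

    def forward(self, patches, mask_ratio=0.6):
        # patches: (B, N, patch_dim) flattened image patches
        x = self.embed(patches)
        mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio  # (B, N)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        recon = self.decoder(self.encoder(x))
        # The reconstruction loss is computed only on the masked positions.
        return F.mse_loss(recon[mask], patches[mask])

# Example: 8 images, 196 patches of 16x16x3 = 768 values each.
loss = TinyMIM()(torch.randn(8, 196, 768))
```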
- Routing with Self-Attention for Multimodal Capsule Networks [108.85007719132618]
We present a new multimodal capsule network that allows us to leverage the strength of capsules in the context of a multimodal learning framework.
To adapt the capsules to large-scale input data, we propose a novel routing by self-attention mechanism that selects relevant capsules.
This allows not only for robust training with noisy video data, but also for scaling up the capsule network compared to traditional routing methods.
arXiv Detail & Related papers (2021-12-01T19:01:26Z)
- Dense Unsupervised Learning for Video Segmentation [49.46930315961636]
We present a novel approach to unsupervised learning for video object segmentation (VOS).
Unlike previous work, our formulation makes it possible to learn dense feature representations directly in a fully convolutional regime.
Our approach exceeds the segmentation accuracy of previous work despite using significantly less training data and compute power.
arXiv Detail & Related papers (2021-11-11T15:15:11Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
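The clip-level instance discrimination mentioned in the ASCNet entry above is typically an InfoNCE-style loss over whole-clip embeddings. Here is a minimal, generic sketch (not ASCNet's appearance-speed objective), where two augmented clips of the same video form the positive pair and clips from other videos serve as negatives.

```python
import torch
import torch.nn.functional as F

def clip_instance_loss(z1, z2, temperature=0.1):
    """Generic clip-level instance discrimination (InfoNCE).

    z1, z2: (B, D) embeddings of two augmented clips of the same B videos.
    The matching row in the other view is the positive; all other rows
    in the batch are negatives.
    """
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                     # (B, B)
    targets = torch.arange(z1.size(0), device=z1.device)   # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Example: a batch of 8 videos embedded in a 128-d space.
loss = clip_instance_loss(torch.randn(8, 128), torch.randn(8, 128))
```

This whole-clip formulation is precisely the coarse-grained setting that the main paper's fine-grained scheme refines.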
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state of the art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
- General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference [34.47592026375839]
We show that some of the computational cost during inference can be amortized over the different tasks using a shared text encoder.
We also compare approaches for training such an encoder and show that encoders pre-trained over multiple tasks generalize well to unseen tasks.
arXiv Detail & Related papers (2020-04-29T16:11:26Z)
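As a rough illustration of the amortization idea in the last entry (encode the text once and reuse the embedding across several lightweight task heads), here is a minimal sketch; the encoder architecture, task names, and output sizes are illustrative assumptions rather than the paper's setup.

```python
import torch
import torch.nn as nn

class SharedEncoderMultiTask(nn.Module):
    """One text encoder amortized across several lightweight task heads:
    the (expensive) encoding is computed once per input and reused."""
    def __init__(self, vocab_size=30000, dim=256, tasks=("sentiment", "topic")):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.heads = nn.ModuleDict({t: nn.Linear(dim, 2) for t in tasks})

    def forward(self, token_ids):
        # Shared, task-agnostic text embedding (computed once per input).
        _, h = self.encoder(self.embed(token_ids))
        text_vec = h[-1]                                   # (B, dim)
        # Cheap per-task heads reuse the same embedding.
        return {name: head(text_vec) for name, head in self.heads.items()}

model = SharedEncoderMultiTask()
outputs = model(torch.randint(0, 30000, (4, 32)))          # 4 texts, 32 tokens each
```

Because the expensive encoder runs once per input regardless of how many heads consume its output, inference cost grows only with the cheap task-specific heads.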