Cross-modal Manifold Cutmix for Self-supervised Video Representation
Learning
- URL: http://arxiv.org/abs/2112.03906v3
- Date: Thu, 27 Jul 2023 18:02:40 GMT
- Title: Cross-modal Manifold Cutmix for Self-supervised Video Representation
Learning
- Authors: Srijan Das and Michael S. Ryoo
- Abstract summary: This paper focuses on designing video augmentation for self-supervised learning.
We first analyze the best strategy to mix videos to create a new augmented video sample.
We propose Cross-Modal Manifold Cutmix (CMMC) that inserts a video tesseract into another video tesseract in the feature space across two different modalities.
- Score: 50.544635516455116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive representation learning of videos relies heavily on the
availability of millions of unlabelled videos. This is practical for videos
available on the web, but acquiring videos at such a scale for real-world
applications is very expensive and laborious.
Therefore, in this paper we focus on designing video augmentation for
self-supervised learning. We first analyze the best strategy for mixing videos
to create a new augmented video sample. Then the question remains: can we make
use of the other modalities in videos for data mixing? To this end, we propose
Cross-Modal Manifold Cutmix (CMMC), which inserts a video tesseract into another
video tesseract in the feature space across two different modalities. We find
that our video mixing strategy STC-mix, i.e. preliminary mixing of videos
followed by CMMC across different modalities in a video, improves the quality
of learned video representations. We conduct thorough experiments on two
downstream tasks, action recognition and video retrieval, using two small-scale
video datasets, UCF101 and HMDB51. We also demonstrate the effectiveness of our
STC-mix on the NTU dataset, where domain knowledge is limited.
We show that the performance of our STC-mix on both downstream tasks is
on par with other self-supervised approaches while requiring less training
data.
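
The abstract describes CMMC only at a high level, so below is a minimal,
hypothetical sketch of a feature-space (manifold) cutmix between two video
feature tesseracts of shape (C, T, H, W), written in PyTorch. The function
name, the Beta-distributed mixing ratio, and the cuboid-sampling scheme are
illustrative assumptions borrowed from image CutMix, not details taken from
the paper.

    import torch

    def manifold_cutmix(feat_rgb, feat_flow, alpha=1.0):
        # Hypothetical sketch: paste a random spatio-temporal cuboid from one
        # modality's feature tesseract (e.g. optical flow) into another's
        # (e.g. RGB). Both tensors are assumed to have shape (C, T, H, W).
        assert feat_rgb.shape == feat_flow.shape
        _, t, h, w = feat_rgb.shape

        # Sample a mixing ratio lam and derive cuboid sides so the pasted
        # volume is roughly (1 - lam) of the whole tesseract, as in CutMix.
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        cut = (1.0 - lam) ** (1.0 / 3.0)
        ct = max(1, int(t * cut))
        ch = max(1, int(h * cut))
        cw = max(1, int(w * cut))

        # Random cuboid location inside the tesseract.
        t0 = torch.randint(0, t - ct + 1, (1,)).item()
        h0 = torch.randint(0, h - ch + 1, (1,)).item()
        w0 = torch.randint(0, w - cw + 1, (1,)).item()

        mixed = feat_rgb.clone()
        mixed[:, t0:t0 + ct, h0:h0 + ch, w0:w0 + cw] = \
            feat_flow[:, t0:t0 + ct, h0:h0 + ch, w0:w0 + cw]

        # Effective weight of the untouched content, which would be used to
        # mix the corresponding training targets.
        lam_adjusted = 1.0 - (ct * ch * cw) / (t * h * w)
        return mixed, lam_adjusted

A mixed tesseract produced this way would then be fed to the remaining layers
of the encoder, with lam_adjusted weighting the training targets; at which
layer the mixing happens and how targets are combined are design choices the
paper studies, not fixed by this sketch.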
Related papers
- Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation [54.21476271127356]
Divot is a Diffusion-Powered Video Tokenizer.
We demonstrate Divot-unaVic on video-to-text autoregression and text-to-video generation.
arXiv Detail & Related papers (2024-12-05T18:53:04Z) - T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs [102.66246727371583]
We develop a method called T2Vid to synthesize video-like samples to enrich the instruction diversity in the training corpus.
We find that the proposed scheme can boost the performance of long video understanding without training with long video samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z) - VideoCutLER: Surprisingly Simple Unsupervised Video Instance
Segmentation [87.13210748484217]
VideoCutLER is a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos.
We show the first competitive unsupervised learning results on the challenging YouTubeVIS 2019 benchmark, achieving 50.7% APvideo50.
VideoCutLER can also serve as a strong pretrained model for supervised video instance segmentation tasks, exceeding DINO by 15.9% on YouTubeVIS 2019 in terms of APvideo.
arXiv Detail & Related papers (2023-08-28T17:10:12Z) - InternVideo: General Video Foundation Models via Generative and
Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z) - ASCNet: Self-supervised Video Representation Learning with
Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z) - VideoMix: Rethinking Data Augmentation for Video Classification [29.923635550986997]
State-of-the-art video action classifiers often suffer from overfitting.
Recent data augmentation strategies have been reported to address the overfitting problems.
VideoMix lets a model learn beyond the object and scene biases and extract more robust cues for action recognition.
arXiv Detail & Related papers (2020-12-07T05:40:33Z) - Self-supervised Video Representation Learning Using Inter-intra
Contrastive Framework [43.002621928500425]
We propose a self-supervised method to learn feature representations from videos.
Because video representation is important, we extend negative samples by introducing intra-negative samples.
We conduct experiments on video retrieval and video recognition tasks using the learned video representation.
arXiv Detail & Related papers (2020-08-06T09:08:14Z)