Self-Supervised Learning for Videos: A Survey
- URL: http://arxiv.org/abs/2207.00419v3
- Date: Wed, 19 Jul 2023 16:00:08 GMT
- Title: Self-Supervised Learning for Videos: A Survey
- Authors: Madeline C. Schiappa and Yogesh S. Rawat and Mubarak Shah
- Abstract summary: Self-supervised learning has shown promise in both image and video domains.
In this survey, we review existing approaches to self-supervised learning, focusing on the video domain.
- Score: 70.37277191524755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The remarkable success of deep learning in various domains relies on the
availability of large-scale annotated datasets. However, obtaining annotations
is expensive and requires great effort, which is especially challenging for
videos. Moreover, the use of human-generated annotations leads to models with
biased learning and poor domain generalization and robustness. As an
alternative, self-supervised learning provides a way for representation
learning which does not require annotations and has shown promise in both image
and video domains. Different from the image domain, learning video
representations is more challenging due to the temporal dimension, which
brings in motion and other environmental dynamics. This also provides opportunities for
video-exclusive ideas that advance self-supervised learning in the video and
multimodal domain. In this survey, we provide a review of existing approaches
to self-supervised learning, focusing on the video domain. We summarize these
methods into four different categories based on their learning objectives: 1)
pretext tasks, 2) generative learning, 3) contrastive learning, and 4)
cross-modal agreement. We further introduce the commonly used datasets,
downstream evaluation tasks, insights into the limitations of existing works,
and the potential future directions in this area.
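For illustration, the sketch below shows one minimal instance of the first category above, a temporal-order pretext task in which the free supervisory signal is whether a clip's frames have been shuffled. This is an assumed, simplified example rather than a specific method reviewed in the survey; the backbone, head, and tensor shapes are placeholders.

```python
# Minimal sketch of a temporal-order pretext task (category 1 above).
# Assumed, simplified example: the backbone, head, and tensor shapes are
# placeholders, not a specific method reviewed in the survey.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrderVerificationHead(nn.Module):
    """Binary classifier: are the clip's frames in their original order?"""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, 2)

    def forward(self, clip_features):          # (B, feat_dim)
        return self.classifier(clip_features)  # (B, 2) logits

def make_pretext_batch(clips):
    """clips: (B, T, C, H, W). Shuffle the frames of a random half of the batch.

    The labels come for free from the shuffling itself; no human annotation
    is needed, which is the defining property of a pretext task.
    """
    labels = torch.randint(0, 2, (clips.size(0),))
    for i, shuffled in enumerate(labels.tolist()):
        if shuffled:
            perm = torch.randperm(clips.size(1))
            clips[i] = clips[i, perm]
    return clips, labels

# Usage with any video backbone producing (B, feat_dim) clip features:
# clips, labels = make_pretext_batch(clips)
# loss = F.cross_entropy(OrderVerificationHead()(backbone(clips)), labels)
```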
Related papers
- CDFSL-V: Cross-Domain Few-Shot Learning for Videos [58.37446811360741]
Few-shot video action recognition is an effective approach to recognizing new categories with only a few labeled examples.
Existing methods in video action recognition rely on large labeled datasets from the same domain.
We propose a novel cross-domain few-shot video action recognition method that leverages self-supervised learning and curriculum learning.
arXiv Detail & Related papers (2023-09-07T19:44:27Z) - Towards Contrastive Learning in Music Video Domain [46.29203572184694]
We create a dual encoder for the audio and video modalities and train it using a bidirectional contrastive loss.
For the experiments, we use an industry dataset containing 550,000 music videos as well as the public Million Song Dataset.
Our results indicate that pre-trained networks without contrastive fine-tuning outperform our contrastive learning approach when evaluated on both tasks.
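Below is a minimal sketch of the bidirectional contrastive objective described in the entry above, assuming a symmetric InfoNCE loss over paired audio and video embeddings; the encoders, batch pairing, and temperature are placeholders, not the paper's actual implementation.

```python
# Minimal sketch of a bidirectional (symmetric) audio-video contrastive loss.
# Assumption: a symmetric InfoNCE objective over paired embeddings; the
# temperature and encoders are placeholders, not the paper's implementation.
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(audio_emb, video_emb, temperature=0.1):
    """audio_emb, video_emb: (B, D) embeddings of paired music-video clips."""
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(video_emb, dim=1)
    logits = a @ v.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # matched pairs sit on the diagonal
    loss_a2v = F.cross_entropy(logits, targets)         # audio -> video direction
    loss_v2a = F.cross_entropy(logits.t(), targets)     # video -> audio direction
    return 0.5 * (loss_a2v + loss_v2a)
```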
arXiv Detail & Related papers (2023-09-01T09:08:21Z) - A Large-Scale Analysis on Self-Supervised Video Representation Learning [15.205738030787673]
We study five different aspects of self-supervised learning important for videos: 1) dataset size, 2) complexity, 3) data distribution, 4) data noise, and 5) feature analysis.
We present several interesting insights from this study which span across different properties of pretraining and target datasets, pretext-tasks, and model architectures.
We propose an approach that requires a limited amount of training data and outperforms existing state-of-the-art approaches that use 10x the pretraining data.
arXiv Detail & Related papers (2023-06-09T16:27:14Z) - A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story understanding benchmarks, we publicly release the first dataset for persuasion strategy identification, a crucial task in computational social science.
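The hypothetical sketch below illustrates the verbalize-then-reason recipe described in this entry: captions generated from short clips are concatenated into a story, and a text model answers questions over that story instead of the raw video. Here `caption_clip` and `answer_from_text` are placeholder stubs standing in for any off-the-shelf captioner and language model; they are not the components used in the paper.

```python
# Hypothetical sketch of a verbalize-then-reason pipeline for zero-shot video
# understanding. caption_clip() and answer_from_text() are placeholder stubs
# for any off-the-shelf captioning and language model; they are not the
# components used in the paper.
from typing import Callable, List, Sequence

def verbalize_video(clips: Sequence, caption_clip: Callable) -> str:
    """Turn a long video, split into short clips, into a natural-language story."""
    captions: List[str] = [caption_clip(clip) for clip in clips]
    return " ".join(captions)

def zero_shot_video_task(clips: Sequence, question: str,
                         caption_clip: Callable, answer_from_text: Callable) -> str:
    """Answer a question about the video using only its generated story."""
    story = verbalize_video(clips, caption_clip)
    return answer_from_text(question=question, context=story)
```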
arXiv Detail & Related papers (2023-05-16T19:13:11Z) - InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets spanning a wide range of tasks, including video action recognition/detection, video-language alignment, and open-world video applications.
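A minimal sketch of the masked video modeling objective named above follows, assuming random masking of patch/tube tokens and an L2 reconstruction loss on the masked positions; the masking ratio, token shapes, and encoder-decoder are placeholders, not InternVideo's actual configuration.

```python
# Minimal sketch of masked video modeling. Assumption: random masking of
# video patch tokens with an L2 reconstruction loss on masked positions;
# the ratio, shapes, and model are placeholders, not InternVideo's setup.
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(tokens, encoder_decoder, mask_ratio=0.9):
    """tokens: (B, N, D) patch/tube tokens extracted from a video clip."""
    B, N, _ = tokens.shape
    num_masked = int(mask_ratio * N)
    # Choose a random subset of token positions to mask in each sample.
    noise = torch.rand(B, N, device=tokens.device)
    masked_idx = noise.argsort(dim=1)[:, :num_masked]
    mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, masked_idx, True)
    corrupted = tokens.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out masked tokens
    # The model reconstructs every token from the corrupted input; the loss is
    # computed only on the masked positions.
    pred = encoder_decoder(corrupted)                         # (B, N, D)
    return F.mse_loss(pred[mask], tokens[mask])
```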
arXiv Detail & Related papers (2022-12-06T18:09:49Z) - A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning-based approaches have been applied to video segmentation and deliver compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z) - Learning Object Manipulation Skills via Approximate State Estimation from Real Videos [47.958512470724926]
Humans are adept at learning new tasks by watching a few instructional videos.
On the other hand, robots that learn new actions either require a lot of effort through trial and error, or use expert demonstrations that are challenging to obtain.
In this paper, we explore a method that facilitates learning object manipulation skills directly from videos.
arXiv Detail & Related papers (2020-11-13T08:53:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.