A Large-Scale Analysis on Self-Supervised Video Representation Learning
- URL: http://arxiv.org/abs/2306.06010v2
- Date: Mon, 20 Nov 2023 14:26:00 GMT
- Title: A Large-Scale Analysis on Self-Supervised Video Representation Learning
- Authors: Akash Kumar, Ashlesha Kumar, Vibhav Vineet, Yogesh Singh Rawat
- Abstract summary: We study five aspects of self-supervised learning important for videos: 1) dataset size, 2) complexity, 3) data distribution, 4) data noise, and 5) feature analysis.
We present several interesting insights from this study which span across different properties of pretraining and target datasets, pretext-tasks, and model architectures.
We propose an approach that requires a limited amount of training data and outperforms existing state-of-the-art approaches that use 10x the pretraining data.
- Score: 15.205738030787673
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised learning is an effective way for label-free model
pre-training, especially in the video domain where labeling is expensive.
Existing self-supervised works in the video domain use varying experimental
setups to demonstrate their effectiveness, and with no standard benchmark,
comparison across approaches becomes challenging. In this work, we first
provide a benchmark that enables comparison of existing approaches on the same
ground. Next, we study five aspects of self-supervised learning that are
important for videos: 1) dataset size, 2) complexity, 3) data distribution, 4)
data noise, and 5) feature analysis. To facilitate this study, we focus on
seven different methods along with seven different network architectures and
perform an extensive set of experiments on five different datasets, with
evaluation on two different downstream tasks. We present several interesting
insights from this study, spanning properties of pretraining and target
datasets, pretext tasks, and model architectures, among others. We further put
some of these insights to the test and propose an approach that requires a
limited amount of training data and outperforms existing state-of-the-art
approaches that use 10x the pretraining data. We believe this work will pave
the way toward a better understanding of self-supervised pretext tasks in
video representation learning.
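As background for readers unfamiliar with pretext tasks: they derive supervision from the data itself rather than from human labels. The abstract does not name a specific task, so purely as an illustration, the sketch below generates labels for temporal-order verification, one common video pretext task, where a model learns to detect whether a clip's frames have been shuffled. Function and parameter names are hypothetical, not from the paper.

```python
import random

def make_order_sample(frames, shuffle_prob=0.5, rng=None):
    """Return (clip, label): label 1 if frame order is preserved, 0 if shuffled.

    `frames` is any sequence of frame identifiers. A model trained to
    predict this label gets supervision for free, with no human annotation.
    """
    rng = rng or random.Random()
    clip = list(frames)
    if rng.random() < shuffle_prob:
        shuffled = clip[:]
        # Reshuffle until the order actually changes (possible for len > 1).
        while shuffled == clip and len(clip) > 1:
            rng.shuffle(shuffled)
        return shuffled, 0  # temporally shuffled
    return clip, 1  # original order preserved
```

In practice the frame identifiers would index real video frames, and the binary label would drive a standard classification loss during pretraining.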
Related papers
- An Experimental Comparison Of Multi-view Self-supervised Methods For Music Tagging [6.363158395541767]
Self-supervised learning has emerged as a powerful way to pre-train generalizable machine learning models on large amounts of unlabeled data.
In this study, we investigate and compare the performance of new self-supervised methods for music tagging.
arXiv Detail & Related papers (2024-04-14T07:56:08Z)
- A Review of Machine Learning Methods Applied to Video Analysis Systems [3.518774226658318]
The paper provides a survey of the development of machine-learning techniques for video analysis.
We provide summaries of the development of self-supervised learning, semi-supervised learning, active learning, and zero-shot learning for applications in video analysis.
arXiv Detail & Related papers (2023-12-08T20:24:03Z)
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z)
- Self-Supervised Learning for Videos: A Survey [70.37277191524755]
Self-supervised learning has shown promise in both image and video domains.
In this survey, we provide a review of existing approaches on self-supervised learning focusing on the video domain.
arXiv Detail & Related papers (2022-06-18T00:26:52Z)
- How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning? [19.920980847895233]
We investigate how sensitive video self-supervised learning is to the currently used benchmark convention.
Our comprehensive set of over 500 experiments reveals that current benchmarks in video self-supervised learning are not a good indicator of generalization.
arXiv Detail & Related papers (2022-03-27T06:32:55Z)
- Self-Supervised Visual Representation Learning Using Lightweight Architectures [0.0]
In self-supervised learning, a model is trained to solve a pretext task, using a data set whose annotations are created by a machine.
We critically examine the most notable pretext tasks to extract features from image data.
We study the performance of various self-supervised techniques keeping all other parameters uniform.
arXiv Detail & Related papers (2021-10-21T14:13:10Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos [69.61522804742427]
This paper proposes a self-supervised training framework that learns a common multimodal embedding space.
We extend the concept of instance-level contrastive learning with a multimodal clustering step to capture semantic similarities across modalities.
The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains.
arXiv Detail & Related papers (2021-04-26T15:55:01Z)
- Contrasting Contrastive Self-Supervised Representation Learning Models [29.1857781719894]
We analyze contrastive approaches as one of the most successful and popular variants of self-supervised representation learning.
We examine over 700 training experiments including 30 encoders, 4 pre-training datasets and 20 diverse downstream tasks.
arXiv Detail & Related papers (2021-03-25T17:40:38Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To keep training tractable, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
- Comprehensive Instructional Video Analysis: The COIN Dataset and Performance Evaluation [100.68317848808327]
We present a large-scale dataset named as "COIN" for COmprehensive INstructional video analysis.
The COIN dataset contains 11,827 videos of 180 tasks in 12 domains related to our daily life.
With a newly developed toolbox, all the videos are annotated efficiently with a series of step labels and the corresponding temporal boundaries.
arXiv Detail & Related papers (2020-03-20T16:59:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.