How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?
- URL: http://arxiv.org/abs/2203.14221v1
- Date: Sun, 27 Mar 2022 06:32:55 GMT
- Title: How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?
- Authors: Fida Mohammad Thoker, Hazel Doughty, Piyush Bagad, Cees Snoek
- Abstract summary: We investigate how sensitive video self-supervised learning is to the currently used benchmark convention.
Our comprehensive set of over 500 experiments reveals that current benchmarks in video self-supervised learning are not a good indicator of generalization.
- Score: 19.920980847895233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the recent success of video self-supervised learning, much remains
to be understood about the generalization capability of these methods. In this paper, we
investigate how sensitive video self-supervised learning is to the currently
used benchmark convention and whether methods generalize beyond the canonical
evaluation setting. We do this across four different factors of sensitivity:
domain, samples, actions and task. Our comprehensive set of over 500
experiments, which encompasses 7 video datasets, 9 self-supervised methods and
6 video understanding tasks, reveals that current benchmarks in video
self-supervised learning are not a good indicator of generalization along these
sensitivity factors. Further, we find that self-supervised methods considerably
lag behind vanilla supervised pre-training, especially when domain shift is
large and the number of available downstream samples is low. From our analysis
we distill the SEVERE-benchmark, a subset of our experiments, and discuss its
implications for evaluating the generalizability of representations obtained by
existing and future self-supervised video learning methods.
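To make the evaluation protocol concrete, here is a minimal sketch of the kind of sensitivity grid the abstract describes: a frozen pretrained encoder probed across downstream domains and labelled-sample budgets. The dataset names and the synthetic feature generator are placeholders, not the authors' released SEVERE-benchmark code.

```python
# Hypothetical sketch of a domain x sample-budget sensitivity grid.
# "Domains" stand in for features extracted once with a frozen
# self-supervised video encoder on different downstream datasets.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_domain(dim=256, classes=10, noise=2.0):
    """Return a sampler for one downstream 'domain' (fixed class centers)."""
    centers = rng.normal(size=(classes, dim))
    def sample(n):
        y = rng.integers(0, classes, size=n)
        x = centers[y] + rng.normal(scale=noise, size=(n, dim))
        return x, y
    return sample

for domain in ["ucf101-like", "ssv2-like", "gym99-like"]:  # domain factor
    sample = make_domain()
    test_x, test_y = sample(2000)
    for budget in [100, 1000, 5000]:                       # samples factor
        train_x, train_y = sample(budget)
        probe = LogisticRegression(max_iter=1000).fit(train_x, train_y)
        print(f"{domain:12s} budget={budget:5d} "
              f"probe acc={probe.score(test_x, test_y):.3f}")
```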
Related papers
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- A Large-Scale Analysis on Self-Supervised Video Representation Learning [15.205738030787673]
We study five different aspects of self-supervised learning that are important for videos: 1) dataset size, 2) complexity, 3) data distribution, 4) data noise, and 5) feature analysis.
We present several interesting insights from this study which span across different properties of pretraining and target datasets, pretext-tasks, and model architectures.
We propose an approach that requires a limited amount of training data and outperforms existing state-of-the-art approaches that use 10x more pretraining data.
arXiv Detail & Related papers (2023-06-09T16:27:14Z)
- Unsupervised Embedding Quality Evaluation [6.72542623686684]
With SSL models, it is often unclear whether they will perform well when transferred to another domain.
Can we quantify how easy it is to linearly separate the data in a stable way?
We introduce a novel method based on recent advances in understanding the high-dimensional geometric structure of self-supervised learning.
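As a crude illustration of the question above, one could measure how consistently a linear classifier separates frozen embeddings across random splits; the sketch below does exactly that with scikit-learn. This is a hypothetical baseline, not the geometric method the paper actually introduces.

```python
# Hypothetical stability check (not the paper's method): how consistently
# does a linear classifier separate frozen SSL embeddings across splits?
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

def linear_separability_stability(embeddings, labels, folds=5, seeds=3):
    scores = []
    for seed in range(seeds):
        cv = StratifiedKFold(folds, shuffle=True, random_state=seed)
        clf = LinearSVC(max_iter=5000)
        scores.extend(cross_val_score(clf, embeddings, labels, cv=cv))
    scores = np.asarray(scores)
    return scores.mean(), scores.std()  # high mean + low std = stable

# Toy embeddings standing in for features from a pretrained SSL model.
rng = np.random.default_rng(0)
y = rng.integers(0, 5, size=1000)
x = rng.normal(size=(5, 64))[y] + rng.normal(scale=1.5, size=(1000, 64))
mean, std = linear_separability_stability(x, y)
print(f"linear probe accuracy: {mean:.3f} +/- {std:.3f}")
```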
arXiv Detail & Related papers (2023-05-26T01:06:44Z)
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z)
- Self-Supervised Learning for Videos: A Survey [70.37277191524755]
Self-supervised learning has shown promise in both image and video domains.
In this survey, we review existing approaches to self-supervised learning, focusing on the video domain.
arXiv Detail & Related papers (2022-06-18T00:26:52Z)
- Less than Few: Self-Shot Video Instance Segmentation [50.637278655763616]
We propose to automatically learn to find appropriate support videos given a query.
We tackle, for the first time, video instance segmentation in a self-shot (and few-shot) setting.
We provide strong baseline performances that utilize a novel transformer-based model.
arXiv Detail & Related papers (2022-04-19T13:14:43Z)
- Hierarchical Self-supervised Representation Learning for Movie Understanding [24.952866206036536]
We propose a novel hierarchical self-supervised pretraining strategy that separately pretrains each level of our hierarchical movie understanding model.
Specifically, we propose to pretrain the low-level video backbone using a contrastive learning objective, while pretraining the higher-level video contextualizer using an event mask prediction task.
We first show that our self-supervised pretraining strategies are effective and lead to improved performance on all tasks and metrics on the VidSitu benchmark [37] (e.g., improving semantic role prediction from 47 to 61 CIDEr score).
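The summary does not spell out the event mask prediction objective, but a generic BERT-style masked-prediction head over clip embeddings conveys the idea. The sketch below uses hypothetical layer sizes and a simple regression loss; it is not the authors' implementation.

```python
# Generic masked-prediction sketch over a sequence of clip embeddings,
# in the spirit of the event-mask objective described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedEventPredictor(nn.Module):
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.contextualizer = nn.TransformerEncoder(enc_layer, layers)

    def forward(self, clip_feats, mask):
        """clip_feats: (batch, seq, dim) from the low-level video backbone;
        mask: (batch, seq) bool, True where an event embedding is hidden."""
        x = torch.where(mask.unsqueeze(-1), self.mask_token, clip_feats)
        pred = self.contextualizer(x)
        # Regress the hidden embeddings; only masked positions contribute.
        return F.mse_loss(pred[mask], clip_feats[mask])

model = MaskedEventPredictor()
feats = torch.randn(2, 10, 256)   # toy stand-in for backbone features
mask = torch.rand(2, 10) < 0.25   # mask ~25% of the events
print(model(feats, mask).item())
```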
arXiv Detail & Related papers (2022-04-06T21:28:41Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
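For reference, the standard InfoNCE-style contrastive loss that such frameworks build on looks as follows. CoCon's cooperative weighting across views is not reproduced here; this is only the common single-pair objective.

```python
# Standard InfoNCE-style contrastive loss (sketch; not the CoCon code).
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two views of the same clips.
    Matching rows are positives; all other rows act as negatives."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(info_nce(z1, z2).item())
```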
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Sense and Learn: Self-Supervision for Omnipresent Sensors [9.442811508809994]
We present a framework named Sense and Learn for representation or feature learning from raw sensory data.
It consists of several auxiliary tasks that can learn high-level and broadly useful features entirely from unannotated data without any human involvement in the tedious labeling process.
Our methodology achieves results competitive with supervised approaches and, in most cases, closes the gap when the network is fine-tuned on the downstream tasks.
arXiv Detail & Related papers (2020-09-28T11:57:43Z)
- Self-supervised Video Object Segmentation [76.83567326586162]
The objective of this paper is self-supervised representation learning, with the goal of solving semi-supervised video object segmentation (a.k.a. dense tracking).
We make the following contributions: (i) we propose to improve the existing self-supervised approach with a simple yet more effective memory mechanism for long-term correspondence matching; (ii) by augmenting the self-supervised approach with an online adaptation module, our method successfully alleviates tracker drift caused by spatio-temporal discontinuity; (iii) we demonstrate state-of-the-art results among self-supervised approaches on DAVIS-2017 and YouTube-VOS.
arXiv Detail & Related papers (2020-06-22T17:55:59Z)