Uncovering the Hidden Dynamics of Video Self-supervised Learning under Distribution Shifts
- URL: http://arxiv.org/abs/2306.02014v2
- Date: Mon, 30 Oct 2023 12:50:38 GMT
- Title: Uncovering the Hidden Dynamics of Video Self-supervised Learning under Distribution Shifts
- Authors: Pritam Sarkar, Ahmad Beirami, Ali Etemad
- Abstract summary: We study the behavior of six popular self-supervised methods (v-SimCLR, v-MoCo, v-BYOL, v-SimSiam, v-DINO, v-MAE) in response to various forms of natural distribution shift.
Our study uncovers a series of intriguing findings and interesting behaviors of VSSL methods.
- Score: 39.080610060557476
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video self-supervised learning (VSSL) has made significant progress in recent
years. However, the exact behavior and dynamics of these models under different
forms of distribution shift are not yet known. In this paper, we
comprehensively study the behavior of six popular self-supervised methods
(v-SimCLR, v-MoCo, v-BYOL, v-SimSiam, v-DINO, v-MAE) in response to various
forms of natural distribution shift, i.e., (i) context shift, (ii) viewpoint
shift, (iii) actor shift, (iv) source shift, (v) generalizability to unknown
classes (zero-shot), and (vi) open-set recognition. To perform this extensive
study, we carefully craft a test bed consisting of 17 in-distribution and
out-of-distribution benchmark pairs using available public datasets and a
series of evaluation protocols to stress-test the different methods under the
intended shifts. Our study uncovers a series of intriguing findings and
interesting behaviors of VSSL methods. For instance, we observe that while
video models generally struggle with context shifts, v-MAE and supervised
learning exhibit more robustness. Moreover, our study shows that v-MAE is a
strong temporal learner, whereas the contrastive methods v-SimCLR and v-MoCo
exhibit strong performance under viewpoint shifts. When studying the notion
of open-set recognition, we notice a trade-off between closed-set and open-set
recognition performance if the pretrained VSSL encoders are used without
finetuning. We hope that our work will contribute to the development of robust
video representation learning frameworks for various real-world scenarios. The
project page and code are available at: https://pritamqu.github.io/OOD-VSSL.
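As a rough illustration of the stress-testing protocol described in the abstract, the sketch below trains a linear probe on features from a frozen, pretrained video encoder using in-distribution data, then compares accuracy on the paired out-of-distribution benchmark. The encoder, datasets, feature dimension, and hyperparameters are hypothetical placeholders, not the paper's actual code (which is available at the project page above).

```python
# Minimal sketch (PyTorch): stress-testing a frozen video encoder under
# distribution shift. `encoder`, the datasets, and all dimensions are
# hypothetical placeholders, not the paper's implementation.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def linear_probe_eval(encoder, train_id, test_id, test_ood,
                      feat_dim=768, num_classes=100, epochs=10):
    encoder.eval()                           # freeze the pretrained VSSL encoder
    for p in encoder.parameters():
        p.requires_grad = False

    probe = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # Train the linear probe on in-distribution clips only.
    for _ in range(epochs):
        for clips, labels in DataLoader(train_id, batch_size=64, shuffle=True):
            with torch.no_grad():
                feats = encoder(clips)       # (B, feat_dim) clip embeddings
            loss = loss_fn(probe(feats), labels)
            opt.zero_grad(); loss.backward(); opt.step()

    @torch.no_grad()
    def accuracy(ds):
        correct = total = 0
        for clips, labels in DataLoader(ds, batch_size=64):
            preds = probe(encoder(clips)).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
        return correct / total

    # The gap between the two numbers quantifies robustness to the shift.
    return accuracy(test_id), accuracy(test_ood)
```

An open-set variant of this evaluation would additionally threshold the probe's softmax confidence to reject clips from unknown classes, which is where the closed-set/open-set trade-off mentioned in the abstract is measured.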
Related papers
- Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators [0.0]
We show how a large language model (LLM) can efficiently coordinate multiple vision-language models (VLMs) through natural language communication.
Our study investigates whether the same methodology can be applied to surveillance videos for action recognition.
arXiv Detail & Related papers (2024-07-20T10:26:28Z) - Harnessing Large Language Models for Training-free Video Anomaly Detection [34.76811491190446]
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video.
Training-based methods tend to be domain-specific, which makes them costly for practical deployment.
We propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm.
arXiv Detail & Related papers (2024-04-01T09:34:55Z) - A Probabilistic Model Behind Self-Supervised Learning [53.64989127914936]
In self-supervised learning (SSL), representations are learned via an auxiliary task without annotated labels.
We present a generative latent variable model for self-supervised learning.
We show that several families of discriminative SSL, including contrastive methods, induce a comparable distribution over representations.
arXiv Detail & Related papers (2024-02-02T13:31:17Z) - Teaching Matters: Investigating the Role of Supervision in Vision
Transformers [32.79398665600664]
We show that Vision Transformers (ViTs) learn a diverse range of behaviors in terms of their attention, representations, and downstream performance.
We also discover ViT behaviors that are consistent across supervision, including the emergence of Offset Local Attention Heads.
Our analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method.
arXiv Detail & Related papers (2022-12-07T18:59:45Z) - SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video
Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules (see the sketch after this list).
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z) - SCVRL: Shuffled Contrastive Video Representation Learning [28.06521069427918]
SCVRL is a contrastive-based framework for self-supervised learning for videos.
We reformulate the popular shuffling pretext task within a modern contrastive learning paradigm.
Our network has a natural capacity to learn motion in self-supervised settings and achieves strong performance, outperforming CVRL on four benchmarks (see the sketch after this list).
arXiv Detail & Related papers (2022-05-24T01:24:47Z) - Revisiting Contrastive Methods for Unsupervised Learning of Visual
Representations [78.12377360145078]
Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection.
In this paper, we first study how biases in the dataset affect existing methods.
We show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets.
arXiv Detail & Related papers (2021-06-10T17:59:13Z) - CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z) - Exploit Clues from Views: Self-Supervised and Regularized Learning for
Multiview Object Recognition [66.87417785210772]
This work investigates the problem of multiview self-supervised learning (MV-SSL).
A novel surrogate task for self-supervised learning is proposed by pursuing an "object invariant" representation.
Experiments show that the recognition and retrieval results using view invariant prototype embedding (VISPE) outperform other self-supervised learning methods.
arXiv Detail & Related papers (2020-03-28T07:06:06Z) - A Multi-view Perspective of Self-supervised Learning [24.14738533504335]
Self-supervised learning (SSL) has recently gained widespread attention; it usually introduces a pretext task that requires no manual annotation of data.
In this paper, we borrow a multi-view perspective to decouple a class of popular pretext tasks into a combination of view data augmentation (VDA) and view label classification (VLC).
Specifically, a simple multi-view learning framework (SSL-MV) is designed, which assists the feature learning of downstream tasks (the original view) through the same tasks on the augmented views.
arXiv Detail & Related papers (2020-02-22T13:26:00Z)