A Large-Scale Study on Unsupervised Spatiotemporal Representation
Learning
- URL: http://arxiv.org/abs/2104.14558v1
- Date: Thu, 29 Apr 2021 17:59:53 GMT
- Title: A Large-Scale Study on Unsupervised Spatiotemporal Representation
Learning
- Authors: Christoph Feichtenhofer, Haoqi Fan, Bo Xiong, Ross Girshick, Kaiming
He
- Abstract summary: We present a large-scale study on unsupervised representation learning from videos.
Our objective encourages temporally-persistent features in the same video.
We find that encouraging long-spanned persistency can be effective even if the timespan is 60 seconds.
- Score: 60.720251418816815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a large-scale study on unsupervised spatiotemporal representation
learning from videos. With a unified perspective on four recent image-based
frameworks, we study a simple objective that can easily generalize all these
methods to space-time. Our objective encourages temporally-persistent features
in the same video, and in spite of its simplicity, it works surprisingly well
across: (i) different unsupervised frameworks, (ii) pre-training datasets,
(iii) downstream datasets, and (iv) backbone architectures. We draw a series of
intriguing observations from this study, e.g., we discover that encouraging
long-spanned persistency can be effective even if the timespan is 60 seconds.
In addition to state-of-the-art results in multiple benchmarks, we report a few
promising cases in which unsupervised pre-training can outperform its
supervised counterpart. Code is made available at
https://github.com/facebookresearch/SlowFast
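To make the temporal-persistency objective concrete, the sketch below shows one common instantiation under stated assumptions: two clips sampled from different times of the same video are encoded and pulled together with a SimCLR-style InfoNCE loss, using the other videos in the batch as negatives. The `ClipEncoder` class, its dimensions, and the temperature are illustrative placeholders rather than the authors' implementation; the paper studies several image-based frameworks (contrastive and non-contrastive) generalized to space-time in this spirit.

```python
# Minimal sketch of a temporal-persistency objective (illustrative, not the
# authors' exact implementation): two clips from the same video, taken at
# different times, are encoded and pulled together with an InfoNCE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClipEncoder(nn.Module):
    """Toy 3D-conv backbone + projection head (stand-in for a real video model)."""

    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
        )
        self.proj = nn.Linear(32, dim)

    def forward(self, clip):  # clip: (B, 3, T, H, W)
        return F.normalize(self.proj(self.backbone(clip)), dim=1)


def temporal_persistency_loss(z1, z2, temperature=0.1):
    """InfoNCE: clips from the same video are positives, all others negatives."""
    logits = z1 @ z2.t() / temperature        # (B, B) cosine-similarity matrix
    targets = torch.arange(z1.shape[0])       # positives lie on the diagonal
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    encoder = ClipEncoder()
    # Two clips drawn from different times of the same videos (random tensors here).
    clip_a = torch.randn(4, 3, 8, 64, 64)
    clip_b = torch.randn(4, 3, 8, 64, 64)
    loss = temporal_persistency_loss(encoder(clip_a), encoder(clip_b))
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```

In a BYOL- or MoCo-style variant, one branch would instead be a momentum or stop-gradient target network, but the core idea of matching clips drawn from the same video, possibly up to 60 seconds apart, stays the same.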
Related papers
- A Large-scale Study of Spatiotemporal Representation Learning with a New
Benchmark on Action Recognition [14.226201098201244]
BEAR is a collection of 18 video datasets grouped into 5 categories (anomaly, gesture, daily, sports, and instructional).
We thoroughly evaluate 6 common spatiotemporal models pre-trained with both supervised and self-supervised learning.
Our observations suggest that current state-of-the-art models cannot reliably guarantee high performance on datasets close to real-world applications.
arXiv Detail & Related papers (2023-03-23T17:58:05Z) - Self-supervised and Weakly Supervised Contrastive Learning for
Frame-wise Action Representations [26.09611987412578]
We introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representation in a self-supervised or weakly-supervised manner.
Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context.
Our method outperforms the previous state-of-the-art by a large margin on downstream fine-grained action classification while also achieving faster inference.
arXiv Detail & Related papers (2022-12-06T16:42:22Z) - DyG2Vec: Efficient Representation Learning for Dynamic Graphs [26.792732615703372]
Temporal graph neural networks have shown promising results in learning inductive representations by automatically extracting temporal patterns.
We present an efficient yet effective attention-based encoder that leverages temporal edge encodings and window-based subgraph sampling to generate task-agnostic embeddings.
arXiv Detail & Related papers (2022-10-30T18:13:04Z) - Revisiting Contrastive Methods for Unsupervised Learning of Visual
Representations [78.12377360145078]
Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection.
In this paper, we first study how biases in the dataset affect existing methods.
We show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets.
arXiv Detail & Related papers (2021-06-10T17:59:13Z) - Unsupervised Learning on Monocular Videos for 3D Human Pose Estimation [121.5383855764944]
We use contrastive self-supervised learning to extract rich latent vectors from single-view videos.
We show that applying CSS only to the time-variant features, while also reconstructing the input and encouraging a gradual transition between temporally nearby and distant features, yields a rich latent space.
Our approach outperforms other unsupervised single-view methods and matches the performance of multi-view techniques.
arXiv Detail & Related papers (2020-12-02T20:27:35Z) - SeCo: Exploring Sequence Supervision for Unsupervised Representation
Learning [114.58986229852489]
In this paper, we explore basic and generic supervision signals in sequences from spatial, sequential, and temporal perspectives.
We derive a particular form of contrastive learning named SeCo.
SeCo shows superior results under the linear protocol on action recognition, untrimmed activity recognition and object tracking.
arXiv Detail & Related papers (2020-08-03T15:51:35Z) - PointContrast: Unsupervised Pre-training for 3D Point Cloud
Understanding [107.02479689909164]
In this work, we aim at facilitating research on 3D representation learning.
We measure the effect of unsupervised pre-training on a large source set of 3D scenes.
arXiv Detail & Related papers (2020-07-21T17:59:22Z) - Self-supervised Video Object Segmentation [76.83567326586162]
The objective of this paper is self-supervised representation learning, with the goal of solving semi-supervised video object segmentation (a.k.a. dense tracking).
We make the following contributions: (i) we propose to improve the existing self-supervised approach with a simple yet more effective memory mechanism for long-term correspondence matching; (ii) by augmenting the self-supervised approach with an online adaptation module, our method successfully alleviates tracker drift caused by spatial-temporal discontinuity; (iii) we demonstrate state-of-the-art results among self-supervised approaches on DAVIS-2017 and YouTube-VOS.
arXiv Detail & Related papers (2020-06-22T17:55:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.