ASCNet: Self-supervised Video Representation Learning with
Appearance-Speed Consistency
- URL: http://arxiv.org/abs/2106.02342v1
- Date: Fri, 4 Jun 2021 08:44:50 GMT
- Title: ASCNet: Self-supervised Video Representation Learning with
Appearance-Speed Consistency
- Authors: Deng Huang, Wenhao Wu, Weiwen Hu, Xu Liu, Dongliang He, Zhihua Wu,
Xiangmiao Wu, Mingkui Tan, Errui Ding
- Abstract summary: We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representations by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
- Score: 62.38914747727636
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We study self-supervised video representation learning, which is a
challenging task due to 1) a lack of labels for explicit supervision and 2)
unstructured and noisy visual information. Existing methods mainly use
contrastive loss with video clips as the instances and learn visual
representations by discriminating instances from each other, but they require
careful treatment of negative pairs by relying on large batch sizes, memory
banks, extra modalities, or customized mining strategies, inevitably including
noisy data. In this paper, we observe that the consistency between positive
samples is the key to learning robust video representations. Specifically, we
propose two tasks to learn appearance consistency and speed consistency, respectively.
The appearance consistency task aims to maximize the similarity between two
clips of the same video with different playback speeds. The speed consistency
task aims to maximize the similarity between two clips with the same playback
speed but different appearance information. We show that joint optimization of
the two tasks consistently improves the performance on downstream tasks, e.g.,
action recognition and video retrieval. Remarkably, for action recognition on
the UCF-101 dataset, we achieve 90.8% accuracy without using any additional
modalities or negative pairs for unsupervised pretraining, outperforming the
ImageNet supervised pre-trained model. Code and models will be made available.
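To make the two objectives concrete, here is a minimal PyTorch-style sketch of how the appearance-consistency and speed-consistency losses described above could be implemented. The encoders `f_app` and `f_spd` (e.g. a shared backbone with separate projection heads), the cosine-similarity objective, and the loss weight are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F


def appearance_consistency_loss(f_app, clip_a, clip_b):
    # clip_a, clip_b: two clips of the SAME video sampled at DIFFERENT
    # playback speeds; pull their appearance embeddings together.
    z_a = F.normalize(f_app(clip_a), dim=-1)
    z_b = F.normalize(f_app(clip_b), dim=-1)
    return -(z_a * z_b).sum(dim=-1).mean()


def speed_consistency_loss(f_spd, clip_a, clip_b):
    # clip_a, clip_b: two clips with the SAME playback speed but DIFFERENT
    # appearance (e.g. from different videos); pull their speed embeddings together.
    z_a = F.normalize(f_spd(clip_a), dim=-1)
    z_b = F.normalize(f_spd(clip_b), dim=-1)
    return -(z_a * z_b).sum(dim=-1).mean()


def joint_loss(f_app, f_spd, same_video_pair, same_speed_pair, weight=1.0):
    # Joint optimization of the two consistency tasks; `weight` is a placeholder.
    l_app = appearance_consistency_loss(f_app, *same_video_pair)
    l_spd = speed_consistency_loss(f_spd, *same_speed_pair)
    return l_app + weight * l_spd
```

In this sketch, the positive pairs would come from a clip sampler that varies playback speed within a video for the appearance task and pairs equal-speed clips from different videos for the speed task; no negative pairs are required.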
Related papers
- Self-supervised and Weakly Supervised Contrastive Learning for
Frame-wise Action Representations [26.09611987412578]
We introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representation in a self-supervised or weakly-supervised manner.
Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context.
Our method outperforms the previous state of the art by a large margin on downstream fine-grained action classification, while also offering faster inference.
arXiv Detail & Related papers (2022-12-06T16:42:22Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about the specified spatial or temporal augmentations, and in doing so achieves state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
- Learning to Track Instances without Video Annotations [85.9865889886669]
We introduce a novel semi-supervised framework by learning instance tracking networks with only a labeled image dataset and unlabeled video sequences.
We show that even when only trained with images, the learned feature representation is robust to instance appearance variations.
In addition, we integrate this module into single-stage instance segmentation and pose estimation frameworks.
arXiv Detail & Related papers (2021-04-01T06:47:41Z)
- Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
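For contrast with ASCNet, a rough, hypothetical sketch of the relative-speed idea described in the RSPNet summary: the label for a pair of clips is derived from the playback speeds used to sample them (slower / same / faster), so no manual annotation is needed. The head and label scheme below are assumptions for illustration, not the RSPNet implementation.

```python
import torch
import torch.nn as nn


class RelativeSpeedHead(nn.Module):
    # Hypothetical head: given embeddings of two clips, predict their
    # relative playback speed (0 = slower, 1 = same, 2 = faster).
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 3)

    def forward(self, z_a, z_b):
        return self.fc(torch.cat([z_a, z_b], dim=-1))


def relative_speed_label(speed_a, speed_b):
    # Labels come for free from the playback speeds chosen by the clip sampler.
    if speed_a < speed_b:
        return 0  # clip A is slower than clip B
    if speed_a == speed_b:
        return 1  # same speed
    return 2      # clip A is faster than clip B
```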