How Incomplete is Contrastive Learning? An Inter-intra Variant Dual
Representation Method for Self-supervised Video Recognition
- URL: http://arxiv.org/abs/2107.01194v2
- Date: Mon, 5 Jul 2021 02:07:19 GMT
- Title: How Incomplete is Contrastive Learning? An Inter-intra Variant Dual
Representation Method for Self-supervised Video Recognition
- Authors: Lin Zhang, Qi She, Zhengyang Shen, Changhu Wang
- Abstract summary: We propose to learn dual representations for each clip which encode intra-variance through a shuffle-rank pretext task.
Experimental results show that our method plays an essential role in balancing inter and intra variances.
Our method achieves $\textbf{82.0\%}$ and $\textbf{51.2\%}$ downstream classification accuracy on UCF101 and HMDB51 test sets respectively.
- Score: 27.844863353375896
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive learning applied to self-supervised representation learning has
seen a resurgence in deep models. In this paper, we find that existing
contrastive learning based solutions for self-supervised video recognition
focus on inter-variance encoding but ignore the intra-variance existing in
clips within the same video. We thus propose to learn dual representations for
each clip which (i) encode intra-variance through a shuffle-rank
pretext task; (ii) encode inter-variance through a temporal
coherent contrastive loss. Experimental results show that our method plays an
essential role in balancing inter and intra variances and brings consistent
performance gains on multiple backbones and contrastive learning frameworks.
Integrated with SimCLR and pretrained on Kinetics-400, our method achieves
$\textbf{82.0\%}$ and $\textbf{51.2\%}$ downstream classification accuracy on
UCF101 and HMDB51 test sets respectively and $\textbf{46.1\%}$ video retrieval
accuracy on UCF101, outperforming both pretext-task based and contrastive
learning based counterparts.
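Reading only from the abstract, the two objectives could look roughly like the sketch below. The `encoder`, the two heads, the hinge margin, and the plain InfoNCE form are illustrative assumptions, not the authors' implementation.
```python
import torch
import torch.nn.functional as F

def shuffle_rank_loss(encoder, intra_head, clip, margin=0.5):
    """Shuffle-rank pretext sketch: an ordered clip should score higher
    than a frame-shuffled copy of itself."""
    perm = torch.randperm(clip.size(2))               # shuffle the temporal axis
    shuffled = clip[:, :, perm]                       # clip: (B, C, T, H, W)
    s_pos = intra_head(encoder(clip)).squeeze(-1)     # (B,) ordering scores
    s_neg = intra_head(encoder(shuffled)).squeeze(-1)
    return F.relu(margin - (s_pos - s_neg)).mean()    # pairwise ranking hinge

def temporal_coherent_contrastive_loss(encoder, inter_head, clip_a, clip_b, tau=0.07):
    """Inter-variance sketch: plain InfoNCE in which two clips of the same
    video are the positive pair and other videos in the batch are negatives."""
    z_a = F.normalize(inter_head(encoder(clip_a)), dim=1)   # (B, D)
    z_b = F.normalize(inter_head(encoder(clip_b)), dim=1)
    logits = z_a @ z_b.t() / tau                  # (B, B) cosine similarities
    labels = torch.arange(z_a.size(0), device=z_a.device)   # diagonal positives
    return F.cross_entropy(logits, labels)
```
The two losses would presumably be combined over clips sampled from the same video, which is where the inter/intra balancing claimed above would come from.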
Related papers
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network dubbed DOAD to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition [68.53072549422775]
We propose a student-teacher semi-supervised learning framework, TimeBalance.
We distill the knowledge from a temporally-invariant and a temporally-distinctive teacher.
Our method achieves state-of-the-art performance on three action recognition benchmarks.
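A minimal sketch of distilling from two teachers at once, assuming logit-level distillation and a fixed blending weight (the paper's actual balancing may be adaptive):
```python
import torch.nn.functional as F

def dual_teacher_distillation_loss(student_logits, inv_teacher_logits,
                                   dist_teacher_logits, w_inv=0.5, tau=2.0):
    """Blend a temporally-invariant and a temporally-distinctive teacher
    into one soft target for the student (w_inv is a placeholder weight)."""
    log_p = F.log_softmax(student_logits / tau, dim=1)
    q_inv = F.softmax(inv_teacher_logits / tau, dim=1)
    q_dst = F.softmax(dist_teacher_logits / tau, dim=1)
    q = w_inv * q_inv + (1.0 - w_inv) * q_dst        # blended teacher target
    return F.kl_div(log_p, q, reduction="batchmean") * tau * tau
```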
arXiv Detail & Related papers (2023-03-28T19:28:54Z)
- SELF-VS: Self-supervised Encoding Learning For Video Summarization [6.21295508577576]
We propose a novel self-supervised video representation learning method using knowledge distillation to pre-train a transformer encoder.
Our method matches its semantic video representation, which is constructed with respect to frame importance scores, to a representation derived from a CNN trained on video classification.
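A hypothetical reading of that matching step, with importance-weighted pooling standing in for whatever construction the paper actually uses:
```python
import torch
import torch.nn.functional as F

def importance_weighted_matching_loss(frame_feats, importance, cnn_feat):
    """Pool frame features (B, T, D) with frame importance scores (B, T),
    then match the pooled vector to a feature from a classification CNN."""
    w = torch.softmax(importance, dim=1).unsqueeze(-1)   # (B, T, 1)
    pooled = (w * frame_feats).sum(dim=1)                # (B, D)
    return F.mse_loss(F.normalize(pooled, dim=1),
                      F.normalize(cnn_feat, dim=1))
```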
arXiv Detail & Related papers (2023-03-28T14:08:05Z)
- Self-Supervised Video Representation Learning with Meta-Contrastive Network [10.768575680990415]
We propose a Meta-Contrastive Network (MCN) to enhance the learning ability of self-supervised approaches.
For two downstream tasks, i.e., video action recognition and video retrieval, MCN outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2021-08-19T01:21:13Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
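One way to instantiate that positive-sample consistency, assuming an encoder that accepts clips of different lengths; the speed sampling here is illustrative:
```python
import torch.nn.functional as F

def subsample(clip, rate):
    """Keep every `rate`-th frame to mimic a faster playback speed."""
    return clip[:, :, ::rate]                    # clip: (B, C, T, H, W)

def appearance_consistency_loss(encoder, clip):
    """The same video at two playback speeds should keep a similar
    appearance representation (1 - cosine similarity as the penalty)."""
    z_slow = F.normalize(encoder(clip), dim=1)
    z_fast = F.normalize(encoder(subsample(clip, 2)), dim=1)
    return (1.0 - (z_slow * z_fast).sum(dim=1)).mean()
```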
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Diffusion-Based Representation Learning [65.55681678004038]
We augment the denoising score matching framework to enable representation learning without any supervised signal.
The introduced diffusion-based representation learning instead relies on a new formulation of the denoising score matching objective.
Using the same approach, we propose to learn an infinite-dimensional latent code that achieves improvements of state-of-the-art models on semi-supervised image classification.
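For reference, the standard denoising score matching objective being reformulated is, with Gaussian corruption $q_\sigma$ (the paper's exact conditioning and weighting may differ):
```latex
\mathcal{L}_{\mathrm{DSM}}(\theta)
  = \mathbb{E}_{x,\; \tilde{x} \sim q_\sigma(\tilde{x} \mid x)}
    \Big[ \big\| s_\theta(\tilde{x}, \sigma)
      - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \big\|_2^2 \Big],
\qquad
\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2}.
```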
arXiv Detail & Related papers (2021-05-29T09:26:02Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about specified spatial or temporal augmentation, and in doing so also achieve state-of-the-art performance on a number of video benchmarks.
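A sketch of what "augmentation aware" could mean in code: condition the projection on a vector of augmentation parameters so the transform is encoded rather than discarded. The module below is an assumption about the mechanism, not the paper's architecture.
```python
import torch
import torch.nn as nn

class AugAwareProjection(nn.Module):
    """Project an embedding conditioned on the parameters of the
    augmentation (e.g. crop box, time shift) that produced the other view."""
    def __init__(self, dim=128, aug_dim=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim + aug_dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, z, aug_params):
        # z: (B, dim) clip embedding; aug_params: (B, aug_dim)
        return self.mlp(torch.cat([z, aug_params], dim=1))
```
The conditioned embedding would then enter a standard contrastive loss in place of the raw projection.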
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
- Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework [43.002621928500425]
We propose a self-supervised method to learn feature representations from videos.
Because temporal information is important for video representation, we extend the negative samples by introducing intra-negative samples.
We conduct experiments on video retrieval and video recognition tasks using the learned video representation.
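A minimal sketch of adding intra-negatives to InfoNCE, assuming (as the summary suggests) that they are generated by breaking the temporal order of the anchor clip:
```python
import torch
import torch.nn.functional as F

def contrastive_with_intra_negatives(encoder, clip_a, clip_b, tau=0.07):
    """clip_a/clip_b: two clips of the same video, (B, C, T, H, W).
    A frame-shuffled copy of the anchor is appended as an extra negative."""
    B = clip_a.size(0)
    intra_neg = clip_a[:, :, torch.randperm(clip_a.size(2))]
    z_a = F.normalize(encoder(clip_a), dim=1)
    z_b = F.normalize(encoder(clip_b), dim=1)
    z_n = F.normalize(encoder(intra_neg), dim=1)
    logits = z_a @ z_b.t()                            # (B, B), diagonal = positives
    intra = (z_a * z_n).sum(dim=1, keepdim=True)      # (B, 1) intra-negatives
    logits = torch.cat([logits, intra], dim=1) / tau  # (B, B + 1)
    labels = torch.arange(B, device=z_a.device)
    return F.cross_entropy(logits, labels)
```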
arXiv Detail & Related papers (2020-08-06T09:08:14Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The cross-modal pair discrimination (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (-300k) to demonstrate its effectiveness.
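The pair discrimination itself is plausibly a symmetric cross-modal InfoNCE over matched (video, text) embeddings; a generic sketch, not the paper's exact objective:
```python
import torch
import torch.nn.functional as F

def video_text_pair_discrimination(video_emb, text_emb, tau=0.07):
    """Matched (video, text) pairs sit on the diagonal; mismatched pairs
    within the batch act as negatives, in both retrieval directions."""
    v = F.normalize(video_emb, dim=1)                 # (B, D)
    t = F.normalize(text_emb, dim=1)                  # (B, D)
    logits = v @ t.t() / tau                          # (B, B)
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```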
arXiv Detail & Related papers (2020-01-16T08:28:57Z)