Video Representation Learning with Visual Tempo Consistency
- URL: http://arxiv.org/abs/2006.15489v2
- Date: Fri, 18 Dec 2020 03:02:33 GMT
- Title: Video Representation Learning with Visual Tempo Consistency
- Authors: Ceyuan Yang, Yinghao Xu, Bo Dai, Bolei Zhou
- Abstract summary: We show that visual tempo can serve as a self-supervision signal for video representation learning.
We propose to maximize the mutual information between representations of slow and fast videos via hierarchical contrastive learning.
- Score: 105.20094164316836
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual tempo, which describes how fast an action goes, has shown its
potential in supervised action recognition. In this work, we demonstrate that
visual tempo can also serve as a self-supervision signal for video
representation learning. We propose to maximize the mutual information between
representations of slow and fast videos via hierarchical contrastive learning
(VTHCL). Specifically, by sampling the same instance at slow and fast frame
rates respectively, we can obtain slow and fast video frames which share the
same semantics but contain different visual tempos. Video representations
learned from VTHCL achieve competitive performance under the
self-supervision evaluation protocol for action recognition on UCF-101 (82.1%)
and HMDB-51 (49.2%). Moreover, comprehensive experiments suggest that the
learned representations generalize well to other downstream tasks,
including action detection on AVA and action anticipation on EPIC-Kitchens.
Finally, we propose Instance Correspondence Map (ICM) to visualize the shared
semantics captured by contrastive learning.
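To make the core idea concrete, below is a minimal PyTorch-style sketch of slow/fast sampling and an InfoNCE-style contrastive objective, summed over several network stages per the abstract's "hierarchical" term. Function names, strides, and the temperature are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the slow/fast sampling and contrastive objective
# described in the abstract. Strides, temperature, and function names
# are illustrative assumptions, not the authors' released code.
import torch
import torch.nn.functional as F

def sample_slow_fast(video, slow_stride=8, fast_stride=2):
    """Sample the same instance at two frame rates: both clips share
    semantics but differ in visual tempo. video: (B, C, T, H, W)."""
    return video[:, :, ::slow_stride], video[:, :, ::fast_stride]

def info_nce(z_slow, z_fast, temperature=0.07):
    """Slow/fast clips of the same instance are positives; all other
    instances in the batch serve as negatives."""
    z_slow = F.normalize(z_slow, dim=1)
    z_fast = F.normalize(z_fast, dim=1)
    logits = z_slow @ z_fast.t() / temperature           # (B, B)
    targets = torch.arange(z_slow.size(0), device=z_slow.device)
    return F.cross_entropy(logits, targets)              # diagonal = positives

def hierarchical_loss(slow_feats, fast_feats):
    """Sum the contrastive loss over features taken from several
    network stages, matching the 'hierarchical' term in the abstract."""
    return sum(info_nce(s, f) for s, f in zip(slow_feats, fast_feats))
```

Because the only difference between the two views is the sampling rate, the objective pushes the representation toward semantics that are stable across visual tempos.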
Related papers
- Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z)
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
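For illustration, a frame-wise sequence contrastive loss between two correlated views might look like the sketch below; the exact SCL formulation is not given in the summary, so treating temporally aligned frames as positives is an assumption.

```python
# Hedged sketch of a frame-wise sequence contrastive loss between two
# correlated views; aligned frames as positives is an assumption,
# since the exact SCL formulation is not given above.
import torch
import torch.nn.functional as F

def sequence_contrastive_loss(view_a, view_b, temperature=0.1):
    """view_a, view_b: (T, D) frame-wise embeddings of two views."""
    a = F.normalize(view_a, dim=1)
    b = F.normalize(view_b, dim=1)
    logits = a @ b.t() / temperature                     # (T, T)
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetrized: each view retrieves its aligned frame in the other.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```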
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about specified spatial or temporal augmentations and, in doing so, achieves state-of-the-art performance on a number of video benchmarks.
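A hedged sketch of what "augmentation aware" conditioning could look like follows: the augmentation parameters of each view are embedded and concatenated with the clip features before projection. Module names and dimensions are illustrative, not the paper's architecture.

```python
# Illustrative sketch of conditioning a contrastive projection head on
# the augmentation parameters of each view, so the representation can
# encode (rather than discard) the applied transformation. Module
# names and dimensions are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class AugmentationConditionedHead(nn.Module):
    def __init__(self, feat_dim=512, aug_dim=8, out_dim=128):
        super().__init__()
        # aug_dim: length of the augmentation parameter vector, e.g.
        # crop box, flip flag, and temporal shift packed together.
        self.aug_mlp = nn.Sequential(nn.Linear(aug_dim, 64), nn.ReLU(),
                                     nn.Linear(64, 64))
        self.proj = nn.Linear(feat_dim + 64, out_dim)

    def forward(self, features, aug_params):
        a = self.aug_mlp(aug_params)                   # (B, 64)
        return self.proj(torch.cat([features, a], dim=1))
```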
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
- Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
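As a rough illustration of this idea, relative-speed labels can be built by sampling two clips from the same video at different strides, as in the sketch below; the strides and clip length are assumptions rather than RSPNet's exact sampling procedure.

```python
# Rough illustration of building relative-speed labels: sample two
# clips from the same video at different strides and label which is
# faster. The strides and clip length are assumptions, not RSPNet's
# exact sampling procedure.
import random
import torch

def make_relative_speed_pair(video, strides=(1, 2, 4, 8), clip_len=16):
    """video: (C, T, H, W), assumed long enough for the largest stride.
    Returns two clips and a label (0: clip1 slower, 1: same, 2: faster).
    A larger sampling stride means a faster apparent playback speed."""
    s1, s2 = random.choice(strides), random.choice(strides)

    def sample_clip(stride):
        start = random.randint(0, video.size(1) - clip_len * stride)
        return video[:, start:start + clip_len * stride:stride]

    label = 0 if s1 < s2 else (1 if s1 == s2 else 2)
    return sample_clip(s1), sample_clip(s2), label
```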
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
- Self-Supervised Visual Learning by Variable Playback Speeds Prediction of a Video [23.478555947694108]
We propose a self-supervised visual learning method by predicting the variable playback speeds of a video.
We learn meta-temporal visual variations in the video by leveraging how visual appearance changes with playback speed.
We also propose a new layer-dependable temporal group normalization method that can be applied to 3D convolutional networks.
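A minimal sketch of playback-speed prediction as a pretext task appears below; the classifier head and speed set are assumptions, and the proposed layer-dependable temporal group normalization is not reproduced here.

```python
# Minimal sketch of playback-speed prediction as self-supervision:
# pooled 3D-CNN clip features are classified into one of K playback
# speeds (e.g. 1x, 2x, 4x, 8x). The head and the speed set are
# assumptions; the layer-dependable temporal group normalization
# proposed in the paper is not reproduced here.
import torch
import torch.nn as nn

class SpeedPredictionHead(nn.Module):
    def __init__(self, feat_dim=512, num_speeds=4):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_speeds)

    def forward(self, clip_features):
        # clip_features: (B, feat_dim) pooled backbone features
        return self.classifier(clip_features)  # logits over speeds
```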
arXiv Detail & Related papers (2020-03-05T15:01:08Z)