Motion-Focused Contrastive Learning of Video Representations
- URL: http://arxiv.org/abs/2201.04029v1
- Date: Tue, 11 Jan 2022 16:15:45 GMT
- Title: Motion-Focused Contrastive Learning of Video Representations
- Authors: Rui Li and Yiheng Zhang and Zhaofan Qiu and Ting Yao and Dong Liu and
Tao Mei
- Abstract summary: Motion, the most distinct phenomenon in a video because it involves changes over time, has been unique and critical to the development of video representation learning.
We present a Motion-focused Contrastive Learning (MCL) method that exploits motion for both data augmentation and feature learning.
- Score: 94.93666741396444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motion, as the most distinct phenomenon in a video because it involves
changes over time, has been unique and critical to the development of video
representation learning. In this paper, we ask the question: how important is
motion, particularly for self-supervised video representation learning? To
this end, we compose a duet of exploiting motion for data augmentation and
feature learning in the regime of contrastive learning. Specifically, we
present a Motion-focused Contrastive Learning (MCL) method that regards this
duet as the foundation. On the one hand, MCL capitalizes on the optical flow of
each frame in a video to temporally and spatially sample the tubelets (i.e.,
sequences of associated frame patches across time) as data augmentations. On
the other hand, MCL further aligns gradient maps of the convolutional layers to
optical flow maps from spatial, temporal and spatio-temporal perspectives, in
order to ground motion information in feature learning. Extensive experiments
conducted on an R(2+1)D backbone demonstrate the effectiveness of our MCL. On
UCF101, the linear classifier trained on the representations learnt by MCL
achieves 81.91% top-1 accuracy, outperforming ImageNet supervised pre-training
by 6.78%. On Kinetics-400, MCL achieves 66.62% top-1 accuracy under the linear
protocol. Code is available at
https://github.com/YihengZhang-CV/MCL-Motion-Focused-Contrastive-Learning.
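The motion-guided sampling idea in the abstract can be illustrated with a minimal sketch. It assumes that a per-frame or per-pixel optical-flow magnitude is already available as a plain "motion score"; the function names `temporal_motion_sample` and `best_spatial_crop` are hypothetical, and the selection rules (sampling a clip in proportion to its total motion, then taking the crop window with the largest summed motion) are simplifications for illustration, not necessarily the rules used in the MCL code.

```python
import random


def temporal_motion_sample(frame_motion, clip_len, rng=None):
    """Sample a clip start index with probability proportional to the
    clip's total motion score (assumed rule, for illustration only)."""
    rng = rng or random.Random(0)
    n_starts = len(frame_motion) - clip_len + 1
    if n_starts <= 0:
        raise ValueError("video is shorter than clip_len")
    # Total motion inside each candidate clip window.
    scores = [sum(frame_motion[s:s + clip_len]) for s in range(n_starts)]
    if sum(scores) == 0:
        # No motion anywhere: fall back to uniform sampling.
        return rng.randrange(n_starts)
    return rng.choices(range(n_starts), weights=scores, k=1)[0]


def best_spatial_crop(motion_map, crop_h, crop_w):
    """Return (top, left) of the crop window with the largest summed
    motion score in a 2D map given as a list of rows."""
    height, width = len(motion_map), len(motion_map[0])
    best_score, best_pos = -1.0, (0, 0)
    for top in range(height - crop_h + 1):
        for left in range(width - crop_w + 1):
            score = sum(
                motion_map[r][c]
                for r in range(top, top + crop_h)
                for c in range(left, left + crop_w)
            )
            if score > best_score:
                best_score, best_pos = score, (top, left)
    return best_pos


# Toy example: motion is concentrated in frames 3-4 and in the
# centre-right region of the frame.
frame_motion = [0, 0, 0, 9, 9, 0]
motion_map = [
    [0, 0, 0, 0],
    [0, 0, 5, 5],
    [0, 0, 5, 5],
    [0, 0, 0, 0],
]
start = temporal_motion_sample(frame_motion, clip_len=2)
top_left = best_spatial_crop(motion_map, crop_h=2, crop_w=2)
```

Applying the same crop position to every frame of the sampled clip gives a fixed-position tubelet; following the motion across frames, as the abstract's "associated frame patches" suggests, would require tracking the crop over time, which this sketch leaves out.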
Related papers
- Just a Glimpse: Rethinking Temporal Information for Video Continual
Learning [58.7097258722291]
We propose a novel replay mechanism for effective video continual learning based on individual/single frames.
Under extreme memory constraints, video diversity plays a more significant role than temporal information.
Our method achieves state-of-the-art performance, outperforming the previous state-of-the-art by up to 21.49%.
arXiv Detail & Related papers (2023-05-28T19:14:25Z)
- Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z)
- Masked Video Distillation: Rethinking Masked Feature Modeling for
Self-supervised Video Representation Learning [123.63301596019522]
Masked video distillation (MVD) is a simple yet effective two-stage masked feature modeling framework for video representation learning.
For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks.
We design a spatial-temporal co-teaching method for MVD to leverage the advantage of different teachers.
arXiv Detail & Related papers (2022-12-08T18:59:59Z)
- Self-supervised Video Representation Learning with Motion-Aware Masked
Autoencoders [46.38458873424361]
Masked autoencoders (MAEs) have emerged recently as state-of-the-art self-supervised representation learners.
In this work we present a motion-aware variant -- MotionMAE.
Our model is designed to additionally predict the corresponding motion structure information over time.
arXiv Detail & Related papers (2022-10-09T03:22:15Z)
- Motion Sensitive Contrastive Learning for Self-supervised Video
Representation [34.854431881562576]
Motion Sensitive Contrastive Learning (MSCL) injects the motion information captured by optical flows into RGB frames to strengthen feature learning.
It includes Local Motion Contrastive Learning (LMCL), which applies frame-level contrastive objectives across the two modalities.
It also uses Flow Rotation Augmentation (FRA) to generate extra motion-shuffled negative samples and Motion Differential Sampling (MDS) to accurately screen training samples.
arXiv Detail & Related papers (2022-08-12T04:06:56Z)
- Self-Supervised Video Representation Learning with Motion-Contrastive
Perception [13.860736711747284]
The Motion-Contrastive Perception Network (MCPNet) consists of two branches, namely Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z)
- PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is an unsupervised SSL framework for selecting clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z)
- Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, but still suffer from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
arXiv Detail & Related papers (2021-03-23T03:06:26Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation
Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The cross-modal pair discrimination (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (-300k) to demonstrate their effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)