Self-Supervised Video Representation Learning with Motion-Contrastive
Perception
- URL: http://arxiv.org/abs/2204.04607v1
- Date: Sun, 10 Apr 2022 05:34:46 GMT
- Title: Self-Supervised Video Representation Learning with Motion-Contrastive
Perception
- Authors: Jinyu Liu, Ying Cheng, Yuejie Zhang, Rui-Wei Zhao, Rui Feng
- Abstract summary: The Motion-Contrastive Perception Network (MCPNet) consists of two
branches, namely, Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
- Score: 13.860736711747284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual-only self-supervised learning has achieved significant improvement in
video representation learning. Existing related methods encourage models to
learn video representations by utilizing contrastive learning or designing
specific pretext tasks. However, some models are likely to focus on the
background, which is unimportant for learning video representations. To
alleviate this problem, we propose a new view called long-range residual frame
to obtain more motion-specific information. Based on this, we propose the
Motion-Contrastive Perception Network (MCPNet), which consists of two branches,
namely, Motion Information Perception (MIP) and Contrastive Instance Perception
(CIP), to learn generic video representations by focusing on the changing areas
in videos. Specifically, the MIP branch aims to learn fine-grained motion
features, and the CIP branch performs contrastive learning to learn overall
semantic information for each instance. Experiments on two benchmark datasets,
UCF-101 and HMDB-51, show that our method outperforms current state-of-the-art
visual-only self-supervised approaches.
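The abstract describes the method only at a high level. As a rough illustration, the sketch below computes a motion-specific view by differencing frames that lie several steps apart (a stand-in for the "long-range residual frame") and contrasts it with the original RGB clip under a standard InfoNCE loss. The frame stride, the loss formulation, and the average-pooling stand-in encoders are assumptions made for illustration, not details taken from the paper.

```python
# Minimal sketch (assumed details, not the authors' implementation): a long-range
# residual view that suppresses static background, plus an InfoNCE-style contrastive
# objective between the RGB clip and its residual view.
import torch
import torch.nn.functional as F


def long_range_residual(clip: torch.Tensor, stride: int = 4) -> torch.Tensor:
    """Difference between frames that are `stride` steps apart.

    clip: (B, C, T, H, W). Static background largely cancels out, so the result
    emphasizes the changing (motion) regions.
    """
    return clip[:, :, stride:] - clip[:, :, :-stride]


def info_nce(query: torch.Tensor, key: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Standard InfoNCE loss; other samples in the batch serve as negatives."""
    q = F.normalize(query, dim=1)
    k = F.normalize(key, dim=1)
    logits = q @ k.t() / temperature                   # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    clips = torch.randn(8, 3, 16, 112, 112)            # batch of 16-frame RGB clips
    residuals = long_range_residual(clips, stride=4)   # motion-specific view
    # Stand-in for real backbones: global average pooling instead of a 3D CNN.
    z_rgb = clips.mean(dim=(2, 3, 4))
    z_res = residuals.mean(dim=(2, 3, 4))
    print(info_nce(z_rgb, z_res).item())
```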
Related papers
- MV2MAE: Multi-View Video Masked Autoencoders [33.61642891911761]
We present a method for self-supervised learning from synchronized multi-view videos.
We use a cross-view reconstruction task to inject geometry information in the model.
Our approach is based on the masked autoencoder (MAE) framework.
arXiv Detail & Related papers (2024-01-29T05:58:23Z) - Video-based Person Re-identification with Long Short-Term Representation
Learning [101.62570747820541]
Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapping cameras.
We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
arXiv Detail & Related papers (2023-08-07T16:22:47Z) - Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z) - Boosting Video Representation Learning with Multi-Faceted Integration [112.66127428372089]
Video content is multifaceted, consisting of objects, scenes, interactions or actions.
Existing datasets mostly label only one of the facets for model training, resulting in a video representation that is biased toward a single facet depending on the training dataset.
We propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that could reflect the full spectrum of video content.
arXiv Detail & Related papers (2022-01-11T16:14:23Z) - MoDist: Motion Distillation for Self-supervised Video Representation
Learning [27.05772951598066]
MoDist is a novel method to distill motion information into self-supervised video representations.
We show that MoDist focuses more on foreground motion regions and thus generalizes better to downstream tasks.
arXiv Detail & Related papers (2021-06-17T17:57:11Z) - ASCNet: Self-supervised Video Representation Learning with
Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z) - RSPNet: Relative Speed Perception for Unsupervised Video Representation
Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels (a minimal illustration of this idea follows the list below).
arXiv Detail & Related papers (2020-10-27T16:42:50Z) - Memory-augmented Dense Predictive Coding for Video Representation
Learning [103.69904379356413]
We investigate visual-only self-supervised video representation learning from RGB frames, from unsupervised optical flow, or both.
We propose a new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), for this task.
In all cases, we demonstrate state-of-the-art or comparable performance over other approaches with orders of magnitude less training data.
arXiv Detail & Related papers (2020-08-03T17:57:01Z) - Video Representation Learning by Recognizing Temporal Transformations [37.59322456034611]
We introduce a novel self-supervised learning approach to learn video representations that are responsive to changes in motion dynamics.
We promote an accurate learning of motion without human annotation by training a neural network to discriminate a video sequence from its temporally transformed versions.
Our experiments show that networks trained with the proposed method yield representations with improved transfer performance for action recognition.
arXiv Detail & Related papers (2020-07-21T11:43:01Z) - Unsupervised Learning of Video Representations via Dense Trajectory
Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top-performing objectives in this class - instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)