Motion Sensitive Contrastive Learning for Self-supervised Video
Representation
- URL: http://arxiv.org/abs/2208.06105v1
- Date: Fri, 12 Aug 2022 04:06:56 GMT
- Title: Motion Sensitive Contrastive Learning for Self-supervised Video
Representation
- Authors: Jingcheng Ni, Nan Zhou, Jie Qin, Qian Wu, Junqi Liu, Boxun Li, Di
Huang
- Abstract summary: Motion Sensitive Contrastive Learning (MSCL) injects the motion information captured by optical flows into RGB frames to strengthen feature learning.
It develops Local Motion Contrastive Learning (LMCL) with frame-level contrastive objectives across the two modalities.
It also introduces Flow Rotation Augmentation (FRA) to generate extra motion-shuffled negative samples and Motion Differential Sampling (MDS) to accurately screen training samples.
- Score: 34.854431881562576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive learning has shown great potential in video representation
learning. However, existing approaches fail to sufficiently exploit short-term
motion dynamics, which are crucial to various downstream video understanding
tasks. In this paper, we propose Motion Sensitive Contrastive Learning (MSCL)
that injects the motion information captured by optical flows into RGB frames
to strengthen feature learning. To achieve this, in addition to clip-level
global contrastive learning, we develop Local Motion Contrastive Learning
(LMCL) with frame-level contrastive objectives across the two modalities.
Moreover, we introduce Flow Rotation Augmentation (FRA) to generate extra
motion-shuffled negative samples and Motion Differential Sampling (MDS) to
accurately screen training samples. Extensive experiments on standard
benchmarks validate the effectiveness of the proposed method. With the
commonly-used 3D ResNet-18 as the backbone, we achieve top-1 accuracies of
91.5% on UCF101 and 50.3% on Something-Something v2 for video classification,
and 65.6% top-1 recall on UCF101 for video retrieval, notably improving the
state-of-the-art.
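To make the two key components more concrete, the sketch below shows one way a frame-level cross-modal contrastive objective in the spirit of LMCL, and a flow-rotation negative in the spirit of FRA, might look in PyTorch. This is a minimal sketch under my own assumptions (feature shapes, temperature, rotation convention); the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def frame_level_infonce(rgb_feats, flow_feats, temp=0.07):
    """Cross-modal frame-level InfoNCE in the spirit of LMCL.

    rgb_feats, flow_feats: [T, D] per-frame embeddings of the same clip
    from the RGB and optical-flow streams. Matching time indices are
    treated as positives; all other frames act as negatives.
    """
    rgb = F.normalize(rgb_feats, dim=1)
    flow = F.normalize(flow_feats, dim=1)
    logits = rgb @ flow.t() / temp          # [T, T] similarity matrix
    targets = torch.arange(rgb.size(0))     # frame t matches frame t
    # Symmetric loss: RGB->flow and flow->RGB directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def rotate_flow(flow, k):
    """Flow-rotation negative in the spirit of FRA (assumed realization).

    flow: [2, H, W] with channels (u, v). Rotates the field spatially by
    k * 90 degrees and rotates the vectors accordingly, yielding a
    plausible flow field that no longer matches the RGB frames.
    Sign conventions depend on the coordinate frame; this is one choice.
    """
    rot = torch.rot90(flow, k=k, dims=(1, 2))
    u, v = rot[0], rot[1]
    for _ in range(k % 4):
        u, v = -v, u                        # rotate each (u, v) by 90 deg
    return torch.stack([u, v], dim=0)
```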
Related papers
- Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning [14.292812802621707]
Contrastive Language-Image Pre-training (CLIP) has recently shown remarkable generalization in "zero-shot" settings.
We explore the adaptation of CLIP to achieve a more efficient and generalized action recognition method.
Our method outperforms most existing state-of-the-art methods by a significant margin in "few-shot" and "zero-shot" settings.
arXiv Detail & Related papers (2023-08-09T09:33:45Z)
- MoQuad: Motion-focused Quadruple Construction for Video Contrastive Learning [10.41936704731324]
This paper presents a simple yet effective sample construction strategy to boost the learning of motion features in video contrastive learning.
The proposed method, dubbed Motion-focused Quadruple Construction (MoQuad), augments the instance discrimination by meticulously disturbing the appearance and motion of both the positive and negative samples.
Extensive experiments show that simply applying MoQuad to SimCLR achieves superior performance on downstream tasks compared to the state of the art.
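As a loose illustration (not MoQuad's exact recipe; the disturbance operations and quadruple layout below are my assumptions), one way to build appearance- and motion-disturbed variants of a clip:

```python
import torch

def motion_disturbed(clip):
    """Shuffle frame order: appearance is kept, motion is broken.

    clip: [T, C, H, W]. A hypothetical stand-in for a
    motion-disturbance operation.
    """
    perm = torch.randperm(clip.size(0))
    return clip[perm]

def appearance_disturbed(clip, noise_std=0.1):
    """Perturb pixel statistics while keeping frame order (motion) intact.

    A hypothetical stand-in for an appearance disturbance; real
    implementations would likely use color jitter or similar.
    """
    return clip + noise_std * torch.randn_like(clip)

def build_quadruple(clip):
    # Anchor, an appearance-disturbed positive that keeps motion, and two
    # motion-disturbed negatives -- one possible reading of "quadruple".
    return (clip,
            appearance_disturbed(clip),
            motion_disturbed(clip),
            appearance_disturbed(motion_disturbed(clip)))
```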
arXiv Detail & Related papers (2022-12-21T09:26:40Z)
- Improving Unsupervised Video Object Segmentation with Motion-Appearance Synergy [52.03068246508119]
We present IMAS, a method that segments the primary objects in videos without manual annotation in training or inference.
IMAS achieves improved unsupervised video object segmentation (UVOS) through motion-appearance synergy.
We demonstrate its effectiveness in tuning critical hyperparameters previously tuned with human annotation or hand-crafted, hyperparameter-specific metrics.
arXiv Detail & Related papers (2022-12-17T06:47:30Z)
- Self-Supervised Video Representation Learning with Motion-Contrastive Perception [13.860736711747284]
We propose the Motion-Contrastive Perception Network (MCPNet).
MCPNet consists of two branches, namely Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z)
- Motion-Focused Contrastive Learning of Video Representations [94.93666741396444]
Motion, as the most distinctive time-varying phenomenon in a video, has been unique and critical to the development of video representation learning.
We present a Motion-focused Contrastive Learning (MCL) method that regards motion as its foundation.
arXiv Detail & Related papers (2022-01-11T16:15:45Z)
- Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting [2.2530496464901106]
"Video Cross-Stream Prototypical Contrasting" is a novel method which predicts consistent prototype assignments from both RGB and optical flow views.
We obtain state-of-the-art results on nearest neighbour video retrieval and action recognition.
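A minimal sketch of what cross-stream prototype prediction could look like, in a SwAV-style swapped-prediction form; the code-assignment procedure (real methods often use Sinkhorn normalization), shapes, and temperature are assumptions here:

```python
import torch
import torch.nn.functional as F

def cross_stream_swap_loss(z_rgb, z_flow, prototypes, temp=0.1):
    """Swapped prototype prediction across RGB and flow streams.

    z_rgb, z_flow: [N, D] L2-normalized clip embeddings from the two
    streams; prototypes: [K, D] L2-normalized prototype vectors.
    Each stream is trained to predict the soft prototype assignment
    computed from the other stream (targets are detached).
    """
    s_rgb = z_rgb @ prototypes.t() / temp    # [N, K] scores
    s_flow = z_flow @ prototypes.t() / temp
    with torch.no_grad():
        q_rgb = F.softmax(s_rgb, dim=1)      # simplistic soft codes;
        q_flow = F.softmax(s_flow, dim=1)    # Sinkhorn omitted for brevity
    loss_rgb = -(q_flow * F.log_softmax(s_rgb, dim=1)).sum(dim=1).mean()
    loss_flow = -(q_rgb * F.log_softmax(s_flow, dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_rgb + loss_flow)
```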
arXiv Detail & Related papers (2021-06-18T13:57:51Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features, since the same 2D convolution kernel is applied to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, but still suffer from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
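The channel-gating idea behind CME can be pictured roughly as below; the use of temporal feature differences and pooling to form the gate is my assumption of one natural realization, not the paper's exact design:

```python
import torch
import torch.nn as nn

class ChannelMotionGate(nn.Module):
    """Rough sketch of a channel-wise motion-enhancement gate (CME-like).

    Builds a per-channel gate from temporal feature differences and
    rescales the channels, emphasizing those tied to dynamics.
    """

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [N, T, C, H, W] frame-wise feature maps.
        diff = x[:, 1:] - x[:, :-1]               # temporal differences
        motion = diff.abs().mean(dim=(1, 3, 4))   # [N, C] motion summary
        gate = self.fc(motion)                    # [N, C] gate in (0, 1)
        return x * gate[:, None, :, None, None]   # rescale the channels
```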
arXiv Detail & Related papers (2021-03-23T03:06:26Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
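A hedged sketch of how relative-speed labels between two clips might be formed (the sampling scheme and label encoding are my assumptions; RSPNet's actual pretext head may differ):

```python
import random
import torch

def sample_clip(video: torch.Tensor, speed: int, length: int = 16):
    """Subsample a clip at a given playback speed.

    video: [T, C, H, W]; takes every `speed`-th frame. Assumes the
    video is long enough for the requested speed and length.
    """
    max_start = video.size(0) - speed * length
    start = random.randint(0, max_start)
    return video[start : start + speed * length : speed]

def relative_speed_pair(video: torch.Tensor, speeds=(1, 2, 4)):
    """Build two clips plus a relative-speed label.

    Label encodes whether clip A is slower than, the same speed as,
    or faster than clip B (0 / 1 / 2).
    """
    s_a, s_b = random.choice(speeds), random.choice(speeds)
    clip_a, clip_b = sample_clip(video, s_a), sample_clip(video, s_b)
    label = 0 if s_a < s_b else (1 if s_a == s_b else 2)
    return clip_a, clip_b, label
```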
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
- Hierarchical Contrastive Motion Learning for Video Action Recognition [100.9807616796383]
We present hierarchical contrastive motion learning, a new self-supervised learning framework to extract effective motion representations from raw video frames.
Our approach progressively learns a hierarchy of motion features that correspond to different abstraction levels in a network.
Our motion learning module is lightweight and can be flexibly embedded into various backbone networks.
arXiv Detail & Related papers (2020-07-20T17:59:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.