MoQuad: Motion-focused Quadruple Construction for Video Contrastive Learning
- URL: http://arxiv.org/abs/2212.10870v1
- Date: Wed, 21 Dec 2022 09:26:40 GMT
- Title: MoQuad: Motion-focused Quadruple Construction for Video Contrastive Learning
- Authors: Yuan Liu, Jiacheng Chen, Hao Wu
- Abstract summary: This paper presents a simple yet effective sample construction strategy to boost the learning of motion features in video contrastive learning.
The proposed method, dubbed Motion-focused Quadruple Construction (MoQuad), augments the instance discrimination by meticulously disturbing the appearance and motion of both the positive and negative samples.
By simply applying MoQuad to SimCLR, extensive experiments show that we achieve superior performance on downstream tasks compared to the state of the art.
- Score: 10.41936704731324
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning effective motion features is an essential pursuit of video
representation learning. This paper presents a simple yet effective sample
construction strategy to boost the learning of motion features in video
contrastive learning. The proposed method, dubbed Motion-focused Quadruple
Construction (MoQuad), augments the instance discrimination by meticulously
disturbing the appearance and motion of both the positive and negative samples
to create a quadruple for each video instance, such that the model is
encouraged to exploit motion information. Unlike recent approaches that create
extra auxiliary tasks for learning motion features or apply explicit temporal
modelling, our method keeps the simple and clean contrastive learning paradigm
(i.e., SimCLR) without multi-task learning or extra modelling. In addition, we
design two extra training strategies by analyzing initial MoQuad experiments.
By simply applying MoQuad to SimCLR, extensive experiments show that we achieve
superior performance on downstream tasks compared to the state of the art.
Notably, on the UCF-101 action recognition task, we achieve 93.7% accuracy
after pre-training the model on Kinetics-400 for only 200 epochs, surpassing
various previous methods.
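The abstract stays at a high level, so the following is only a rough PyTorch sketch of how a motion-focused quadruple might be built: the positive keeps the motion but disturbs the appearance, while the negatives break the motion. The specific augmentations (colour jitter, frame shuffling) and the loss layout are assumptions, not the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def appearance_disturb(clip):
    """Keep motion, perturb appearance: per-channel colour jitter applied to
    the whole clip. `clip` is (C, T, H, W); jitter strength is an assumption."""
    scale = 1.0 + 0.4 * (torch.rand(clip.size(0), 1, 1, 1) - 0.5)
    shift = 0.2 * (torch.rand(clip.size(0), 1, 1, 1) - 0.5)
    return (clip * scale + shift).clamp(0.0, 1.0)

def motion_disturb(clip):
    """Keep appearance, break motion: shuffle the frame order."""
    return clip[:, torch.randperm(clip.size(1))]

def build_quadruple(clip):
    """One plausible quadruple per video: anchor, appearance-disturbed
    positive (motion preserved), and two motion-disturbed negatives."""
    return (clip,
            appearance_disturb(clip),                  # positive: same motion
            motion_disturb(clip),                      # negative: same appearance
            appearance_disturb(motion_disturb(clip)))  # negative: both disturbed

def quadruple_nce(z_a, z_p, z_n1, z_n2, tau=0.1):
    """SimCLR-style InfoNCE restricted to the quadruple (class 0 = positive);
    in-batch negatives from other videos are omitted for brevity."""
    z_a, z_p, z_n1, z_n2 = (F.normalize(z, dim=-1) for z in (z_a, z_p, z_n1, z_n2))
    logits = torch.stack([(z_a * z).sum(-1) for z in (z_p, z_n1, z_n2)], dim=-1) / tau
    return F.cross_entropy(logits, torch.zeros(z_a.size(0), dtype=torch.long))
```

In a full SimCLR-style setup the four clips would be encoded by a shared backbone, and the negatives of other videos in the batch would be appended to the logits.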
Related papers
- ProMotion: Prototypes As Motion Learners [46.08051377180652]
We introduce ProMotion, a unified prototypical framework engineered to model fundamental motion tasks.
ProMotion offers a range of compelling attributes that set it apart from current task-specific paradigms.
We capitalize on a dual mechanism involving the feature denoiser and the prototypical learner to decipher the intricacies of motion.
arXiv Detail & Related papers (2024-06-07T15:10:33Z)
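As a loose illustration of the prototype idea in the ProMotion entry above, the sketch below lets a set of learnable prototypes cross-attend to frame features. The sizes, the single attention layer, and the omission of the feature denoiser are all assumptions, not ProMotion's actual architecture.

```python
import torch
import torch.nn as nn

class PrototypicalLearner(nn.Module):
    """Learnable motion prototypes gather evidence from frame features via
    cross-attention. Hypothetical stand-in, not ProMotion's design."""
    def __init__(self, num_prototypes=16, dim=256, heads=4):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):                   # feats: (B, N, dim)
        q = self.prototypes.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.attn(q, feats, feats)     # prototypes attend to features
        return out                              # (B, num_prototypes, dim)
```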
- Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
Later, a decoder aggregates the features from all the clips in an online fashion for the final class prediction.
arXiv Detail & Related papers (2023-12-11T18:31:13Z)
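A minimal sketch of the clip-then-aggregate design described above, assuming a GRU as the online aggregator (the paper's decoder and prototype matching are not reproduced here):

```python
import torch
import torch.nn as nn

class OnlineActionPredictor(nn.Module):
    """Per-clip encoder plus a recurrent aggregator that updates the class
    prediction as clips stream in, enabling early recognition."""
    def __init__(self, clip_dim=512, num_classes=101):
        super().__init__()
        # Stand-in visual encoder; any clip backbone producing (B, clip_dim) works.
        self.encoder = nn.Sequential(nn.Flatten(1), nn.LazyLinear(clip_dim))
        self.gru = nn.GRUCell(clip_dim, clip_dim)
        self.head = nn.Linear(clip_dim, num_classes)

    def forward(self, clips):                   # clips: list of (B, C, T, H, W)
        h, preds = None, []
        for clip in clips:                      # online: one clip at a time
            f = self.encoder(clip)
            h = self.gru(f, h)
            preds.append(self.head(h))          # early prediction after each clip
        return preds
```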
- VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation [87.13210748484217]
VideoCutLER is a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos.
We show the first competitive unsupervised learning results on the challenging YouTubeVIS 2019 benchmark, achieving 50.7% video AP50.
VideoCutLER can also serve as a strong pretrained model for supervised video instance segmentation tasks, exceeding DINO by 15.9% on YouTubeVIS 2019 in terms of video AP.
arXiv Detail & Related papers (2023-08-28T17:10:12Z)
- Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning [14.292812802621707]
Contrastive Language-Image Pre-training (CLIP) has recently shown remarkable "zero-shot" generalization.
We explore the adaptation of CLIP to achieve a more efficient and generalized action recognition method.
Our method outperforms most existing state-of-the-art methods by a significant margin in "few-shot" and "zero-shot" settings.
arXiv Detail & Related papers (2023-08-09T09:33:45Z)
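The entry above describes prompt learning on a frozen CLIP model; a minimal sketch of that general pattern follows. The `text_encoder` placeholder, the prompt length, and the prepend-to-token-embeddings wiring are assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class MotionPromptedText(nn.Module):
    """Prompt learning sketch: a few learnable "motion prompt" tokens are
    prepended to class-name token embeddings before a frozen CLIP-style
    text encoder. `text_encoder` is any module mapping token embeddings
    to a text feature."""
    def __init__(self, text_encoder: nn.Module, embed_dim=512, n_prompts=8):
        super().__init__()
        self.text_encoder = text_encoder.eval()
        for p in self.text_encoder.parameters():
            p.requires_grad = False             # CLIP stays frozen
        self.prompts = nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02)

    def forward(self, class_token_embs):        # (num_classes, L, embed_dim)
        n = class_token_embs.size(0)
        p = self.prompts.unsqueeze(0).expand(n, -1, -1)
        return self.text_encoder(torch.cat([p, class_token_embs], dim=1))
```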
- Temporal Contrastive Learning with Curriculum [19.442685015494316]
ConCur is a contrastive video representation learning method that uses curriculum learning to impose a dynamic sampling strategy.
We conduct experiments on two popular action recognition datasets, UCF101 and HMDB51, on which our proposed method achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-09-02T00:12:05Z)
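A small sketch of what a curriculum over temporal sampling could look like: the allowed gap between two positive clips grows with training progress. The linear schedule is an assumption, not ConCur's exact rule.

```python
import random

def sample_positive_pair(num_frames, clip_len, progress):
    """Curriculum sampling sketch: `progress` in [0, 1] widens the maximum
    temporal gap between the two positive clips as training proceeds.
    Assumes num_frames >= 2 * clip_len. Returns two (start, end) ranges."""
    max_gap = int(progress * (num_frames - 2 * clip_len))
    start_a = random.randint(0, num_frames - 2 * clip_len - max_gap)
    gap = random.randint(0, max_gap)
    start_b = start_a + clip_len + gap
    return (start_a, start_a + clip_len), (start_b, start_b + clip_len)
```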
- Motion Sensitive Contrastive Learning for Self-supervised Video Representation [34.854431881562576]
Motion Sensitive Contrastive Learning (MSCL) injects the motion information captured by optical flows into RGB frames to strengthen feature learning.
It further introduces Local Motion Contrastive Learning (LMCL), which applies frame-level contrastive objectives across the two modalities.
It also proposes Flow Rotation Augmentation (FRA) to generate extra motion-shuffled negative samples and Motion Differential Sampling (MDS) to accurately screen training samples.
arXiv Detail & Related papers (2022-08-12T04:06:56Z)
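Reading Flow Rotation Augmentation literally as a rotation of the flow field, here is a sketch of how such a motion-shuffled negative could be produced; the actual FRA operation may differ.

```python
import math
import torch

def flow_rotation_augment(flow, angle_deg):
    """Rotate the 2-D motion vectors at every pixel so the flow no longer
    matches the RGB clip, yielding a negative sample whose appearance
    statistics stay intact. `flow` is (2, T, H, W) with (u, v) channels."""
    theta = math.radians(angle_deg)
    cos, sin = math.cos(theta), math.sin(theta)
    u, v = flow[0], flow[1]
    return torch.stack([cos * u - sin * v, sin * u + cos * v], dim=0)
```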
- Motion-Focused Contrastive Learning of Video Representations [94.93666741396444]
Motion, as the most distinct phenomenon in a video, involving change over time, has been unique and critical to the development of video representation learning.
We present a Motion-focused Contrastive Learning (MCL) method that takes this duet, exploiting motion both for data augmentation and for feature learning, as its foundation.
arXiv Detail & Related papers (2022-01-11T16:15:45Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
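The relative-speed pretext task above is concrete enough to sketch: sample two clips at different playback speeds (temporal strides) and classify their relative speed. The 3-way head and feature concatenation are assumptions, not RSPNet's exact architecture.

```python
import random
import torch
import torch.nn as nn

def sample_clip(video, clip_len, speed):
    """Take `clip_len` frames at a given playback speed (temporal stride).
    `video` is (C, T, H, W); assumes T >= clip_len * speed."""
    needed = clip_len * speed
    start = random.randint(0, video.size(1) - needed)
    return video[:, start:start + needed:speed]     # (C, clip_len, H, W)

class RelativeSpeedHead(nn.Module):
    """Classify whether clip 1 is slower than, equal to, or faster than
    clip 2, given their backbone features."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.cls = nn.Linear(2 * feat_dim, 3)       # slower / same / faster

    def forward(self, f1, f2):                      # (B, feat_dim) each
        return self.cls(torch.cat([f1, f2], dim=-1))
```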
- Hierarchical Contrastive Motion Learning for Video Action Recognition [100.9807616796383]
We present hierarchical contrastive motion learning, a new self-supervised learning framework to extract effective motion representations from raw video frames.
Our approach progressively learns a hierarchy of motion features that correspond to different abstraction levels in a network.
Our motion learning module is lightweight and can be flexibly embedded into various backbone networks.
arXiv Detail & Related papers (2020-07-20T17:59:22Z)
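One plausible reading of the hierarchy described above, sketched below: each backbone stage gets its own projection head, and a contrastive loss ties each level to detached targets from the level beneath it. The shared embedding size and the adjacent-level pairing are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalMotionHeads(nn.Module):
    """Per-level projection heads with a progressive contrastive loss:
    level l (frozen targets) supervises level l+1, using in-batch negatives."""
    def __init__(self, stage_dims=(64, 128, 256, 512), embed_dim=128):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, embed_dim) for d in stage_dims)

    def forward(self, stage_feats):             # list of (B, d_l) pooled features
        zs = [F.normalize(p(f), dim=-1) for p, f in zip(self.proj, stage_feats)]
        loss = 0.0
        for lo, hi in zip(zs[:-1], zs[1:]):
            logits = hi @ lo.detach().t() / 0.1  # (B, B) similarity matrix
            loss = loss + F.cross_entropy(logits, torch.arange(hi.size(0)))
        return loss / (len(zs) - 1)
```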
- Delving into 3D Action Anticipation from Streaming Videos [99.0155538452263]
Action anticipation aims to recognize an action from a partial observation.
We introduce several complementary evaluation metrics and present a basic model based on frame-wise action classification.
We also explore multi-task learning strategies by incorporating auxiliary information from two aspects: the full action representation and the class-agnostic action label.
arXiv Detail & Related papers (2019-06-15T10:30:29Z)
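A minimal sketch of the frame-wise classification baseline mentioned above: classify each observed frame independently and average the logits, so a prediction is available at any point of the stream. The pose-style input dimension is an assumption.

```python
import torch
import torch.nn as nn

class FramewiseAnticipation(nn.Module):
    """Frame-wise baseline: per-frame classifier over streaming 3-D pose
    features; logits are averaged over the frames observed so far."""
    def __init__(self, in_dim=75, num_classes=60):  # e.g., 25 joints x 3 coords
        super().__init__()
        self.frame_cls = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, num_classes))

    def forward(self, frames):                      # (B, T_observed, in_dim)
        logits = self.frame_cls(frames)             # (B, T, num_classes)
        return logits.mean(dim=1)                   # anticipate from partial video
```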
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.