MOFO: MOtion FOcused Self-Supervision for Video Understanding
- URL: http://arxiv.org/abs/2308.12447v2
- Date: Wed, 1 Nov 2023 15:30:53 GMT
- Title: MOFO: MOtion FOcused Self-Supervision for Video Understanding
- Authors: Mona Ahmadian, Frank Guerin, and Andrew Gilbert
- Abstract summary: Self-supervised learning techniques have produced outstanding results in learning visual representations from unlabeled videos.
Despite the importance of motion in supervised learning techniques for action recognition, SSL methods often do not explicitly consider motion information in videos.
We propose MOFO, a novel SSL method for focusing representation learning on the motion area of a video, for action recognition.
- Score: 11.641926922266347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning (SSL) techniques have recently produced outstanding
results in learning visual representations from unlabeled videos. Despite the
importance of motion in supervised learning techniques for action recognition,
SSL methods often do not explicitly consider motion information in videos. To
address this issue, we propose MOFO (MOtion FOcused), a novel SSL method for
focusing representation learning on the motion area of a video, for action
recognition. MOFO automatically detects motion areas in videos and uses these
to guide the self-supervision task. We use a masked autoencoder which randomly
masks out a high proportion of the input sequence; a specified percentage of
the masked tokens is forced to come from inside the motion area and the
remainder from outside it. We further incorporate motion information into the
finetuning step to
emphasise motion in the downstream task. We demonstrate that our motion-focused
innovations can significantly boost the performance of the currently leading
SSL method (VideoMAE) for action recognition. Our method improves the recent
self-supervised Vision Transformer (ViT), VideoMAE, by achieving +2.6%, +2.1%,
+1.3% accuracy on Epic-Kitchens verb, noun and action classification,
respectively, and +4.7% accuracy on Something-Something V2 action
classification. Our proposed approach significantly improves the performance of
the current SSL method for action recognition, indicating the importance of
explicitly encoding motion in SSL.
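Example (illustrative only): the sketch below shows one way to build the kind of motion-focused token mask the abstract describes, where a fixed share of the masked positions is drawn from inside a detected motion region and the rest from outside. The motion map, the tube-token layout, and the 75% / 60% values are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def motion_focused_mask(motion_map, mask_ratio=0.75, inside_frac=0.6, rng=None):
    """Build a boolean token mask for an MAE-style video model.

    motion_map  : 1D boolean array over tokens, True where a token overlaps
                  the (automatically detected) motion area.
    mask_ratio  : overall fraction of tokens to mask (VideoMAE-style high ratio).
    inside_frac : fraction of masked tokens forced to lie inside the motion
                  area (illustrative value, not the paper's exact setting).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_tokens = motion_map.shape[0]
    n_mask = int(round(mask_ratio * n_tokens))

    inside = np.flatnonzero(motion_map)
    outside = np.flatnonzero(~motion_map)

    # Split the masking budget between inside and outside the motion area,
    # clipped so we never request more tokens than a region contains.
    n_inside = min(int(round(inside_frac * n_mask)), inside.size)
    n_outside = min(n_mask - n_inside, outside.size)

    mask = np.zeros(n_tokens, dtype=bool)
    mask[rng.choice(inside, size=n_inside, replace=False)] = True
    mask[rng.choice(outside, size=n_outside, replace=False)] = True
    return mask

# Toy usage: 8 x 14 x 14 tube tokens for a 16-frame clip, with the central
# third of the tokens marked as "in motion".
tokens = 8 * 14 * 14
motion = np.zeros(tokens, dtype=bool)
motion[tokens // 3 : 2 * tokens // 3] = True
m = motion_focused_mask(motion, mask_ratio=0.75, inside_frac=0.6)
print(m.sum(), "of", tokens, "tokens masked")
```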
Related papers
- MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition [94.56755080185732]
We propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information.
Our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation.
arXiv Detail & Related papers (2024-05-31T08:06:05Z) - Semi-supervised Active Learning for Video Action Detection [8.110693267550346]
We develop a novel semi-supervised active learning approach which utilizes both labeled as well as unlabeled data.
We evaluate the proposed approach on three different benchmark datasets: UCF-101-24, JHMDB-21, and YouTube-VOS.
arXiv Detail & Related papers (2023-12-12T11:13:17Z) - Improving Unsupervised Video Object Segmentation with Motion-Appearance
Synergy [52.03068246508119]
We present IMAS, a method that segments the primary objects in videos without manual annotation in training or inference.
IMAS achieves Improved UVOS with Motion-Appearance Synergy.
We demonstrate its effectiveness in tuning critical hyperparameters previously tuned with human annotation or hand-crafted, hyperparameter-specific metrics.
arXiv Detail & Related papers (2022-12-17T06:47:30Z) - Self-supervised Video Representation Learning with Motion-Aware Masked
Autoencoders [46.38458873424361]
Masked autoencoders (MAEs) have recently emerged as state-of-the-art self-supervised representation learners.
In this work we present a motion-aware variant -- MotionMAE.
Our model is designed to additionally predict the corresponding motion structure information over time.
arXiv Detail & Related papers (2022-10-09T03:22:15Z) - Motion Sensitive Contrastive Learning for Self-supervised Video
Representation [34.854431881562576]
Motion Sensitive Contrastive Learning (MSCL) injects the motion information captured by optical flows into RGB frames to strengthen feature learning.
Local Motion Contrastive Learning (LMCL) is introduced with frame-level contrastive objectives across the two modalities.
Flow Rotation Augmentation (FRA) generates extra motion-shuffled negative samples, and Motion Differential Sampling (MDS) accurately screens training samples.
arXiv Detail & Related papers (2022-08-12T04:06:56Z) - Self-Supervised Video Representation Learning with Motion-Contrastive
Perception [13.860736711747284]
We propose the Motion-Contrastive Perception Network (MCPNet).
MCPNet consists of two branches, namely, Motion Information Perception (MIP) and Contrastive Instance Perception (CIP)
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z) - Motion-Focused Contrastive Learning of Video Representations [94.93666741396444]
Motion, as the most distinct phenomenon in a video involving change over time, has been unique and critical to the development of video representation learning.
We present a Motion-focused Contrastive Learning (MCL) method that regards the duet of appearance and motion as its foundation.
arXiv Detail & Related papers (2022-01-11T16:15:45Z) - Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, while still suffering from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
arXiv Detail & Related papers (2021-03-23T03:06:26Z) - RSPNet: Relative Speed Perception for Unsupervised Video Representation
Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.