Motion-aware Self-supervised Video Representation Learning via
Foreground-background Merging
- URL: http://arxiv.org/abs/2109.15130v1
- Date: Thu, 30 Sep 2021 13:45:26 GMT
- Title: Motion-aware Self-supervised Video Representation Learning via
Foreground-background Merging
- Authors: Shuangrui Ding, Maomao Li, Tianyu Yang, Rui Qian, Haohang Xu, Qingyi
Chen, Jue Wang
- Abstract summary: We propose Foreground-background Merging (FAME) to compose the foreground region of the selected video onto the background of others.
We show that FAME can significantly boost the performance in different downstream tasks with various backbones.
- Score: 19.311818681787845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In light of the success of contrastive learning in the image domain, current
self-supervised video representation learning methods usually employ
contrastive loss to facilitate video representation learning. When two augmented views of a video are naively pulled closer, however, the model tends to learn the shared static background as a shortcut and fails to capture the motion information, a phenomenon dubbed background bias. This bias weakens the model's generalization ability and degrades performance on downstream tasks such as action recognition. To alleviate this bias, we propose
Foreground-background Merging (FAME) to deliberately compose the foreground
region of the selected video onto the background of others. Specifically,
without any off-the-shelf detector, we extract the foreground and background
regions via the frame difference and color statistics, and shuffle the
background regions among the videos. By leveraging the semantic consistency
between the original clips and the fused ones, the model focuses more on the
foreground motion pattern and is thus more robust to the background context.
Extensive experiments demonstrate that FAME can significantly boost the
performance in different downstream tasks with various backbones. When
integrated with MoCo, FAME reaches 84.8% and 53.5% accuracy on UCF101 and
HMDB51, respectively, achieving state-of-the-art performance.
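The merging step described in the abstract can be pictured with a short NumPy sketch. This is a minimal illustration under stated assumptions: the mask built from a frame-difference motion cue plus a crude color-statistics cue, the `keep_ratio` threshold, and the `fame_merge` name are illustrative choices, not the paper's released implementation.

```python
import numpy as np

def fame_merge(fg_clip, bg_clip, keep_ratio=0.3):
    """Paste the (estimated) foreground of `fg_clip` onto the background of `bg_clip`.

    Both clips are float arrays of shape (T, H, W, 3) in [0, 1]. The mask below
    combines a frame-difference cue with a simple color cue, standing in for the
    paper's frame-difference and color-statistics recipe.
    """
    # Motion cue: mean absolute difference between consecutive frames.
    motion = np.abs(np.diff(fg_clip, axis=0)).mean(axis=(0, 3))        # (H, W)

    # Color cue: distance of the time-averaged frame from the clip's mean color.
    mean_color = fg_clip.mean(axis=(0, 1, 2))                          # (3,)
    color = np.abs(fg_clip.mean(axis=0) - mean_color).mean(axis=2)     # (H, W)

    # Keep the top `keep_ratio` fraction of pixels as the foreground region.
    score = motion + color
    cutoff = np.quantile(score, 1.0 - keep_ratio)
    mask = (score >= cutoff).astype(fg_clip.dtype)[None, :, :, None]   # (1, H, W, 1)

    # Foreground from one clip, background from the other.
    return mask * fg_clip + (1.0 - mask) * bg_clip
```

In a contrastive setup such as MoCo, the fused clip and the original clip would then be treated as a positive pair, so matching their features requires ignoring the swapped background.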
Related papers
- Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation [15.569467643817447]
We introduce a technique that concurrently learns both foreground and background dynamics by segregating their movements using distinct motion representations.
We train on real-world videos processed with this motion representation.
To further extend video generation to longer sequences without accumulating errors, we adopt a clip-by-clip generation strategy.
arXiv Detail & Related papers (2024-05-26T00:53:26Z)
- Self-Supervised Video Representation Learning with Motion-Contrastive Perception [13.860736711747284]
The proposed Motion-Contrastive Perception Network (MCPNet) consists of two branches: Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z)
- Deep Video Prior for Video Consistency and Propagation [58.250209011891904]
We present a novel and general approach for blind video temporal consistency.
Our method is trained directly on a single pair of original and processed videos rather than on a large dataset.
We show that temporal consistency can be achieved by training a convolutional neural network on a video with Deep Video Prior.
arXiv Detail & Related papers (2022-01-27T16:38:52Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Saliency detection with moving camera via background model completion [0.5076419064097734]
We propose a new framework called saliency detection via background model completion (SDBMC).
It comprises a background modeler and a deep-learning background/foreground segmentation network.
The adopted background/foreground segmenter, although pre-trained on a specific video dataset, can also detect saliency in unseen videos.
arXiv Detail & Related papers (2021-10-30T11:17:58Z)
- End-to-end Multi-modal Video Temporal Grounding [105.36814858748285]
We propose a multi-modal framework to extract complementary information from videos.
We adopt RGB images for appearance, optical flow for motion, and depth maps for image structure.
We conduct experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2021-07-12T17:58:10Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning [105.42550534895828]
Self-supervised learning has shown great potential in improving the video representation ability of deep neural networks.
Some of the current methods tend to cheat from the background, i.e., the prediction depends heavily on the video background instead of the motion.
We propose to remove the background impact by adding the background: given a video, we randomly select a static frame and add it to every other frame to construct a distracting video sample.
We then force the model to pull the feature of the distracting video and the feature of the original video closer, so that the model is explicitly restricted to resist the background influence; a minimal sketch of this construction follows the list.
arXiv Detail & Related papers (2020-09-12T11:25:13Z)
- Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion [86.56202610716504]
Action categories are highly related to the scene where the action happens, so the model tends to degenerate to a solution where only the scene information is encoded.
We propose to decouple the scene and the motion (DSM) with two simple operations, so that the model pays more attention to the motion information.
arXiv Detail & Related papers (2020-09-12T09:54:11Z)
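As referenced in the entry on Removing the Background by Adding the Background above, a minimal sketch of the distracting-clip construction is given below. The `add_static_background` name, the blending weight `alpha`, and the uniform choice of the static frame are assumptions for illustration, not that paper's exact settings.

```python
import numpy as np

def add_static_background(clip, alpha=0.3, rng=None):
    """Blend one randomly chosen frame into every frame of `clip`.

    `clip` is a float array of shape (T, H, W, 3) in [0, 1]. The result is a
    "distracting" view whose static background is over-emphasized.
    """
    rng = np.random.default_rng() if rng is None else rng
    static = clip[rng.integers(clip.shape[0])]               # one frame, (H, W, 3)
    distracted = (1.0 - alpha) * clip + alpha * static[None]  # broadcast over time
    return np.clip(distracted, 0.0, 1.0)
```

Pulling the features of `clip` and the distracted view together then penalizes representations that lean on the duplicated background rather than on the motion.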