Removing the Background by Adding the Background: Towards Background
Robust Self-supervised Video Representation Learning
- URL: http://arxiv.org/abs/2009.05769v4
- Date: Thu, 22 Apr 2021 03:37:30 GMT
- Title: Removing the Background by Adding the Background: Towards Background
Robust Self-supervised Video Representation Learning
- Authors: Jinpeng Wang, Yuting Gao, Ke Li, Yiqi Lin, Andy J. Ma, Hao Cheng, Pai
Peng, Feiyue Huang, Rongrong Ji, Xing Sun
- Abstract summary: Self-supervised learning has shown great potential in improving the video representation ability of deep neural networks.
Some of the current methods tend to cheat from the background, i.e., the prediction is highly dependent on the video background instead of the motion.
We propose to remove the background impact by adding the background. That is, given a video, we randomly select a static frame and add it to every other frame to construct a distracting video sample.
Then we force the model to pull the feature of the distracting video and the feature of the original video closer, so that the model is explicitly restricted to resist the background influence, focusing more on the motion changes.
- Score: 105.42550534895828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning has shown great potential in improving the video
representation ability of deep neural networks by getting supervision from the
data itself. However, some of the current methods tend to cheat from the
background, i.e., the prediction is highly dependent on the video background
instead of the motion, making the model vulnerable to background changes. To
mitigate the model reliance towards the background, we propose to remove the
background impact by adding the background. That is, given a video, we randomly
select a static frame and add it to every other frame to construct a
distracting video sample. Then we force the model to pull the feature of the
distracting video and the feature of the original video closer, so that the
model is explicitly restricted to resist the background influence, focusing
more on the motion changes. We term our method \emph{Background Erasing} (BE).
Notably, our method is simple to implement and can be added to most SOTA
methods with little effort.
Specifically, BE brings 16.4% and 19.1% improvements with MoCo on the severely
biased datasets UCF101 and HMDB51, and 14.5% improvement on the less biased
dataset Diving48.
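For readers who want to see the core idea in code, the sketch below is a minimal, hypothetical PyTorch-style rendering of the BE augmentation and the feature-pulling objective described in the abstract. The blending weight alpha, the function names, and the cosine-distance pull term are illustrative assumptions; the paper's actual implementation (e.g., how BE is combined with MoCo's contrastive loss) may differ.

    import torch
    import torch.nn.functional as F

    def background_erasing(video, alpha=0.3):
        # video: (C, T, H, W) clip; alpha is a hypothetical blending weight.
        c, t, h, w = video.shape
        idx = torch.randint(0, t, (1,)).item()         # pick one random static frame
        static = video[:, idx:idx + 1]                 # (C, 1, H, W), broadcasts over T
        return (1.0 - alpha) * video + alpha * static  # distracting video sample

    def pull_loss(encoder, video):
        # Pull the feature of the distracted clip and the original clip closer.
        distracted = background_erasing(video)
        z_orig = F.normalize(encoder(video.unsqueeze(0)), dim=-1)       # (1, D)
        z_dist = F.normalize(encoder(distracted.unsqueeze(0)), dim=-1)  # (1, D)
        return 1.0 - (z_orig * z_dist).sum(dim=-1).mean()               # cosine-distance pull term

The encoder here is any video backbone mapping a (B, C, T, H, W) clip to a (B, D) feature; in practice this pull term would be added on top of an existing self-supervised objective such as MoCo.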
Related papers
- Harvest Video Foundation Models via Efficient Post-Pretraining [67.30842563833185]
We propose an efficient framework to harvest video foundation models from image ones.
Our method is intuitively simple: we randomly drop input video patches and mask out input text during the post-pretraining procedure.
Our method achieves state-of-the-art performance, comparable to that of some heavily pretrained video foundation models.
arXiv Detail & Related papers (2023-10-30T14:06:16Z) - Time Does Tell: Self-Supervised Time-Tuning of Dense Image
Representations [79.87044240860466]
We propose a novel approach that incorporates temporal consistency in dense self-supervised learning.
Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos.
Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images.
arXiv Detail & Related papers (2023-08-22T21:28:58Z) - Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models [52.93036326078229]
Off-the-shelf billion-scale datasets for image generation are available, but collecting similar video data of the same scale is still challenging.
In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task.
Our model, Preserve Your Own Correlation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks.
arXiv Detail & Related papers (2023-05-17T17:59:16Z) - Saliency-aware Stereoscopic Video Retargeting [4.332879001008757]
This paper proposes an unsupervised deep learning-based stereo video network.
Our model first detects the salient objects, then shifts and warps all objects so that the distortion of the salient parts of the stereo frames is minimized.
To train the network, we use an attention mechanism to fuse the left and right views and feed the retargeted frames to a reconstruction module that maps the retargeted frames back to the parallax input frames.
arXiv Detail & Related papers (2023-04-18T09:38:33Z) - CLAD: A Contrastive Learning based Approach for Background Debiasing [43.0296255565593]
We introduce a contrastive learning-based approach to mitigate the background bias in CNNs.
We achieve state-of-the-art results on the Background Challenge dataset, outperforming the previous benchmark by a margin of 4.1%.
arXiv Detail & Related papers (2022-10-06T08:33:23Z) - Saliency detection with moving camera via background model completion [0.5076419064097734]
We propose a new framework called saliency detection via background model completion (SDBMC).
It comprises a background modeler and a deep learning background/foreground segmentation network.
The adopted background/foreground segmenter, although pre-trained on a specific video dataset, can also detect saliency in unseen videos.
arXiv Detail & Related papers (2021-10-30T11:17:58Z) - Motion-aware Self-supervised Video Representation Learning via
Foreground-background Merging [19.311818681787845]
We propose Foreground-background Merging (FAME) to compose the foreground region of the selected video onto the background of others.
We show that FAME can significantly boost the performance in different downstream tasks with various backbones.
arXiv Detail & Related papers (2021-09-30T13:45:26Z) - Enhancing Unsupervised Video Representation Learning by Decoupling the
Scene and the Motion [86.56202610716504]
Action categories are highly related to the scene where the action happens, making the model tend to degenerate to a solution where only the scene information is encoded.
We propose to decouple the scene and the motion (DSM) with two simple operations, so that the model pays better attention to the motion information.
arXiv Detail & Related papers (2020-09-12T09:54:11Z) - TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting [107.39743751292028]
TransMoMo is capable of realistically transferring the motion of a person in a source video to another video of a target person.
We exploit invariance properties of three factors of variation including motion, structure, and view-angle.
We demonstrate the effectiveness of our method over the state-of-the-art methods.
arXiv Detail & Related papers (2020-03-31T17:49:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.