Enhancing Unsupervised Video Representation Learning by Decoupling the
Scene and the Motion
- URL: http://arxiv.org/abs/2009.05757v3
- Date: Wed, 16 Dec 2020 10:45:17 GMT
- Title: Enhancing Unsupervised Video Representation Learning by Decoupling the
Scene and the Motion
- Authors: Jinpeng Wang, Yuting Gao, Ke Li, Jianguo Hu, Xinyang Jiang, Xiaowei
Guo, Rongrong Ji, Xing Sun
- Abstract summary: Action categories are highly correlated with the scenes in which the actions happen, so the model tends to degrade to a solution where only the scene information is encoded.
We propose to decouple the scene and the motion (DSM) with two simple operations, so that the model pays better attention to the motion information.
- Score: 86.56202610716504
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One significant factor we expect video representation learning to
capture, especially in contrast with image representation learning, is object
motion. However, we found that in current mainstream video datasets, some
action categories are highly correlated with the scene in which the action
happens, causing the model to degrade to a solution where only the scene
information is encoded. For example, a trained model may predict a video as
playing football simply because it sees the field, neglecting that the subject
is dancing as a cheerleader on the field. This runs counter to the original
intention of video representation learning and may introduce a non-negligible
scene bias across datasets. To tackle this problem, we propose to decouple the
scene and the motion (DSM) with two simple operations, so that the model pays
better attention to the motion information. Specifically, we construct a
positive clip and a negative clip for each video. Compared to the original
video, the positive clip is motion-untouched but scene-broken, and the negative
clip is motion-broken but scene-untouched, obtained via Spatial Local
Disturbance and Temporal Local Disturbance, respectively. Our objective is to
pull the positive clip closer to, and push the negative clip farther from, the
original clip in the latent space. In this way, the impact of the scene is
weakened while the temporal sensitivity of the network is further enhanced. We
conduct experiments on two tasks with various backbones and different
pre-training datasets, and find that our method surpasses the SOTA methods with
remarkable improvements of 8.1% and 8.8% on the action recognition task on the
UCF101 and HMDB51 datasets, respectively, using the same backbone.
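The objective stated in the abstract is a contrastive one. Below is a minimal sketch of such a loss, assuming a simple two-way InfoNCE form with cosine similarity; the function name dsm_contrastive_loss and the temperature value are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def dsm_contrastive_loss(z_orig, z_pos, z_neg, temperature=0.1):
    """Pull the scene-broken (motion-preserving) positive toward the original
    clip and push the motion-broken (scene-preserving) negative away in the
    latent space. A two-way InfoNCE-style stand-in for the paper's objective.

    z_orig, z_pos, z_neg: (B, D) clip embeddings from the video backbone.
    """
    z_orig = F.normalize(z_orig, dim=-1)
    z_pos = F.normalize(z_pos, dim=-1)
    z_neg = F.normalize(z_neg, dim=-1)

    sim_pos = (z_orig * z_pos).sum(dim=-1) / temperature  # similarity to positive
    sim_neg = (z_orig * z_neg).sum(dim=-1) / temperature  # similarity to negative

    logits = torch.stack([sim_pos, sim_neg], dim=1)        # (B, 2)
    labels = torch.zeros(logits.size(0), dtype=torch.long,
                         device=logits.device)             # index 0 = positive
    return F.cross_entropy(logits, labels)
```

A full training setup would typically also treat clips from other videos in the batch as additional negatives; the sketch keeps only the constructed positive/negative pair to show how the scene-broken clip is pulled in while the motion-broken clip is pushed away.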
Related papers
- DEVIAS: Learning Disentangled Video Representations of Action and Scene [3.336126457178601]
Video recognition models often learn scene-biased action representation due to the spurious correlation between actions and scenes in the training data.
We propose a disentangling encoder-decoder architecture to learn disentangled action and scene representations with a single model.
We rigorously validate the proposed method on the UCF-101, Kinetics-400, and HVU datasets for seen action-scene combinations, and on the SCUBA, HAT, and HVU datasets for unseen action-scene combination scenarios.
arXiv Detail & Related papers (2023-11-30T18:58:44Z)
- HomE: Homography-Equivariant Video Representation Learning [62.89516761473129]
We propose a novel method for representation learning of multi-view videos.
Our method learns an implicit mapping between different views, culminating in a representation space that maintains the homography relationship between neighboring views.
On action classification, our method obtains 96.4% 3-fold accuracy on the UCF101 dataset, better than most state-of-the-art self-supervised learning methods.
arXiv Detail & Related papers (2023-06-02T15:37:43Z)
- Scene Consistency Representation Learning for Video Scene Segmentation [26.790491577584366]
We propose an effective Self-Supervised Learning (SSL) framework to learn better shot representations from long-term videos.
We present an SSL scheme to achieve scene consistency, while exploring considerable data augmentation and shuffling methods to boost the model generalizability.
Our method achieves state-of-the-art performance on the task of video scene segmentation.
arXiv Detail & Related papers (2022-05-11T13:31:15Z)
- Motion-aware Self-supervised Video Representation Learning via Foreground-background Merging [19.311818681787845]
We propose Foreground-background Merging (FAME) to compose the foreground region of the selected video onto the background of others.
We show that FAME can significantly boost the performance in different downstream tasks with various backbones.
arXiv Detail & Related papers (2021-09-30T13:45:26Z)
- JOKR: Joint Keypoint Representation for Unsupervised Cross-Domain Motion Retargeting [53.28477676794658]
Unsupervised motion retargeting in videos has seen substantial advancements through the use of deep neural networks.
We introduce JOKR - a JOint Keypoint Representation that handles both the source and target videos, without requiring any object prior or data collection.
We evaluate our method both qualitatively and quantitatively, and demonstrate that our method handles various cross-domain scenarios, such as different animals, different flowers, and humans.
arXiv Detail & Related papers (2021-06-17T17:32:32Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- VideoMix: Rethinking Data Augmentation for Video Classification [29.923635550986997]
State-of-the-art video action classifiers often suffer from overfitting.
Recent data augmentation strategies have been reported to address the overfitting problems.
VideoMix lets a model learn beyond the object and scene biases and extract more robust cues for action recognition.
arXiv Detail & Related papers (2020-12-07T05:40:33Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
- Toward Accurate Person-level Action Recognition in Videos of Crowded Scenes [131.9067467127761]
We focus on improving action recognition by fully utilizing scene information and collecting new data.
Specifically, we adopt a strong human detector to detect the spatial location of each person in each frame.
We then apply action recognition models to learn the temporal information from video frames, on both the HIE dataset and new data with diverse scenes collected from the internet.
arXiv Detail & Related papers (2020-10-16T13:08:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.