Self-supervised Video Representation Learning by Context and Motion
Decoupling
- URL: http://arxiv.org/abs/2104.00862v1
- Date: Fri, 2 Apr 2021 02:47:34 GMT
- Title: Self-supervised Video Representation Learning by Context and Motion
Decoupling
- Authors: Lianghua Huang, Yu Liu, Bin Wang, Pan Pan, Yinghui Xu, Rong Jin
- Abstract summary: A challenge in self-supervised video representation learning is how to effectively capture motion information besides context bias.
We develop a method that explicitly decouples motion supervision from context bias through a carefully designed pretext task.
Experiments show that our approach improves the quality of the learned video representation over previous works.
- Score: 45.510042484456854
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A key challenge in self-supervised video representation learning is how to
effectively capture motion information besides context bias. While most
existing works implicitly achieve this with video-specific pretext tasks (e.g.,
predicting clip orders, time arrows, and paces), we develop a method that
explicitly decouples motion supervision from context bias through a carefully
designed pretext task. Specifically, we take the keyframes and motion vectors
in compressed videos (e.g., in H.264 format) as the supervision sources for
context and motion, respectively, which can be efficiently extracted at over
500 fps on the CPU. Then we design two pretext tasks that are jointly
optimized: a context matching task where a pairwise contrastive loss is cast
between video clip and keyframe features; and a motion prediction task where
clip features, passed through an encoder-decoder network, are used to estimate
motion features in the near future. These two tasks use a shared video backbone
and separate MLP heads. Experiments show that our approach improves the quality
of the learned video representation over previous works, where we obtain
absolute gains of 16.0% and 11.1% in video retrieval recall on UCF101 and
HMDB51, respectively. Moreover, we find the motion prediction to be a strong
regularization for video networks, where using it as an auxiliary task improves
the accuracy of action recognition with a margin of 7.4%~13.8%.
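To make the two pretext tasks concrete, the following PyTorch sketch shows one way the joint objective could be wired up: a shared video backbone feeds two separate MLP heads, the context head is matched against keyframe features with a pairwise (InfoNCE-style) contrastive loss, and the motion head regresses near-future motion features. The backbone interface, feature dimensions, equal loss weighting, and the flat MLP standing in for the encoder-decoder are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code): joint context-matching and
# motion-prediction objectives on top of a shared video backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(in_dim, out_dim, hidden=512):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
                         nn.Linear(hidden, out_dim))

class ContextMotionPretext(nn.Module):
    def __init__(self, backbone, feat_dim=512, ctx_dim=128, mot_dim=128):
        super().__init__()
        self.backbone = backbone                 # shared 3D video encoder, assumed to return (B, feat_dim)
        self.ctx_head = mlp(feat_dim, ctx_dim)   # context-matching head
        self.mot_head = mlp(feat_dim, mot_dim)   # motion-prediction head (flat MLP stand-in for the encoder-decoder)

    def forward(self, clips, keyframe_feats, future_motion_feats, tau=0.07):
        v = self.backbone(clips)                          # (B, feat_dim)

        # Context matching: pairwise contrastive (InfoNCE) loss between
        # clip features and the keyframe (context) features in the batch.
        z = F.normalize(self.ctx_head(v), dim=1)          # (B, ctx_dim)
        k = F.normalize(keyframe_feats, dim=1)            # (B, ctx_dim), precomputed keyframe features
        logits = z @ k.t() / tau                          # similarity of each clip to every keyframe
        targets = torch.arange(z.size(0), device=z.device)
        loss_ctx = F.cross_entropy(logits, targets)       # positives lie on the diagonal

        # Motion prediction: regress near-future motion features.
        m_pred = self.mot_head(v)                         # (B, mot_dim)
        loss_mot = F.mse_loss(m_pred, future_motion_feats)

        return loss_ctx + loss_mot                        # equal weighting, as a simplification
```

In the paper the context and motion targets come from the compressed bitstream itself (keyframes and H.264 motion vectors, extractable at over 500 fps on the CPU); in this sketch they are simply passed in as precomputed tensors of the expected shape.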
Related papers
- VideoPrism: A Foundational Visual Encoder for Video Understanding [90.01845485201746]
VideoPrism is a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model.
We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text.
We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks.
arXiv Detail & Related papers (2024-02-20T18:29:49Z)
- Temporal Alignment Networks for Long-term Video [103.69904379356413]
We propose a temporal alignment network that ingests long-term video sequences and associated text sentences.
We train such networks from large-scale datasets, such as HowTo100M, where the associated text sentences have significant noise.
Our proposed model, trained on HowTo100M, outperforms strong baselines (CLIP, MIL-NCE) on this alignment dataset.
arXiv Detail & Related papers (2022-04-06T17:59:46Z)
- Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting [2.2530496464901106]
"Video Cross-Stream Prototypical Contrasting" is a novel method which predicts consistent prototype assignments from both RGB and optical flow views.
We obtain state-of-the-art results on nearest neighbour video retrieval and action recognition.
arXiv Detail & Related papers (2021-06-18T13:57:51Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use a contrastive loss with video clips as the instances and learn visual representations by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive playback speed, exploiting the relative speed between two video clips as the label; a minimal sketch of this idea appears after the related-papers list.
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
- Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework [43.002621928500425]
We propose a self-supervised method to learn feature representations from videos.
We extend the set of negative samples by introducing intra-negative samples.
We conduct experiments on video retrieval and video recognition tasks using the learned video representation.
arXiv Detail & Related papers (2020-08-06T09:08:14Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
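As referenced in the RSPNet entry above, the relative playback-speed idea can be sketched in a few lines: sample two clips from the same video with different frame strides and train a small head to classify their relative speed. The sampling scheme, three-way label, and feature dimensions below are illustrative assumptions, not RSPNet's actual implementation.

```python
# Minimal sketch (not RSPNet's code): relative playback-speed labels
# from two clips of the same video sampled at different frame strides.
import torch
import torch.nn as nn

def sample_clip(video, stride, length=8):
    # video: (T, C, H, W) tensor of frames; assumes T >= stride * length.
    idx = torch.arange(0, stride * length, stride)
    return video[idx]                                    # (length, C, H, W)

def relative_speed_label(stride_a, stride_b):
    # 0: clip a is slower than clip b, 1: same speed, 2: clip a is faster.
    if stride_a < stride_b:
        return 0
    if stride_a == stride_b:
        return 1
    return 2

class RelSpeedHead(nn.Module):
    def __init__(self, feat_dim=512, num_classes=3):
        super().__init__()
        self.fc = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, feat_a, feat_b):
        # Concatenate the two clip features and classify their relative speed.
        return self.fc(torch.cat([feat_a, feat_b], dim=1))

# Toy usage: random tensors stand in for backbone features of the two clips.
feat_a, feat_b = torch.randn(4, 512), torch.randn(4, 512)
labels = torch.tensor([relative_speed_label(1, 2)] * 4)  # stride 1 vs. stride 2 -> clip a is slower
loss = nn.CrossEntropyLoss()(RelSpeedHead()(feat_a, feat_b), labels)
```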
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.