It Takes Two: Masked Appearance-Motion Modeling for Self-supervised
Video Transformer Pre-training
- URL: http://arxiv.org/abs/2210.05234v1
- Date: Tue, 11 Oct 2022 08:05:18 GMT
- Title: It Takes Two: Masked Appearance-Motion Modeling for Self-supervised
Video Transformer Pre-training
- Authors: Yuxin Song, Min Yang, Wenhao Wu, Dongliang He, Fu Li and Jingdong Wang
- Abstract summary: Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline.
We explicitly investigate motion cues in videos as an extra prediction target and propose our Masked Appearance-Motion Modeling framework.
Our method learns generalized video representations and achieves 82.3% on Kinetics-400, 71.3% on Something-Something V2, 91.5% on UCF101, and 62.5% on HMDB51.
- Score: 76.69480467101143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised video transformer pre-training has recently benefited from
the mask-and-predict pipeline. Such methods have demonstrated outstanding effectiveness
on downstream video tasks and superior data efficiency on small datasets.
However, temporal relations are not fully exploited by these methods. In this
work, we explicitly investigate motion cues in videos as an extra prediction
target and propose our Masked Appearance-Motion Modeling (MAM2) framework.
Specifically, we design an encoder-regressor-decoder pipeline for this task.
The regressor separates feature encoding from pretext task completion, so that
feature extraction is handled adequately by the encoder. In order to guide the
encoder to fully exploit spatial-temporal features, two separate decoders are
used for the two disentangled pretext tasks of appearance and motion
prediction. We explore various motion prediction targets and find that RGB
difference is simple yet effective. For appearance prediction, VQGAN codes are
leveraged as the prediction target. With our pre-training pipeline, convergence
can be remarkably sped up; e.g., we require only half the epochs of the
state-of-the-art VideoMAE (400 vs. 800) to achieve competitive performance.
Extensive experimental results show that our method learns generalized video
representations. Notably, our MAM2 with ViT-B achieves 82.3% on Kinetics-400,
71.3% on Something-Something V2, 91.5% on UCF101, and 62.5% on HMDB51.
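To make the encoder-regressor-decoder description above concrete, here is a minimal PyTorch-style sketch of such a pipeline. It is a hypothetical illustration, not the authors' implementation: the patch embedding, layer sizes, and mask handling are placeholders, positional embeddings and the actual VQGAN tokenizer are omitted, and the two decoders are reduced to linear heads.

```python
# Hypothetical sketch of a masked appearance-motion pipeline (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MAM2Sketch(nn.Module):
    def __init__(self, dim=768, depth=12, reg_depth=4, heads=12,
                 num_vq_codes=8192, patch_pixels=3 * 16 * 16):
        super().__init__()
        self.patch_embed = nn.Linear(patch_pixels, dim)        # simplified patch embedding
        self.encoder = nn.TransformerEncoder(                  # sees only visible tokens
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.regressor = nn.TransformerEncoder(                # separates encoding from pretext tasks
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), reg_depth)
        self.appearance_head = nn.Linear(dim, num_vq_codes)    # "decoder" 1: VQGAN code prediction
        self.motion_head = nn.Linear(dim, patch_pixels)        # "decoder" 2: RGB-difference regression

    def forward(self, visible_patches, num_masked):
        # visible_patches: (B, N_vis, patch_pixels); positional embeddings omitted for brevity
        vis = self.encoder(self.patch_embed(visible_patches))
        mask = self.mask_token.expand(visible_patches.size(0), num_masked, -1)
        latent = self.regressor(torch.cat([vis, mask], dim=1))
        masked = latent[:, vis.size(1):]                       # predict only at masked positions
        return self.appearance_head(masked), self.motion_head(masked)

def rgb_difference(frames):
    # frames: (B, T, C, H, W); frame-to-frame difference as a cheap motion target
    return frames[:, 1:] - frames[:, :-1]

def mam2_loss(app_logits, motion_pred, vq_targets, rgbdiff_targets):
    # appearance: classify VQGAN code indices; motion: regress RGB-difference patches
    ce = F.cross_entropy(app_logits.flatten(0, 1), vq_targets.flatten())
    mse = F.mse_loss(motion_pred, rgbdiff_targets)
    return ce + mse
```

In this arrangement the regressor, not the encoder, absorbs the mask tokens and the pretext-task reasoning, which is how the abstract motivates keeping feature extraction entirely inside the encoder that is later transferred to downstream tasks.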
Related papers
- SimulFlow: Simultaneously Extracting Feature and Identifying Target for
Unsupervised Video Object Segmentation [28.19471998380114]
Unsupervised video object segmentation (UVOS) aims at detecting the primary objects in a given video sequence without any human intervention.
Most existing methods rely on two-stream architectures that separately encode the appearance and motion information before fusing them to identify the target and generate object masks.
We propose a novel UVOS model called SimulFlow that simultaneously performs feature extraction and target identification.
arXiv Detail & Related papers (2023-11-30T06:44:44Z) - VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking [57.552798046137646]
Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models.
We successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-29T14:28:41Z) - You Can Ground Earlier than See: An Effective and Efficient Pipeline for
Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z) - SVFormer: Semi-supervised Video Transformer for Action Recognition [88.52042032347173]
We introduce SVFormer, which adopts a steady pseudo-labeling framework to cope with unlabeled video samples.
In addition, we propose temporal warping to cover the complex temporal variation in videos.
In particular, SVFormer outperforms the state-of-the-art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400.
arXiv Detail & Related papers (2022-11-23T18:58:42Z) - ETAD: A Unified Framework for Efficient Temporal Action Detection [70.21104995731085]
Untrimmed video understanding such as temporal action detection (TAD) often suffers from the pain of huge demand for computing resources.
We build a unified framework for efficient end-to-end temporal action detection (ETAD).
ETAD achieves state-of-the-art performance on both THUMOS-14 and ActivityNet-1.3.
arXiv Detail & Related papers (2022-05-14T21:16:21Z) - Self-supervised Video Representation Learning with Cross-Stream
Prototypical Contrasting [2.2530496464901106]
"Video Cross-Stream Prototypical Contrasting" is a novel method which predicts consistent prototype assignments from both RGB and optical flow views.
We obtain state-of-the-art results on nearest neighbour video retrieval and action recognition.
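As a rough sketch of that cross-stream idea, the loss below lets each stream's soft prototype assignment supervise the other stream's prediction; the plain softmax assignment and the temperature are illustrative simplifications, not necessarily the paper's exact procedure.

```python
# Illustrative cross-stream swapped-prediction loss (simplified, hypothetical).
import torch
import torch.nn.functional as F

def cross_stream_proto_loss(rgb_feat, flow_feat, prototypes, temp=0.1):
    # rgb_feat, flow_feat: (B, D) L2-normalized clip embeddings from each stream
    # prototypes: (K, D) learnable prototype vectors (also L2-normalized)
    scores_rgb = rgb_feat @ prototypes.T / temp
    scores_flow = flow_feat @ prototypes.T / temp
    with torch.no_grad():                          # assignments act as soft targets
        q_rgb = F.softmax(scores_rgb, dim=-1)
        q_flow = F.softmax(scores_flow, dim=-1)
    # each stream predicts the other stream's assignment (swapped prediction)
    loss = -(q_flow * F.log_softmax(scores_rgb, dim=-1)).sum(-1).mean()
    loss = loss - (q_rgb * F.log_softmax(scores_flow, dim=-1)).sum(-1).mean()
    return loss
```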
arXiv Detail & Related papers (2021-06-18T13:57:51Z) - Real-time Face Mask Detection in Video Data [0.5371337604556311]
We present a robust deep learning pipeline that is capable of identifying correct and incorrect mask-wearing from real-time video streams.
We devised two separate approaches and evaluated their performance and run-time efficiency.
arXiv Detail & Related papers (2021-05-05T01:03:34Z) - RSPNet: Relative Speed Perception for Unsupervised Video Representation
Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
arXiv Detail & Related papers (2020-10-27T16:42:50Z)