MAR: Masked Autoencoders for Efficient Action Recognition
- URL: http://arxiv.org/abs/2207.11660v1
- Date: Sun, 24 Jul 2022 04:27:36 GMT
- Title: MAR: Masked Autoencoders for Efficient Action Recognition
- Authors: Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Xiang Wang, Yuehuan Wang,
Yiliang Lv, Changxin Gao, Nong Sang
- Abstract summary: Vision Transformers (ViT) can complement spatio-temporal contexts given only limited visual contents.
MAR reduces redundant computation by discarding a proportion of patches and operating only on a part of the videos.
MAR consistently outperforms existing ViT models by a notable margin.
- Score: 46.10824456139004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Standard approaches for video recognition usually operate on the full input
videos, which is inefficient due to the widely present spatio-temporal
redundancy in videos. Recent progress in masked video modelling, i.e.,
VideoMAE, has shown the ability of vanilla Vision Transformers (ViT) to
complement spatio-temporal contexts given only limited visual contents.
Inspired by this, we propose Masked Action Recognition (MAR), which
reduces the redundant computation by discarding a proportion of patches and
operating only on a part of the videos. MAR contains the following two
indispensable components: cell running masking and bridging classifier.
Specifically, to enable the ViT to perceive the details beyond the visible
patches easily, cell running masking is presented to preserve the
spatio-temporal correlations in videos, which ensures the patches at the same
spatial location can be observed in turn for easy reconstructions.
Additionally, we notice that, although the partially observed features can
reconstruct semantically explicit invisible patches, they fail to achieve
accurate classification. To address this, a bridging classifier is proposed to
bridge the semantic gap between the ViT encoded features for reconstruction and
the features specialized for classification. Our proposed MAR reduces the
computational cost of ViT by 53% and extensive experiments show that MAR
consistently outperforms existing ViT models by a notable margin. In particular,
we find that a ViT-Large trained with MAR outperforms a ViT-Huge trained with a
standard training scheme by convincing margins on both the Kinetics-400 and
Something-Something v2 datasets, while the computational overhead of our
ViT-Large is only 14.5% of that of ViT-Huge.
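A minimal sketch of the cell running masking idea described above, under assumed settings: the token grid of each frame is divided into small spatial cells (here 2x2), and within every cell the visible positions shift cyclically from frame to frame, so that each spatial location is observed in turn across the clip. The cell size, the running order, and the number of visible positions per cell are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import numpy as np

def cell_running_mask(num_frames, grid_h, grid_w, cell=2, visible_per_cell=1):
    """Boolean visibility mask of shape (num_frames, grid_h, grid_w).

    Within every cell x cell spatial cell, visible_per_cell positions are
    kept visible in each frame, and the visible positions shift cyclically
    over time so that every location in the cell is observed in turn.
    """
    assert grid_h % cell == 0 and grid_w % cell == 0
    cell_area = cell * cell
    mask = np.zeros((num_frames, grid_h, grid_w), dtype=bool)

    # Assumed running order inside a 2x2 cell: (0,0) -> (0,1) -> (1,1) -> (1,0).
    order = ([(0, 0), (0, 1), (1, 1), (1, 0)] if cell == 2
             else [(r, c) for r in range(cell) for c in range(cell)])

    for t in range(num_frames):
        for k in range(visible_per_cell):
            # Spread the visible positions evenly around the cycle, then
            # advance the whole pattern by one step per frame.
            dr, dc = order[(t + k * (cell_area // visible_per_cell)) % cell_area]
            mask[t, dr::cell, dc::cell] = True  # same offset in every cell
    return mask

if __name__ == "__main__":
    # A ViT-B/16 on 224x224 frames yields a 14x14 token grid per frame.
    m = cell_running_mask(num_frames=8, grid_h=14, grid_w=14,
                          cell=2, visible_per_cell=2)
    print("visible token ratio:", m.mean())                       # 0.5, i.e. 50% masking
    print("every location seen at least once:", m.any(0).all())   # True
```

With two visible positions per 2x2 cell, half of the tokens are discarded in every frame, which is broadly consistent with the roughly 53% reduction in ViT computation quoted in the abstract; the bridging classifier is a separate head and is not sketched here.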
Related papers
- Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z)
- Efficient Video Action Detection with Token Dropout and Context Refinement [67.10895416008911]
We propose an end-to-end framework for efficient video action detection built on vision transformers (ViTs).
First, in a video clip, we maintain the tokens of its keyframe while preserving only the tokens relevant to actor motions from other frames.
Second, we refine the scene context by leveraging the remaining tokens to better recognize actor identities.
arXiv Detail & Related papers (2023-04-17T17:21:21Z)
- DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks [76.24996889649744]
We study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS).
We propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos.
Our model sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets.
arXiv Detail & Related papers (2023-04-02T16:40:42Z)
- SVFormer: Semi-supervised Video Transformer for Action Recognition [88.52042032347173]
We introduce SVFormer, which adopts a steady pseudo-labeling framework to cope with unlabeled video samples.
In addition, we propose a temporal warping augmentation to cover the complex temporal variations in videos.
In particular, SVFormer outperforms the state-of-the-art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400.
arXiv Detail & Related papers (2022-11-23T18:58:42Z)
- Video Mask Transfiner for High-Quality Video Instance Segmentation [102.50936366583106]
Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.
Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction.
We compare VMT with the most recent state-of-the-art methods on HQ-YTVIS, as well as on the YouTube-VIS, OVIS and BDD100K MOTS benchmarks.
arXiv Detail & Related papers (2022-07-28T11:13:37Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other ViT-based few-shot learning frameworks and is the first to achieve higher performance than CNN-based state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z)