SVFormer: Semi-supervised Video Transformer for Action Recognition
- URL: http://arxiv.org/abs/2211.13222v2
- Date: Thu, 6 Apr 2023 12:48:30 GMT
- Title: SVFormer: Semi-supervised Video Transformer for Action Recognition
- Authors: Zhen Xing and Qi Dai and Han Hu and Jingjing Chen and Zuxuan Wu and
Yu-Gang Jiang
- Abstract summary: We introduce SVFormer, which adopts a steady pseudo-labeling framework to cope with unlabeled video samples.
In addition, we propose a temporal warping augmentation to cover the complex temporal variation in videos.
In particular, SVFormer outperforms the state-of-the-art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400.
- Score: 88.52042032347173
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semi-supervised action recognition is a challenging but critical task due to
the high cost of video annotations. Existing approaches mainly use
convolutional neural networks, yet current revolutionary vision transformer
models have been less explored. In this paper, we investigate the use of
transformer models under the SSL setting for action recognition. To this end,
we introduce SVFormer, which adopts a steady pseudo-labeling framework (i.e.,
EMA-Teacher) to cope with unlabeled video samples. While a wide range of data
augmentations have been shown effective for semi-supervised image
classification, they generally produce limited results for video recognition.
We therefore introduce a novel augmentation strategy, Tube TokenMix, tailored
for video data where video clips are mixed via a mask with consistent masked
tokens over the temporal axis. In addition, we propose a temporal warping
augmentation to cover the complex temporal variation in videos, which stretches
selected frames to various temporal durations in the clip. Extensive
experiments on three datasets (Kinetics-400, UCF-101, and HMDB-51) verify the
advantage of SVFormer. In particular, SVFormer outperforms the state-of-the-art
by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400.
Our method can hopefully serve as a strong benchmark and encourage future
research on semi-supervised action recognition with Transformer networks.
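
The abstract describes EMA-Teacher pseudo-labeling, Tube TokenMix, and temporal warping only at a high level. The PyTorch sketch below is a rough illustration of those three ideas under stated assumptions, not the authors' implementation: the function names (`ema_update`, `tube_token_mix`, `temporal_warp`), tensor shapes, mask ratio, and stretch factors are all hypothetical.

```python
# Minimal sketch (assumed, not the authors' code) of the three ingredients
# named in the abstract: an EMA-Teacher update, a Tube TokenMix-style mask
# that is consistent along the temporal axis, and a temporal-warping
# resampler that stretches selected frames.
import torch


@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """EMA-Teacher: teacher weights track an exponential moving average of the student's."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)


def tube_token_mix(tokens_a, tokens_b, mask_ratio=0.5):
    """Mix two clips' patch tokens with a spatial mask shared across all frames.

    tokens_*: (B, T, N, D) -- B clips, T frames, N spatial tokens, D channels.
    Sampling the mask once per spatial location and repeating it over time
    makes the mixed regions form "tubes" through the clip.
    """
    B, T, N, D = tokens_a.shape
    spatial_mask = torch.rand(B, 1, N, 1, device=tokens_a.device) < mask_ratio
    mask = spatial_mask.expand(B, T, N, D)
    mixed = torch.where(mask, tokens_b, tokens_a)
    lam = 1.0 - mask_ratio  # weight of clip A's (pseudo-)label in the mixed target
    return mixed, lam


def temporal_warp(clip, stretch_frames=2, max_stretch=3):
    """Stretch a few randomly chosen frames to longer durations within the clip.

    clip: (T, C, H, W). Selected frames are repeated (stretched) and the
    resampled index sequence is cropped back to the original length T.
    """
    T = clip.shape[0]
    repeats = torch.ones(T, dtype=torch.long)
    chosen = torch.randperm(T)[:stretch_frames]
    repeats[chosen] = torch.randint(2, max_stretch + 1, (stretch_frames,))
    idx = torch.repeat_interleave(torch.arange(T), repeats)[:T]
    return clip[idx]
```

In a typical EMA-Teacher pipeline, `ema_update` would be called after each optimizer step so that the teacher producing pseudo-labels for unlabeled clips evolves smoothly; how SVFormer combines these pieces exactly is not specified in the abstract.
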
Related papers
- Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation [73.31524865643709]
We present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D pose estimation from videos.
Our HoT begins with pruning pose tokens of redundant frames and ends with recovering full-length tokens, resulting in a few pose tokens in the intermediate transformer blocks.
Our method can achieve both high efficiency and estimation accuracy compared to the original VPT models.
arXiv Detail & Related papers (2023-11-20T18:59:51Z) - It Takes Two: Masked Appearance-Motion Modeling for Self-supervised
Video Transformer Pre-training [76.69480467101143]
Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline.
We explicitly investigate motion cues in videos as an extra prediction target and propose our Masked Appearance-Motion Modeling framework.
Our method learns generalized video representations and achieves 82.3% on Kinetics-400, 71.3% on Something-Something V2, 91.5% on UCF101, and 62.5% on HMDB51.
arXiv Detail & Related papers (2022-10-11T08:05:18Z) - MAR: Masked Autoencoders for Efficient Action Recognition [46.10824456139004]
Vision Transformers (ViT) can complement missing contexts given only limited visual content.
MAR reduces redundancy by discarding a proportion of patches and operating only on part of the videos.
MAR consistently outperforms existing ViT models by a notable margin.
arXiv Detail & Related papers (2022-07-24T04:27:36Z) - BEVT: BERT Pretraining of Video Transformers [89.08460834954161]
We introduce BEVT which decouples video representation learning into spatial representation learning and temporal dynamics learning.
We conduct extensive experiments on three challenging video benchmarks where BEVT achieves very promising results.
arXiv Detail & Related papers (2021-12-02T18:59:59Z) - Self-supervised Video Transformer [46.295395772938214]
From a given video, we create local and global views with varying spatial sizes and frame rates.
Our self-supervised objective seeks to match the features of different views representing the same video, making them invariant to spatio-temporal variations.
Our approach performs well on four action benchmarks and converges faster with small batch sizes.
arXiv Detail & Related papers (2021-12-02T18:59:02Z) - Learning from Temporal Gradient for Semi-supervised Action Recognition [15.45239134477737]
We introduce temporal gradient as an additional modality for more attentive feature extraction.
Our method achieves the state-of-the-art performance on three video action recognition benchmarks.
arXiv Detail & Related papers (2021-11-25T20:30:30Z) - Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z) - ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.