Token Shift Transformer for Video Classification
- URL: http://arxiv.org/abs/2108.02432v1
- Date: Thu, 5 Aug 2021 08:04:54 GMT
- Title: Token Shift Transformer for Video Classification
- Authors: Hao Zhang, Yanbin Hao, Chong-Wah Ngo
- Abstract summary: Transformer achieves remarkable successes in understanding 1- and 2-dimensional signals.
Its encoders naturally contain computationally intensive operations such as pair-wise self-attention.
This paper presents Token Shift Module (i.e., TokShift) for modeling temporal relations within each transformer encoder.
- Score: 34.05954523287077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer achieves remarkable successes in understanding 1- and
2-dimensional signals (e.g., NLP and Image Content Understanding). As a
potential alternative to convolutional neural networks, it shares merits of
strong interpretability, high discriminative power on hyper-scale data, and
flexibility in processing varying-length inputs. However, its encoders
naturally contain computationally intensive operations such as pair-wise
self-attention, incurring a heavy computational burden when applied to
complex 3-dimensional video signals.
This paper presents Token Shift Module (i.e., TokShift), a novel,
zero-parameter, zero-FLOPs operator, for modeling temporal relations within
each transformer encoder. Specifically, the TokShift merely shifts a portion of
the [Class] token features temporally back and forth across adjacent frames. Then, we
densely plug the module into each encoder of a plain 2D vision transformer for
learning 3D video representation. It is worth noting that our TokShift
transformer is a purely convolution-free video transformer, piloting
computationally efficient video understanding. Experiments on standard
benchmarks verify its robustness, effectiveness, and efficiency. Particularly,
with input clips of 8/12 frames, the TokShift transformer achieves SOTA
precision: 79.83%/80.40% on the Kinetics-400, 66.56% on EGTEA-Gaze+, and 96.80%
on UCF-101 datasets, comparable to or better than existing SOTA convolutional
counterparts. Our code is open-sourced at:
https://github.com/VideoNetworks/TokShift-Transformer.
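For intuition, the sketch below illustrates a TokShift-style operation in PyTorch: a fraction of the [Class] token channels is shifted one frame forward in time, another fraction one frame backward, and the vacated slots are zero-filled, so the operator adds no parameters and no FLOPs. The tensor layout, the `shift_ratio` value, and the function name are illustrative assumptions rather than the authors' exact code (see the linked repository for the official implementation).

```python
import torch


def tok_shift(x: torch.Tensor, num_frames: int, shift_ratio: float = 0.25) -> torch.Tensor:
    """Shift part of the [Class] token channels across adjacent frames.

    Assumed layout (illustration only, not the official implementation):
    x has shape (batch * num_frames, num_tokens, dim), with token 0 as the [Class] token.
    """
    bt, n, d = x.shape
    b = bt // num_frames
    x = x.reshape(b, num_frames, n, d)

    cls = x[:, :, 0, :]                 # (b, T, d): one [Class] token per frame
    fold = int(d * shift_ratio) // 2    # channels shifted in each temporal direction

    shifted = cls.clone()
    # First chunk: shift forward in time (frame t receives frame t-1; frame 0 is zero-filled).
    shifted[:, 1:, :fold] = cls[:, :-1, :fold]
    shifted[:, 0, :fold] = 0
    # Second chunk: shift backward in time (frame t receives frame t+1; last frame is zero-filled).
    shifted[:, :-1, fold:2 * fold] = cls[:, 1:, fold:2 * fold]
    shifted[:, -1, fold:2 * fold] = 0

    out = x.clone()
    out[:, :, 0, :] = shifted           # patch tokens are left untouched
    return out.reshape(bt, n, d)


# Usage sketch: an 8-frame clip through a ViT-B/16-style encoder (197 tokens of width 768).
feats = torch.randn(2 * 8, 197, 768)
feats = tok_shift(feats, num_frames=8)
```

In the paper's design, such a shift is plugged into every encoder block of a plain 2D vision transformer, which is how temporal relations are modeled without any extra parameters or computation.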
Related papers
- Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation [73.31524865643709]
We present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D pose estimation from videos.
Our HoT begins with pruning pose tokens of redundant frames and ends with recovering full-length tokens, resulting in a few pose tokens in the intermediate transformer blocks.
Our method can achieve both high efficiency and estimation accuracy compared to the original VPT models.
arXiv Detail & Related papers (2023-11-20T18:59:51Z)
- EgoViT: Pyramid Video Transformer for Egocentric Action Recognition [18.05706639179499]
Capturing interaction of hands with objects is important to autonomously detect human actions from egocentric videos.
We present a pyramid video transformer with a dynamic class token generator for egocentric action recognition.
arXiv Detail & Related papers (2023-03-15T20:33:50Z)
- On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition [18.557920268145818]
Video vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks.
Our work empirically explores the low data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting.
We even show that using just the labeled data, transformers significantly outperform complex semi-supervised CNN methods that leverage large-scale unlabeled data as well.
arXiv Detail & Related papers (2022-09-15T17:12:30Z)
- Cats: Complementary CNN and Transformer Encoders for Segmentation [13.288195115791758]
We propose a model with double encoders for 3D biomedical image segmentation.
We fuse the information from the convolutional encoder and the transformer, and pass it to the decoder to obtain the results.
Compared to the state-of-the-art models with and without transformers on each task, our proposed method obtains higher Dice scores across the board.
arXiv Detail & Related papers (2022-08-24T14:25:11Z)
- Efficient Attention-free Video Shift Transformers [56.87581500474093]
This paper tackles the problem of efficient video recognition.
Video transformers have recently dominated the efficiency (top-1 accuracy vs FLOPs) spectrum.
We extend our formulation to the video domain to construct the Video Affine-Shift Transformer.
arXiv Detail & Related papers (2022-08-23T17:48:29Z)
- Deep Hyperspectral Unmixing using Transformer Network [7.3050653207383025]
We propose a novel deep unmixing model with transformers.
The proposed model is a combination of a convolutional autoencoder and a transformer.
The data are reconstructed using a convolutional decoder.
arXiv Detail & Related papers (2022-03-31T14:47:36Z)
- UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning [68.55487598401788]
Recent advances in this research have been mainly driven by 3D convolutional neural networks and vision transformers.
We propose a novel Unified transFormer (UniFormer) which seamlessly integrates the merits of 3D convolution and self-attention in a concise transformer format.
We conduct extensive experiments on the popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2.
Our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring 10x fewer GFLOPs than other state-of-the-art methods.
arXiv Detail & Related papers (2022-01-12T20:02:32Z)
- nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on an empirical combination of self-attention and convolution.
nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is a Unet-like pure Transformer for medical image segmentation.
The tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture.
arXiv Detail & Related papers (2021-05-12T09:30:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information (including all content) and is not responsible for any consequences of its use.