Token Shift Transformer for Video Classification
- URL: http://arxiv.org/abs/2108.02432v1
- Date: Thu, 5 Aug 2021 08:04:54 GMT
- Title: Token Shift Transformer for Video Classification
- Authors: Hao Zhang, Yanbin Hao, Chong-Wah Ngo
- Abstract summary: Transformer achieves remarkable successes in understanding 1- and 2-dimensional signals.
Its encoders naturally contain computationally intensive operations such as pair-wise self-attention.
This paper presents Token Shift Module (i.e., TokShift) for modeling temporal relations within each transformer encoder.
- Score: 34.05954523287077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer achieves remarkable successes in understanding 1- and
2-dimensional signals (e.g., NLP and Image Content Understanding). As a
potential alternative to convolutional neural networks, it shares merits of
strong interpretability, high discriminative power on hyper-scale data, and
flexibility in processing varying-length inputs. However, its encoders
naturally contain computationally intensive operations such as pair-wise
self-attention, incurring a heavy computational burden when applied to
complex 3-dimensional video signals.
This paper presents Token Shift Module (i.e., TokShift), a novel,
zero-parameter, zero-FLOPs operator, for modeling temporal relations within
each transformer encoder. Specifically, the TokShift merely shifts a portion of
the [Class] token features temporally back and forth across adjacent frames. Then, we
densely plug the module into each encoder of a plain 2D vision transformer for
learning 3D video representation. It is worth noting that our TokShift
transformer is a purely convolution-free video transformer, piloting
computationally efficient video understanding. Experiments on standard
benchmarks verify its robustness, effectiveness, and efficiency. Particularly,
with input clips of 8/12 frames, the TokShift transformer achieves SOTA
precision: 79.83%/80.40% on the Kinetics-400, 66.56% on EGTEA-Gaze+, and 96.80%
on UCF-101 datasets, comparable to or better than existing SOTA convolutional
counterparts. Our code is open-sourced at:
https://github.com/VideoNetworks/TokShift-Transformer.
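For intuition, the sketch below illustrates a TokShift-style operation in PyTorch: a fraction of the [Class] token channels is shifted one frame forward in time, another fraction one frame backward, and the vacated slots are zero-filled, so the operator adds no parameters and no FLOPs. The tensor layout, the `shift_ratio` value, and the function name are illustrative assumptions rather than the authors' exact code (see the linked repository for the official implementation).

```python
import torch


def tok_shift(x: torch.Tensor, num_frames: int, shift_ratio: float = 0.25) -> torch.Tensor:
    """Shift part of the [Class] token channels across adjacent frames.

    Assumed layout (illustration only, not the official implementation):
    x has shape (batch * num_frames, num_tokens, dim), with token 0 as the [Class] token.
    """
    bt, n, d = x.shape
    b = bt // num_frames
    x = x.reshape(b, num_frames, n, d)

    cls = x[:, :, 0, :]                 # (b, T, d): one [Class] token per frame
    fold = int(d * shift_ratio) // 2    # channels shifted in each temporal direction

    shifted = cls.clone()
    # First chunk: shift forward in time (frame t receives frame t-1; frame 0 is zero-filled).
    shifted[:, 1:, :fold] = cls[:, :-1, :fold]
    shifted[:, 0, :fold] = 0
    # Second chunk: shift backward in time (frame t receives frame t+1; last frame is zero-filled).
    shifted[:, :-1, fold:2 * fold] = cls[:, 1:, fold:2 * fold]
    shifted[:, -1, fold:2 * fold] = 0

    out = x.clone()
    out[:, :, 0, :] = shifted           # patch tokens are left untouched
    return out.reshape(bt, n, d)


# Usage sketch: an 8-frame clip through a ViT-B/16-style encoder (197 tokens of width 768).
feats = torch.randn(2 * 8, 197, 768)
feats = tok_shift(feats, num_frames=8)
```

In the paper's design, such a shift is plugged into every encoder block of a plain 2D vision transformer, which is how temporal relations are modeled without any extra parameters or computation.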
Related papers
- Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation [73.31524865643709]
We present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D pose estimation from videos.
Our HoT begins with pruning pose tokens of redundant frames and ends with recovering full-length tokens, resulting in a few pose tokens in the intermediate transformer blocks.
Our method can achieve both high efficiency and estimation accuracy compared to the original VPT models.
arXiv Detail & Related papers (2023-11-20T18:59:51Z)
- EgoViT: Pyramid Video Transformer for Egocentric Action Recognition [18.05706639179499]
Capturing interaction of hands with objects is important to autonomously detect human actions from egocentric videos.
We present a pyramid video transformer with a dynamic class token generator for egocentric action recognition.
arXiv Detail & Related papers (2023-03-15T20:33:50Z)
- On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition [18.557920268145818]
Video vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks.
Our work empirically explores the low data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting.
We even show that using just the labeled data, transformers significantly outperform complex semi-supervised CNN methods that leverage large-scale unlabeled data as well.
arXiv Detail & Related papers (2022-09-15T17:12:30Z)
- Cats: Complementary CNN and Transformer Encoders for Segmentation [13.288195115791758]
We propose a model with double encoders for 3D biomedical image segmentation.
We fuse the information from the convolutional encoder and the transformer, and pass it to the decoder to obtain the results.
Compared to the state-of-the-art models with and without transformers on each task, our proposed method obtains higher Dice scores across the board.
arXiv Detail & Related papers (2022-08-24T14:25:11Z)
- Efficient Attention-free Video Shift Transformers [56.87581500474093]
This paper tackles the problem of efficient video recognition.
Video transformers have recently dominated the efficiency (top-1 accuracy vs FLOPs) spectrum.
We extend our formulation to the video domain to construct the Video Affine-Shift Transformer.
arXiv Detail & Related papers (2022-08-23T17:48:29Z)
- Deep Hyperspectral Unmixing using Transformer Network [7.3050653207383025]
We propose a novel deep unmixing model with transformers.
The proposed model is a combination of a convolutional autoencoder and a transformer.
The data are reconstructed using a convolutional decoder.
arXiv Detail & Related papers (2022-03-31T14:47:36Z)
- UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning [68.55487598401788]
Recent advances in this research have been mainly driven by 3D convolutional neural networks and vision transformers.
We propose a novel Unified transFormer (UniFormer) which seamlessly integrates the merits of 3D convolution and self-attention in a concise transformer format.
We conduct extensive experiments on the popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2.
Our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring 10x fewer GFLOPs than other state-of-the-art methods.
arXiv Detail & Related papers (2022-01-12T20:02:32Z)
- nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on an empirical combination of self-attention and convolution.
nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is a Unet-like pure Transformer for medical image segmentation.
The tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture.
arXiv Detail & Related papers (2021-05-12T09:30:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information (including all content) and is not responsible for any consequences of its use.