Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation
- URL: http://arxiv.org/abs/2308.04549v1
- Date: Tue, 8 Aug 2023 19:38:15 GMT
- Title: Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation
- Authors: Shuangrui Ding, Peisen Zhao, Xiaopeng Zhang, Rui Qian, Hongkai Xiong,
Qi Tian
- Abstract summary: STA score considers two critical factors: temporal redundancy and semantic importance.
We apply the STA module to off-the-shelf video Transformers and Videowins.
Results: Kinetics-400 and Something-Something V2 achieve 30% overshelf reduction with a negligible 0.2% accuracy drop.
- Score: 89.88214896713846
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have become the primary backbone of the computer vision
community due to their impressive performance. However, the unfriendly
computation cost impedes their potential in the video recognition domain. To
optimize the speed-accuracy trade-off, we propose Semantic-aware Temporal
Accumulation score (STA) to prune spatio-temporal tokens integrally. STA score
considers two critical factors: temporal redundancy and semantic importance.
The former depicts a specific region based on whether it is a new occurrence or
a seen entity by aggregating token-to-token similarity in consecutive frames
while the latter evaluates each token based on its contribution to the overall
prediction. As a result, tokens with higher scores of STA carry more temporal
redundancy as well as lower semantics thus being pruned. Based on the STA
score, we are able to progressively prune the tokens without introducing any
additional parameters or requiring further re-training. We directly apply the
STA module to off-the-shelf ViT and VideoSwin backbones, and the empirical
results on Kinetics-400 and Something-Something V2 achieve over 30% computation
reduction with a negligible ~0.2% accuracy drop. The code is released at
https://github.com/Mark12Ding/STA.
Related papers
- Accelerating Transformers with Spectrum-Preserving Token Merging [43.463808781808645]
PiToMe prioritizes the preservation of informative tokens using an additional metric termed the energy score.
Experimental findings demonstrate that PiToMe saved from 40-60% FLOPs of the base models.
arXiv Detail & Related papers (2024-05-25T09:37:01Z) - LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]
This paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information and improve inference speed.
By employing Dual Cross-Attention (DCA) in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes.
Experimental results in classification and dense prediction tasks show that LeMeViT has a significant $1.7 times$ speedup, fewer parameters, and competitive performance compared to the baseline models.
arXiv Detail & Related papers (2024-05-16T03:26:06Z) - AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z) - Efficient Video Action Detection with Token Dropout and Context
Refinement [67.10895416008911]
We propose an end-to-end framework for efficient video action detection (ViTs)
In a video clip, we maintain tokens from its perspective while preserving tokens relevant to actor motions from other frames.
Second, we refine scene context by leveraging remaining tokens for better recognizing actor identities.
arXiv Detail & Related papers (2023-04-17T17:21:21Z) - Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully
Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost.
Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z) - Paraformer: Fast and Accurate Parallel Transformer for
Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extract hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z) - Accelerating Attention through Gradient-Based Learned Runtime Pruning [9.109136535767478]
Self-attention is a key enabler of state-of-art accuracy for transformer-based Natural Language Processing models.
This paper formulates its search through a soft differentiable regularizer integrated into the loss function of the training.
We devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with bit-level early termination microarchitectural mechanism.
arXiv Detail & Related papers (2022-04-07T05:31:13Z) - DynamicViT: Efficient Vision Transformers with Dynamic Token
Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input.
By hierarchically pruning 66% of the input tokens, our method greatly reduces 31%37% FLOPs and improves the throughput by over 40%.
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
arXiv Detail & Related papers (2021-06-03T17:57:41Z) - Boosting Continuous Sign Language Recognition via Cross Modality
Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pair.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.