HaltingVT: Adaptive Token Halting Transformer for Efficient Video
Recognition
- URL: http://arxiv.org/abs/2401.04975v1
- Date: Wed, 10 Jan 2024 07:42:55 GMT
- Title: HaltingVT: Adaptive Token Halting Transformer for Efficient Video
Recognition
- Authors: Qian Wu, Ruoxuan Cui, Yuke Li, Haoqi Zhu
- Abstract summary: Action recognition in videos poses a challenge due to its high computational cost.
We propose HaltingVT, an efficient video transformer adaptively removing redundant video patch tokens.
On the Mini-Kinetics dataset, we achieved 75.0% top-1 ACC with 24.2 GFLOPs, as well as 67.2% top-1 ACC with an extremely low 9.9 GFLOPs.
- Score: 11.362605513514943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Action recognition in videos poses a challenge due to its high computational
cost, especially for Joint Space-Time video transformers (Joint VT). Despite
their effectiveness, the excessive number of tokens in such architectures
significantly limits their efficiency. In this paper, we propose HaltingVT, an
efficient video transformer adaptively removing redundant video patch tokens,
which is primarily composed of a Joint VT and a Glimpser module. Specifically,
HaltingVT applies data-adaptive token reduction at each layer, resulting in a
significant reduction in the overall computational cost. In addition, the Glimpser
module quickly removes redundant tokens in shallow transformer layers; based on our
observations, such tokens may even be misleading for video recognition tasks.
To further encourage HaltingVT to focus on the key motion-related information
in videos, we design an effective Motion Loss during training. HaltingVT
acquires video analysis capabilities and token halting compression strategies
simultaneously in a unified training process, without requiring additional
training procedures or sub-networks. On the Mini-Kinetics dataset, we achieved
75.0% top-1 ACC with 24.2 GFLOPs, as well as 67.2% top-1 ACC with an extremely
low 9.9 GFLOPs. The code is available at
https://github.com/dun-research/HaltingVT.
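For readers who want a concrete picture of the token-halting idea, here is a minimal PyTorch-style sketch, not the authors' implementation: each block predicts a per-token halting score, scores are accumulated across layers, and tokens whose cumulative score crosses a threshold are pruned from later layers, in the spirit of Adaptive Computation Time. The block structure, halting head, and threshold below are illustrative assumptions; the paper's Glimpser module and Motion Loss are not modeled here.

```python
# Minimal sketch of ACT-style per-token halting (illustrative, not HaltingVT's code).
import torch
import torch.nn as nn


class HaltingBlock(nn.Module):
    """Transformer block that also predicts a per-token halting score."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.halt_head = nn.Linear(dim, 1)  # assumed per-token halting head

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        halt = torch.sigmoid(self.halt_head(x)).squeeze(-1)  # (B, N) halting scores
        return x, halt


def run_with_halting(blocks, tokens, threshold: float = 0.9):
    """Prune tokens whose cumulative halting score exceeds `threshold`.

    Batch size 1 is assumed so pruning can use plain boolean indexing;
    batched implementations typically mask instead of physically dropping.
    """
    cum = torch.zeros(tokens.shape[:2], device=tokens.device)  # (1, N)
    for block in blocks:
        tokens, halt = block(tokens)
        cum = cum + halt
        keep = cum < threshold           # tokens still "alive"
        keep[:, 0] = True                # always keep the class token
        tokens = tokens[:, keep[0], :]   # drop halted tokens from later layers
        cum = cum[:, keep[0]]
    return tokens


if __name__ == "__main__":
    blocks = nn.ModuleList([HaltingBlock(dim=192) for _ in range(4)])
    x = torch.randn(1, 1 + 14 * 14, 192)      # class token + 14x14 patch tokens
    print(run_with_halting(blocks, x).shape)  # far fewer tokens survive to the end
```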
Related papers
- SparseTem: Boosting the Efficiency of CNN-Based Video Encoders by Exploiting Temporal Continuity [15.872209884833977]
We propose a memory-efficient scheduling method to eliminate memory overhead and an online adjustment mechanism to minimize accuracy degradation.
SparseTem achieves speedups of 1.79x for EfficientDet and 4.72x for CRNN, with minimal accuracy drop and no additional memory overhead.
arXiv Detail & Related papers (2024-10-28T07:13:25Z)
- Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference [14.030836300221756]
Sparse-Tuning is a novel parameter-efficient fine-tuning (PEFT) method that accounts for the information redundancy in images and videos.
Sparse-Tuning minimizes the quantity of tokens processed at each layer, leading to a quadratic reduction in computational and memory overhead (a rough cost sketch follows this entry).
Our results show that Sparse-Tuning reduces GFLOPs to 62%-70% of the original ViT-B while achieving state-of-the-art performance.
arXiv Detail & Related papers (2024-05-23T15:34:53Z)
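As a side note on the "quadratic reduction" claim: the dominant self-attention term scales with the square of the token count, so keeping fewer tokens pays off quadratically in that term. The tiny calculator below is a back-of-the-envelope illustration under that simplification (projections and MLPs are ignored), not a full FLOPs model of Sparse-Tuning.

```python
# Back-of-the-envelope attention cost: only the O(N^2 * d) term is counted.
def attention_flops(num_tokens: int, dim: int) -> int:
    # QK^T scores plus the attention-weighted sum of values: ~2 * N^2 * d multiply-adds.
    return 2 * num_tokens * num_tokens * dim


if __name__ == "__main__":
    full = attention_flops(num_tokens=196, dim=768)   # ViT-B-like token count
    half = attention_flops(num_tokens=98, dim=768)    # keep half the tokens
    print(half / full)                                # 0.25: halving tokens quarters this term
```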
- Motion Guided Token Compression for Efficient Masked Video Modeling [7.548789718676737]
This paper showcases the enhanced performance achieved by increasing the frames-per-second (FPS) rate.
We also present a novel approach, Motion Guided Token Compression (MGTC), to empower Transformer models to utilize a smaller yet more representative set of tokens for comprehensive video representation.
Our experiments, conducted on the widely examined video recognition datasets Kinetics-400, UCF101, and HMDB51, demonstrate that raising the FPS rate yields top-1 accuracy improvements of over 1.6, 1.6, and 4.0 points, respectively (a simplified motion-based selection sketch follows this entry).
arXiv Detail & Related papers (2024-01-10T07:49:23Z)
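A crude way to mimic the motion-guided token selection referenced above is to score each patch by how much it changes between consecutive frames and keep only the most dynamic patches. The frame-difference scoring, patch size, and keep ratio below are illustrative assumptions, not the actual MGTC procedure.

```python
# Rough sketch of motion-based token selection (illustrative, not MGTC itself).
import torch


def select_tokens_by_motion(video: torch.Tensor, patch: int = 16, keep_ratio: float = 0.25):
    """video: (T, C, H, W) clip. Returns indices of the most dynamic patches per frame pair."""
    diff = (video[1:] - video[:-1]).abs().mean(dim=1)            # (T-1, H, W) motion proxy
    diff = diff.unfold(1, patch, patch).unfold(2, patch, patch)  # (T-1, H/p, W/p, p, p)
    score = diff.mean(dim=(-1, -2)).flatten(1)                   # (T-1, num_patches)
    k = max(1, int(keep_ratio * score.shape[1]))
    return score.topk(k, dim=1).indices                          # keep the top-k "moving" patches


if __name__ == "__main__":
    clip = torch.randn(8, 3, 224, 224)   # 8 RGB frames
    idx = select_tokens_by_motion(clip)
    print(idx.shape)                     # (7, 49): top 25% of 196 patches per frame pair
```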
- Scattering Vision Transformer: Spectral Mixing Matters [3.0665715162712837]
We present a novel approach called Scattering Vision Transformer (SVT) to tackle these challenges.
SVT incorporates a spectrally scattering network that enables the capture of intricate image details.
SVT achieves state-of-the-art performance on the ImageNet dataset with a significant reduction in the number of parameters and FLOPs.
arXiv Detail & Related papers (2023-11-02T15:24:23Z)
- Efficient Video Action Detection with Token Dropout and Context Refinement [67.10895416008911]
We propose an end-to-end framework for efficient video action detection with vision transformers (ViTs).
First, within a video clip we maintain tokens from its keyframe while preserving tokens relevant to actor motions from other frames.
Second, we refine scene context by leveraging remaining tokens for better recognizing actor identities.
arXiv Detail & Related papers (2023-04-17T17:21:21Z)
- VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking [57.552798046137646]
Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models.
We successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-29T14:28:41Z)
- EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens [57.354304637367555]
We present EVEREST, a surprisingly efficient masked video autoencoder (MVA) approach for video representation learning.
It finds tokens containing rich motion features and discards uninformative ones during both pre-training and fine-tuning.
Our method significantly reduces the computation and memory requirements of MVA.
arXiv Detail & Related papers (2022-11-19T09:57:01Z)
- Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice [111.47461527901318]
Vision Transformer (ViT) has recently demonstrated promise in computer vision problems.
ViT performance saturates quickly as depth increases, due to the observed attention collapse or patch uniformity.
We propose two techniques to mitigate the undesirable low-pass limitation.
arXiv Detail & Related papers (2022-03-09T23:55:24Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
- Token Shift Transformer for Video Classification [34.05954523287077]
Transformers achieve remarkable success in understanding 1- and 2-dimensional signals.
Their encoders naturally contain computationally intensive operations such as pair-wise self-attention.
This paper presents the Token Shift Module (i.e., TokShift) for modeling temporal relations within each transformer encoder (a generic temporal-shift sketch follows this entry).
arXiv Detail & Related papers (2021-08-05T08:04:54Z)
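Temporal token shifting is simple to write down; the sketch below moves a fraction of feature channels one step forward and one step backward along the time axis at zero parameter cost. Applying the shift to every token and using a 1/4 shift fraction are assumptions for illustration; the paper's TokShift module is more selective about which token's channels it shifts.

```python
# Sketch of a zero-parameter temporal channel shift over token features (illustrative).
import torch


def temporal_token_shift(x: torch.Tensor, shift_frac: float = 0.25) -> torch.Tensor:
    """x: (B, T, N, C) token features; returns a tensor of the same shape."""
    c = int(x.shape[-1] * shift_frac)
    out = torch.zeros_like(x)
    out[:, 1:, :, :c] = x[:, :-1, :, :c]             # these channels move forward in time
    out[:, :-1, :, c:2 * c] = x[:, 1:, :, c:2 * c]   # these channels move backward in time
    out[:, :, :, 2 * c:] = x[:, :, :, 2 * c:]        # the rest stay in place
    return out


if __name__ == "__main__":
    feats = torch.randn(2, 8, 197, 192)       # 2 clips, 8 frames, 197 tokens, 192 channels
    print(temporal_token_shift(feats).shape)  # torch.Size([2, 8, 197, 192])
```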
- Low-Fidelity End-to-End Video Encoder Pre-training for Temporal Action Localization [96.73647162960842]
Temporal action localization (TAL) is a fundamental yet challenging task in video understanding.
Existing TAL methods rely on pre-training a video encoder through action classification supervision.
We introduce a novel low-fidelity end-to-end (LoFi) video encoder pre-training method.
arXiv Detail & Related papers (2021-03-28T22:18:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.