Efficient Video Transformers with Spatial-Temporal Token Selection
- URL: http://arxiv.org/abs/2111.11591v1
- Date: Tue, 23 Nov 2021 00:35:58 GMT
- Title: Efficient Video Transformers with Spatial-Temporal Token Selection
- Authors: Junke Wang, Xitong Yang, Hengduo Li, Zuxuan Wu, Yu-Gang Jiang
- Abstract summary: We present STTS, a token selection framework that dynamically selects a few informative tokens in both temporal and spatial dimensions conditioned on input video samples.
Our framework achieves similar results while requiring 20% less computation.
- Score: 68.27784654734396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video transformers have achieved impressive results on major video
recognition benchmarks; however, they suffer from high computational cost. In
this paper, we present STTS, a token selection framework that dynamically
selects a few informative tokens in both temporal and spatial dimensions
conditioned on input video samples. Specifically, we formulate token selection
as a ranking problem, which estimates the importance of each token through a
lightweight selection network, and only those with top scores are used for
downstream evaluation. In the temporal dimension, we keep the frames that are
most relevant for recognizing action categories, while in the spatial
dimension, we identify the most discriminative region in feature maps without
affecting spatial context used in a hierarchical way in most video
transformers. Since the decision of token selection is non-differentiable, we
employ a perturbed-maximum based differentiable Top-K operator for end-to-end
training. We conduct extensive experiments on Kinetics-400 with a recently
introduced video transformer backbone, MViT. Our framework achieves similar
results while requiring 20% less computation. We also demonstrate that our
approach is compatible with other transformer architectures.
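The abstract's two key ingredients, a lightweight token-scoring network and a perturbed-maximum differentiable Top-K operator, can be sketched in a few dozen lines. The code below is a minimal PyTorch illustration, not the released STTS implementation; the class and function names (TokenScorer, PerturbedTopK, select_tokens) and the hyperparameters (hidden width, noise scale sigma, number of noise samples) are assumptions made for exposition.

```python
import torch
import torch.nn as nn


class TokenScorer(nn.Module):
    """Lightweight MLP that assigns an importance score to every token."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) -> scores: (batch, num_tokens)
        return self.net(tokens).squeeze(-1)


class PerturbedTopK(torch.autograd.Function):
    """Perturbed-maximum relaxation of Top-K: the forward pass averages hard
    Top-K indicators under Gaussian noise; the backward pass estimates the
    Jacobian from the same noise samples, making selection differentiable."""

    @staticmethod
    def forward(ctx, scores, k, num_samples, sigma):
        noise = torch.randn(num_samples, *scores.shape, device=scores.device)
        perturbed = scores.unsqueeze(0) + sigma * noise                # (m, b, n)
        topk_idx = perturbed.topk(k, dim=-1).indices                   # (m, b, k)
        indicators = torch.zeros_like(perturbed).scatter_(-1, topk_idx, 1.0)
        ctx.save_for_backward(indicators, noise)
        ctx.num_samples, ctx.sigma = num_samples, sigma
        return indicators.mean(dim=0)                                  # soft mask, (b, n)

    @staticmethod
    def backward(ctx, grad_output):
        indicators, noise = ctx.saved_tensors
        # Monte Carlo estimate of d E[topk(scores + sigma * z)] / d scores.
        jac = torch.einsum('mbi,mbj->bij', indicators, noise) / (ctx.num_samples * ctx.sigma)
        grad_scores = torch.einsum('bi,bij->bj', grad_output, jac)
        return grad_scores, None, None, None


def select_tokens(tokens, scorer, k, training, num_samples=100, sigma=0.05):
    """Keep the k highest-scoring tokens: soft reweighting while training,
    hard indexing at inference."""
    scores = scorer(tokens)                                            # (b, n)
    if training:
        mask = PerturbedTopK.apply(scores, k, num_samples, sigma)      # (b, n) in [0, 1]
        return tokens * mask.unsqueeze(-1)                             # reweight, keep shape
    idx = scores.topk(k, dim=-1).indices                               # (b, k)
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))


# Toy usage: 2 clips, 392 spatio-temporal tokens of width 96, keep 128 tokens.
tokens = torch.randn(2, 392, 96)
kept = select_tokens(tokens, TokenScorer(96), k=128, training=False)
print(kept.shape)  # torch.Size([2, 128, 96])
```

Under these assumptions, the noise-averaged Top-K indicator acts as a smooth surrogate during training so gradients can reach the scorer, while inference falls back to a plain top-k and physically drops the discarded tokens, which is where the computation savings come from.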
Related papers
- Exploring the Design Space of Visual Context Representation in Video MLLMs [102.11582556690388]
Video Multimodal Large Language Models (MLLMs) have shown a remarkable capability for understanding video semantics across various downstream tasks.
Visual context representation refers to the scheme for selecting frames from a video and, further, tokens from each frame.
In this paper, we explore the design space for visual context representation, and aim to improve the performance of video MLLMs by finding more effective representation schemes.
arXiv Detail & Related papers (2024-10-17T15:59:52Z) - AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z) - Efficient Video Action Detection with Token Dropout and Context Refinement [67.10895416008911]
We propose an end-to-end framework based on vision transformers (ViTs) for efficient video action detection.
First, in a video clip, we maintain all tokens from its keyframe while preserving tokens relevant to actor motions from other frames.
Second, we refine the scene context by leveraging the remaining tokens to better recognize actor identities.
arXiv Detail & Related papers (2023-04-17T17:21:21Z) - EgoViT: Pyramid Video Transformer for Egocentric Action Recognition [18.05706639179499]
Capturing interaction of hands with objects is important to autonomously detect human actions from egocentric videos.
We present a pyramid video transformer with a dynamic class token generator for egocentric action recognition.
arXiv Detail & Related papers (2023-03-15T20:33:50Z) - SVFormer: Semi-supervised Video Transformer for Action Recognition [88.52042032347173]
We introduce SVFormer, which adopts a steady pseudo-labeling framework to cope with unlabeled video samples.
In addition, we propose a temporal warping augmentation to cover the complex temporal variation in videos.
In particular, SVFormer outperforms the state-of-the-art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400.
arXiv Detail & Related papers (2022-11-23T18:58:42Z) - Inductive and Transductive Few-Shot Video Classification via Appearance and Temporal Alignments [17.673345523918947]
We present a novel method for few-shot video classification, which performs appearance and temporal alignments.
Our approach achieves similar or better results than previous methods on both datasets.
arXiv Detail & Related papers (2022-07-21T23:28:52Z) - TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval [42.0544426476143]
We propose Token Shift and Selection Network (TS2-Net), a novel token shift and selection transformer architecture.
Based on thorough experiments, the proposed TS2-Net achieves state-of-the-art performance on major text-video retrieval benchmarks.
arXiv Detail & Related papers (2022-07-16T06:50:27Z) - Deformable Video Transformer [44.71254375663616]
We introduce the Deformable Video Transformer (DVT), which predicts a small subset of video patches to attend for each query location based on motion information.
Our model achieves higher accuracy at the same or lower computational cost, and it attains state-of-the-art results on four datasets.
arXiv Detail & Related papers (2022-03-31T04:52:27Z) - Is Space-Time Attention All You Need for Video Understanding? [50.78676438502343]
We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
"TimeSformer" adapts the standard Transformer architecture to video by enabling feature learning from a sequence of frame-level patches.
TimeSformer achieves state-of-the-art results on several major action recognition benchmarks.
arXiv Detail & Related papers (2021-02-09T19:49:33Z)
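As a side note on the frame-level patch tokens that TimeSformer-style video transformers operate on, the snippet below shows one standard way to flatten a clip into such tokens; the shapes (8 frames at 224x224 resolution, 16x16 patches) are illustrative assumptions and the code is not taken from the TimeSformer implementation.

```python
import torch

# Illustrative clip: batch of 1, 8 RGB frames at 224x224, split into 16x16 patches.
clip = torch.randn(1, 8, 3, 224, 224)             # (batch, frames, channels, H, W)
b, t, c, h, w = clip.shape
p = 16                                            # patch size
nh, nw = h // p, w // p                           # 14 x 14 patches per frame
patches = clip.unfold(3, p, p).unfold(4, p, p)    # (b, t, c, nh, nw, p, p)
tokens = patches.permute(0, 1, 3, 4, 2, 5, 6).reshape(b, t * nh * nw, c * p * p)
print(tokens.shape)                               # torch.Size([1, 1568, 768])
```

Each of the 1568 rows corresponds to one frame-level patch; a learned linear projection of these rows yields the token sequence over which space-time self-attention is computed.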