How can objects help action recognition?
- URL: http://arxiv.org/abs/2306.11726v1
- Date: Tue, 20 Jun 2023 17:56:16 GMT
- Title: How can objects help action recognition?
- Authors: Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid
- Abstract summary: We investigate how we can use knowledge of objects to design better video models.
First, we propose an object-guided token sampling strategy that enables us to retain a small fraction of the input tokens.
Second, we propose an object-aware attention module that enriches our feature representation with object information.
- Score: 74.29564964727813
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current state-of-the-art video models process a video clip as a long sequence
of spatio-temporal tokens. However, they do not explicitly model objects or their
interactions across the video, and instead process all the tokens in the video.
In this paper, we investigate how we can use knowledge of objects to design
better video models, namely to process fewer tokens and to improve recognition
accuracy. This is in contrast to prior works which either drop tokens at the
cost of accuracy, or increase accuracy whilst also increasing the computation
required. First, we propose an object-guided token sampling strategy that
enables us to retain a small fraction of the input tokens with minimal impact
on accuracy. And second, we propose an object-aware attention module that
enriches our feature representation with object information and improves
overall accuracy. Our resulting framework achieves better performance when
using fewer tokens than strong baselines. In particular, we match our baseline
with 30%, 40%, and 60% of the input tokens on SomethingElse,
Something-something v2, and Epic-Kitchens, respectively. When we use our model
to process the same number of tokens as our baseline, we improve by 0.6 to 4.2
points on these datasets.
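The listing does not include code, but the two ideas in the abstract can be illustrated with a short, hypothetical sketch. Assuming patch tokens on a spatio-temporal grid and object boxes from an off-the-shelf detector, object-guided token sampling can be approximated by scoring each patch by its overlap with any object box and keeping only the top-scoring fraction, and object-aware attention by letting the kept tokens cross-attend to pooled object features. All function and tensor names below are illustrative and not taken from the authors' implementation.

```python
import torch

def object_guided_token_sampling(tokens, patch_boxes, object_boxes, keep_ratio=0.3):
    """Keep the fraction of patch tokens that overlaps most with detected objects.

    tokens:       (N, D) patch token embeddings
    patch_boxes:  (N, 4) xyxy box of each patch in image coordinates
    object_boxes: (M, 4) xyxy boxes from an external detector
    """
    # Pairwise intersection area between every patch and every object box.
    x1 = torch.maximum(patch_boxes[:, None, 0], object_boxes[None, :, 0])
    y1 = torch.maximum(patch_boxes[:, None, 1], object_boxes[None, :, 1])
    x2 = torch.minimum(patch_boxes[:, None, 2], object_boxes[None, :, 2])
    y2 = torch.minimum(patch_boxes[:, None, 3], object_boxes[None, :, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)   # (N, M)

    # Score each patch by its best overlap with any object and keep the top-k.
    scores = inter.max(dim=1).values                           # (N,)
    k = max(1, int(keep_ratio * tokens.shape[0]))
    keep_idx = scores.topk(k).indices
    return tokens[keep_idx], keep_idx

def object_aware_attention(tokens, object_feats):
    """Enrich the kept tokens with object information via plain cross-attention."""
    scale = tokens.shape[-1] ** 0.5
    attn = torch.softmax(tokens @ object_feats.T / scale, dim=-1)  # (K, M)
    return tokens + attn @ object_feats                            # residual update
```

In the actual model the sampling operates over the full spatio-temporal token sequence and the object-aware module is integrated into the transformer blocks; the sketch above only conveys the overall mechanism.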
Related papers
- Principles of Visual Tokens for Efficient Video Understanding [36.05950369461622]
Video understanding has made huge strides in recent years, relying largely on the power of the transformer architecture.
This has led to many creative solutions, including token merging and token selection.
While most methods succeed in reducing the cost of the model and maintaining accuracy, an interesting pattern arises: most methods do not outperform the random sampling baseline.
We propose a lightweight video model we call LITE that can select a small number of tokens effectively, outperforming state-of-the-art methods on the computation (GFLOPs) vs. accuracy trade-off.
arXiv Detail & Related papers (2024-11-20T14:09:47Z)
- Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers [32.167072183575925]
We propose Victor, a method that reduces the number of visual tokens by summarizing them into a smaller set of register tokens.
Victor shows less than a 4% accuracy drop while reducing total training time by 43% and boosting inference throughput by 3.3x.
arXiv Detail & Related papers (2024-10-17T22:45:13Z)
- ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens.
During inference, ElasticTok can dynamically allocate tokens when needed.
Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z)
- SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference [45.11612407862277]
In vision-language models (VLMs), visual tokens usually account for a significant share of the computational overhead.
We propose an efficient training-free token optimization mechanism dubbed SparseVLM without extra parameters or fine-tuning costs.
Experimental results show that our SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks.
arXiv Detail & Related papers (2024-10-06T09:18:04Z)
- AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z)
- Efficient Video Action Detection with Token Dropout and Context Refinement [67.10895416008911]
We propose an end-to-end framework for efficient video action detection built on vision transformers (ViTs).
First, in a video clip, we keep the tokens of its keyframe while preserving tokens relevant to actor motions from other frames.
Second, we refine the scene context by leveraging the remaining tokens to better recognize actor identities.
arXiv Detail & Related papers (2023-04-17T17:21:21Z)
- TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? [89.17394772676819]
We introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens.
Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks.
arXiv Detail & Related papers (2021-06-21T17:55:59Z)
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input.
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves throughput by over 40%.
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
arXiv Detail & Related papers (2021-06-03T17:57:41Z)
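Several of the related papers above (LITE's token selection, SparseVLM's sparsification, DynamicViT's pruning) share one basic mechanism: predict a per-token importance score and keep only the highest-scoring tokens. The snippet below is a minimal, hypothetical sketch of that generic mechanism, assuming a lightweight linear scoring head; it is not the implementation of any specific paper.

```python
import torch
import torch.nn as nn

class ScoreBasedTokenPruner(nn.Module):
    """Input-dependent token pruning: keep only the highest-scoring tokens."""

    def __init__(self, dim, keep_ratio=0.34):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Assumed lightweight scoring head; real methods use richer predictors.
        self.scorer = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, tokens):
        # tokens: (B, N, D) -> kept tokens: (B, K, D) with K = keep_ratio * N
        scores = self.scorer(tokens).squeeze(-1)                  # (B, N)
        k = max(1, int(self.keep_ratio * tokens.shape[1]))
        keep_idx = scores.topk(k, dim=1).indices                  # (B, K)
        batch_idx = torch.arange(tokens.shape[0]).unsqueeze(-1)   # (B, 1)
        return tokens[batch_idx, keep_idx]

# Example: prune a batch of 1568 ViT tokens down to roughly a third.
# pruner = ScoreBasedTokenPruner(dim=768)
# kept = pruner(torch.randn(2, 1568, 768))   # -> shape (2, 533, 768)
```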