How can objects help action recognition?
- URL: http://arxiv.org/abs/2306.11726v1
- Date: Tue, 20 Jun 2023 17:56:16 GMT
- Title: How can objects help action recognition?
- Authors: Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid
- Abstract summary: We investigate how we can use knowledge of objects to design better video models.
First, we propose an object-guided token sampling strategy that enables us to retain a small fraction of the input tokens.
Second, we propose an object-aware attention module that enriches our feature representation with object information.
- Score: 74.29564964727813
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current state-of-the-art video models process a video clip as a long sequence
of spatio-temporal tokens. However, they do not explicitly model objects or their
interactions across the video, and instead process all the tokens in the video.
In this paper, we investigate how we can use knowledge of objects to design
better video models, namely to process fewer tokens and to improve recognition
accuracy. This is in contrast to prior works which either drop tokens at the
cost of accuracy, or increase accuracy whilst also increasing the computation
required. First, we propose an object-guided token sampling strategy that
enables us to retain a small fraction of the input tokens with minimal impact
on accuracy. And second, we propose an object-aware attention module that
enriches our feature representation with object information and improves
overall accuracy. Our resulting framework achieves better performance when
using fewer tokens than strong baselines. In particular, we match our baseline
with 30%, 40%, and 60% of the input tokens on SomethingElse,
Something-something v2, and Epic-Kitchens, respectively. When we use our model
to process the same number of tokens as our baseline, we improve by 0.6 to 4.2
points on these datasets.
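The two components described in the abstract lend themselves to a compact illustration. Below is a minimal PyTorch sketch, not the authors' implementation: it assumes object boxes come from an off-the-shelf detector and that each patch token has known spatial coordinates, and the names object_guided_sampling, ObjectAwareAttention, and num_keep, as well as the centre-in-box scoring rule, are illustrative assumptions.

```python
# A minimal sketch (not the authors' code) of the two ideas in the abstract:
# (1) object-guided token sampling keeps the patch tokens that overlap
#     externally supplied object boxes, and
# (2) an object-aware attention step lets the surviving tokens attend to
#     pooled object features.
# All names, shapes, and the centre-in-box scoring rule are illustrative.
import torch
import torch.nn as nn


def object_guided_sampling(tokens, patch_boxes, object_boxes, num_keep):
    """Keep num_keep tokens per clip, preferring tokens inside object boxes.

    tokens:       (B, N, C) patch tokens for one clip
    patch_boxes:  (N, 4) xyxy extent of each patch token in frame coordinates
    object_boxes: (B, M, 4) xyxy boxes from an off-the-shelf detector
    """
    # Score each token by how many object boxes contain its centre; tokens
    # outside every box can still be selected to fill the budget.
    centres = (patch_boxes[:, :2] + patch_boxes[:, 2:]) / 2              # (N, 2)
    cx, cy = centres[:, 0], centres[:, 1]                                # (N,)
    x1, y1, x2, y2 = object_boxes.unbind(dim=-1)                         # each (B, M)
    inside = (
        (cx[None, None, :] >= x1[:, :, None]) & (cx[None, None, :] <= x2[:, :, None])
        & (cy[None, None, :] >= y1[:, :, None]) & (cy[None, None, :] <= y2[:, :, None])
    )                                                                    # (B, M, N)
    scores = inside.float().sum(dim=1)                                   # (B, N)
    keep = scores.topk(num_keep, dim=1).indices                          # (B, num_keep)
    keep = keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    return torch.gather(tokens, 1, keep)                                 # (B, num_keep, C)


class ObjectAwareAttention(nn.Module):
    """Cross-attention from patch tokens to object features (illustrative)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens, object_feats):
        # tokens: (B, K, C); object_feats: (B, M, C), e.g. RoI-pooled features.
        enriched, _ = self.attn(query=tokens, key=object_feats, value=object_feats)
        return self.norm(tokens + enriched)                              # residual update


if __name__ == "__main__":
    B, N, M, C = 2, 196, 5, 256
    tokens = torch.randn(B, N, C)
    patch_boxes = torch.rand(N, 4).sort(dim=-1).values * 224             # dummy xyxy boxes
    object_boxes = torch.rand(B, M, 4).sort(dim=-1).values * 224
    kept = object_guided_sampling(tokens, patch_boxes, object_boxes, num_keep=64)
    out = ObjectAwareAttention(C)(kept, torch.randn(B, M, C))
    print(kept.shape, out.shape)                                         # (2, 64, 256) twice
```

A real video model would presumably apply the sampling per frame or per tubelet and interleave the object-aware attention with its standard self-attention blocks; the sketch only shows the token-level operations.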
Related papers
- Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers [32.167072183575925]
We propose a method that reduces the number of visual tokens by summarizing them into a smaller set of register tokens.
Our method, Victor, shows less than a 4% accuracy drop while reducing the total training time by 43% and boosting the inference throughput by 3.3x.
arXiv Detail & Related papers (2024-10-17T22:45:13Z) - ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens.
During inference, ElasticTok can dynamically allocate tokens when needed.
Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z) - SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference [45.11612407862277]
In vision-language models (VLMs), visual tokens usually incur significant computational overhead.
We propose SparseVLM, an efficient training-free token optimization mechanism that requires no extra parameters or fine-tuning.
Experimental results show that our SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks.
arXiv Detail & Related papers (2024-10-06T09:18:04Z) - AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z) - Efficient Video Action Detection with Token Dropout and Context Refinement [67.10895416008911]
We propose an end-to-end framework for efficient video action detection built on vision transformers (ViTs).
In a video clip, we keep the tokens from its keyframe while preserving tokens relevant to actor motions from other frames and dropping the rest.
Second, we refine scene context by leveraging the remaining tokens to better recognize actor identities.
arXiv Detail & Related papers (2023-04-17T17:21:21Z) - Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost.
Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z) - TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? [89.17394772676819]
We introduce a novel visual representation learning which relies on a handful of adaptively learned tokens.
Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks.
arXiv Detail & Related papers (2021-06-21T17:55:59Z) - DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input.
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31% to 37% and improves the throughput by over 40%.
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
arXiv Detail & Related papers (2021-06-03T17:57:41Z)
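Several of the entries above (DynamicViT, Adaptive Sparse ViT, the token dropout framework) reduce computation by discarding tokens based on the input. Below is a minimal sketch of this kind of score-based dynamic token pruning; the class TokenPruner, the keep_ratio parameter, and the hard top-k selection are illustrative assumptions rather than any of the published implementations (DynamicViT, for instance, trains its pruning decisions with a differentiable Gumbel-Softmax mask and applies hard selection only at inference).

```python
# Illustrative score-based token pruning (not the DynamicViT code): a small
# MLP head predicts a keep-score per patch token and only the top-scoring
# fraction is forwarded to the next transformer block. The class token is
# always retained.
import torch
import torch.nn as nn


class TokenPruner(nn.Module):
    def __init__(self, dim, keep_ratio=0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score_head = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, x):
        # x: (B, 1 + N, C), with the class token at index 0.
        cls_tok, patches = x[:, :1], x[:, 1:]
        scores = self.score_head(patches).squeeze(-1)                 # (B, N)
        num_keep = max(1, int(patches.size(1) * self.keep_ratio))
        keep = scores.topk(num_keep, dim=1).indices                   # (B, num_keep)
        keep = keep.unsqueeze(-1).expand(-1, -1, patches.size(-1))
        kept = torch.gather(patches, 1, keep)                         # (B, num_keep, C)
        return torch.cat([cls_tok, kept], dim=1)


if __name__ == "__main__":
    pruner = TokenPruner(dim=384, keep_ratio=0.66)
    x = torch.randn(2, 1 + 196, 384)
    print(pruner(x).shape)  # torch.Size([2, 130, 384]): 1 cls token + 129 kept patches
```

A hard top-k is not differentiable, which is why methods in this family typically rely on a relaxed (e.g. Gumbel-Softmax) mask during training and apply the hard selection shown here only at inference, where the throughput gains are realized.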
This list is automatically generated from the titles and abstracts of the papers on this site.