Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes
- URL: http://arxiv.org/abs/2510.27255v2
- Date: Mon, 03 Nov 2025 07:33:58 GMT
- Title: Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes
- Authors: Yehna Kim, Young-Eun Kim, Seong-Whan Lee
- Abstract summary: We propose an innovative approach that harnesses web-crawled descriptions, leveraging a large language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. In our zero-shot experiments, our model achieves accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively.
- Score: 54.50887214639301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Models (VLMs) have demonstrated impressive capabilities in zero-shot action recognition by learning to associate video embeddings with class embeddings. However, a significant challenge arises when relying solely on action classes to provide semantic context, particularly due to the presence of multi-semantic words, which can introduce ambiguity in understanding the intended concepts of actions. To address this issue, we propose an innovative approach that harnesses web-crawled descriptions, leveraging a large language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. Additionally, we introduce a spatio-temporal interaction module designed to focus on objects and action units, facilitating alignment between description attributes and video content. In our zero-shot experiments, our model attains accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the model's adaptability and effectiveness across various downstream tasks.
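The pipeline described in the abstract is straightforward to sketch: crawl descriptions for each action class, have an LLM distill them into keyword attributes, embed the attributes with the VLM's text encoder, and score each video against the per-class attribute embeddings. Below is a minimal illustrative sketch; `llm_extract_keywords` and the embedding inputs are hypothetical stand-ins, not the authors' code.

```python
import numpy as np

def llm_extract_keywords(description: str, k: int = 5) -> list[str]:
    """Stand-in for the LLM step that distills a web-crawled description
    into attribute keywords; here we simply keep the longest words."""
    words = sorted(set(description.lower().split()), key=len, reverse=True)
    return words[:k]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def classify(video_emb: np.ndarray,
             attr_embs: dict[str, list[np.ndarray]]) -> str:
    """Zero-shot: pick the class whose description-attribute embeddings
    align best, on average, with the video embedding."""
    scores = {cls: float(np.mean([cosine(video_emb, e) for e in embs]))
              for cls, embs in attr_embs.items()}
    return max(scores, key=scores.get)
```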
Related papers
- Active Zero: Self-Evolving Vision-Language Models through Active Environment Exploration [72.84714132070404]
We propose a framework that shifts from passive interaction to active exploration of visual environments. Active-Zero employs three co-evolving agents, including a Searcher that retrieves images from open-world repositories based on the model's capability frontier. On Qwen2.5-VL-7B-Instruct across 12 benchmarks, Active-Zero reaches 53.97 average accuracy on reasoning tasks (a 5.7% improvement) and 59.77 on general understanding (a 3.9% improvement).
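The blurb names only the Searcher's role, but its selection idea can be hedged into a tiny sketch: score retrieved candidates by how uncertain the current model is about them and keep those near its capability frontier. Everything below (entropy as the frontier proxy, the function names) is an illustrative assumption, not the paper's algorithm.

```python
import numpy as np

def predictive_entropy(probs: np.ndarray) -> float:
    """Shannon entropy of a class-probability vector."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def select_frontier_samples(candidate_probs: list[np.ndarray],
                            budget: int) -> list[int]:
    """Keep the indices of the `budget` most uncertain candidates,
    i.e. the retrieved images the current model is least sure about."""
    ranked = sorted(range(len(candidate_probs)),
                    key=lambda i: predictive_entropy(candidate_probs[i]),
                    reverse=True)
    return ranked[:budget]
```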
arXiv Detail & Related papers (2026-02-11T17:29:17Z)
- Leveraging Vision-Language Large Models for Interpretable Video Action Recognition with Semantic Tokenization [1.6799377888527687]
This paper introduces a novel framework that pioneers the application of pre-trained Vision-Language Large Models (LVLMs) to video action recognition. Our method features a Video-to-Semantic-Tokens (VST) Module, which innovatively transforms raw video sequences into discrete, semantically and temporally consistent "semantic action tokens". These tokens, combined with natural language instructions, are then processed by a LoRA-fine-tuned LVLM for robust action classification and semantic reasoning.
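One plausible reading of the VST idea, sketched minimally below: quantize per-frame features against a codebook so each frame becomes a discrete token id that an LVLM can consume. The codebook, feature shapes, and nearest-neighbor assignment are illustrative assumptions; the actual module is learned end to end.

```python
import torch
import torch.nn.functional as F

def video_to_semantic_tokens(frame_feats: torch.Tensor,
                             codebook: torch.Tensor) -> torch.Tensor:
    """frame_feats: (T, D) per-frame features; codebook: (K, D) token
    embeddings. Returns (T,) integer token ids, one per frame."""
    f = F.normalize(frame_feats, dim=-1)   # cosine-normalize features
    c = F.normalize(codebook, dim=-1)      # and codebook entries
    return (f @ c.T).argmax(dim=-1)        # nearest codebook entry per frame

# 16 frames of 512-d features against a 1024-entry codebook:
tokens = video_to_semantic_tokens(torch.randn(16, 512), torch.randn(1024, 512))
# `tokens` could then be verbalized and passed, with an instruction,
# to the LoRA-tuned LVLM.
```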
arXiv Detail & Related papers (2025-09-06T12:11:43Z)
- Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning [71.02843679746563]
In egocentric video understanding, the motion of hands and objects, as well as their interactions, naturally plays a significant role. In this work, we aim to integrate the modeling of fine-grained hand-object dynamics into the video representation learning process. We propose EgoVideo, a model with a new lightweight motion adapter that captures fine-grained hand-object motion information.
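A lightweight adapter of this kind is easy to sketch: a small bottleneck MLP applied to frame-to-frame feature differences (a cheap motion signal), added residually to the backbone features. The dimensions and design below are assumptions for illustration, not EgoVideo's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionAdapter(nn.Module):
    """Bottleneck MLP over temporal feature differences, residual add."""
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) frame features; differences carry motion cues.
        motion = x[:, 1:] - x[:, :-1]            # (B, T-1, D)
        motion = F.pad(motion, (0, 0, 1, 0))     # pad back to (B, T, D)
        return x + self.up(F.relu(self.down(motion)))  # residual update

out = MotionAdapter()(torch.randn(2, 8, 768))  # shape-preserving: (2, 8, 768)
```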
arXiv Detail & Related papers (2025-03-02T18:49:48Z)
- Text-Enhanced Zero-Shot Action Recognition: A training-free approach [13.074211474150914]
We propose Text-Enhanced Action Recognition (TEAR) for zero-shot video action recognition.
TEAR is training-free and does not require the availability of training data or extensive computational resources.
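A training-free recipe in this spirit can be sketched with a frozen CLIP model: embed several LLM-generated descriptions per class, ensemble them, and match videos by cosine similarity. The class names and descriptions below are placeholder assumptions; TEAR's actual prompts and decoding differ.

```python
import clip   # pip install git+https://github.com/openai/CLIP.git
import torch

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder descriptions; an LLM would generate these per class.
class_descriptions = {
    "archery": ["a person drawing a bow and aiming at a distant target"],
    "juggling": ["a person tossing and catching several balls in the air"],
}

with torch.no_grad():
    class_embs = {}
    for cls, descs in class_descriptions.items():
        tokens = clip.tokenize(descs).to(device)
        t = model.encode_text(tokens).float()
        t = t / t.norm(dim=-1, keepdim=True)
        class_embs[cls] = t.mean(dim=0)   # ensemble over descriptions

# Videos are then scored by the mean of their normalized frame embeddings
# (model.encode_image on preprocessed frames) against each class vector.
```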
arXiv Detail & Related papers (2024-08-29T10:20:05Z)
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction. To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism. Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation [114.72734384299476]
We propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information.
We leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings.
Our approach significantly boosts the capacity of segmentation models for unseen classes.
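One illustrative reading of "class embeddings as anchors" is a cosine objective that pulls vision features toward the frozen text embedding of their class; the sketch below is that reading, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def anchor_loss(vision_feats: torch.Tensor,
                class_anchor: torch.Tensor) -> torch.Tensor:
    """vision_feats: (N, D) features assigned to one class;
    class_anchor: (D,) frozen text embedding of that class."""
    v = F.normalize(vision_feats, dim=-1)
    a = F.normalize(class_anchor, dim=-1)
    return (1 - v @ a).mean()   # mean cosine distance to the anchor

loss = anchor_loss(torch.randn(32, 512), torch.randn(512))
```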
arXiv Detail & Related papers (2024-03-13T11:23:55Z)
- Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition [63.95111791861103]
Existing methods typically adapt pretrained image-text models to the video domain.
We argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition.
Our method not only sets new SOTA performance but also possesses excellent interpretability.
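The prompt-construction step can be hedged into a short sketch: query an LLM for visible attributes of each action and fold them into the class prompt before text encoding. `ask_llm` and the prompt wording are hypothetical stand-ins.

```python
def build_prompts(action: str, ask_llm) -> list[str]:
    """Fold LLM-suggested visual attributes into class prompts."""
    reply = ask_llm(
        f"List three visible characteristics of the action '{action}', "
        "one short phrase per line."
    )
    attributes = [line.strip() for line in reply.splitlines() if line.strip()]
    base = f"a video of a person {action}"
    return [base] + [f"{base}, {attr}" for attr in attributes]

# The prompts are then embedded with the VLM text encoder and ensembled
# (e.g., averaged) into one knowledge-enriched class embedding.
```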
arXiv Detail & Related papers (2023-12-04T02:31:38Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for persuasion strategy identification, a crucial task in computational social science.
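The verbalize-then-understand pipeline reduces to two steps, sketched below: caption sampled frames into a story, then pose the downstream question to a text-only model. `caption_frame` and `llm_answer` are hypothetical stand-ins for an image captioner and an LLM.

```python
def video_to_story(frames, caption_frame) -> str:
    """Verbalize a video: one caption per sampled frame, joined in order."""
    return " ".join(caption_frame(f) for f in frames)

def answer_about_video(frames, question, caption_frame, llm_answer) -> str:
    """Run a video-understanding task on the generated story, not pixels."""
    story = video_to_story(frames, caption_frame)
    prompt = f"Story: {story}\nQuestion: {question}\nAnswer:"
    return llm_answer(prompt)
```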
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- Knowledge Prompting for Few-shot Action Recognition [20.973999078271483]
We propose a simple yet effective method, called knowledge prompting, to prompt a powerful vision-language model for few-shot classification.
We first collect large-scale language descriptions of actions, defined as text proposals, to build an action knowledge base.
We feed these text proposals into the pre-trained vision-language model along with video frames to generate matching scores of the proposals to each frame.
Extensive experiments on six benchmark datasets demonstrate that our method generally achieves state-of-the-art performance while reducing the training overhead to 0.1% of that of existing methods.
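The scoring step is concrete enough to sketch directly: cosine similarity between every text-proposal embedding and every frame embedding yields a (proposals × frames) score matrix for a temporal model to consume. The inputs below are random placeholders standing in for frozen-encoder outputs.

```python
import torch
import torch.nn.functional as F

def proposal_frame_scores(text_embs: torch.Tensor,
                          frame_embs: torch.Tensor) -> torch.Tensor:
    """text_embs: (P, D) proposal embeddings; frame_embs: (T, D) frame
    embeddings. Returns a (P, T) matrix of cosine matching scores."""
    t = F.normalize(text_embs, dim=-1)
    f = F.normalize(frame_embs, dim=-1)
    return t @ f.T

# 100 text proposals against 8 frames -> (100, 8) score matrix, which a
# small temporal network can aggregate for few-shot classification.
scores = proposal_frame_scores(torch.randn(100, 512), torch.randn(8, 512))
```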
arXiv Detail & Related papers (2022-11-22T06:05:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.