Generative Model-Based Feature Attention Module for Video Action Analysis
- URL: http://arxiv.org/abs/2508.13565v1
- Date: Tue, 19 Aug 2025 06:56:02 GMT
- Title: Generative Model-Based Feature Attention Module for Video Action Analysis
- Authors: Guiqin Wang, Peng Zhao, Cong Zhao, Jing Huang, Siyan Guo, Shusen Yang
- Abstract summary: We propose a novel generative attention-based model to learn the relation of feature semantics. Our model simultaneously learns the frame- and segment-dependencies of temporal action feature semantics. We substantiate the superiority of our approach through comprehensive validation on widely recognized datasets.
- Score: 12.406387678107759
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video action analysis is a foundational technology within the realm of intelligent video comprehension, particularly concerning its application in the Internet of Things (IoT). However, existing methodologies overlook feature semantics during feature extraction and focus on optimizing action proposals; these solutions are therefore unsuitable for widespread adoption in high-performance IoT applications, such as autonomous driving, which demand robust and scalable intelligent video analytics with high precision. To address this issue, we propose a novel generative attention-based model to learn the relation of feature semantics. Specifically, by leveraging the differences between actions' foreground and background, our model simultaneously learns the frame- and segment-dependencies of temporal action feature semantics, exploiting feature semantics in the feature extraction stage effectively. To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark video tasks: action recognition and action detection. In the context of action detection, we substantiate the superiority of our approach through comprehensive validation on widely recognized datasets. Moreover, we extend the validation of our proposed method to a broader task, video action recognition. Our code is available at https://github.com/Generative-Feature-Model/GAF.
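The abstract describes learning frame-level dependencies of temporal action features via attention, but gives no implementation details. As a rough, hypothetical illustration (not the authors' actual GAF module), frame-level dependency modeling can be sketched as plain scaled dot-product self-attention over per-frame features:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(frames):
    """Generic self-attention over frame features (illustrative only).

    frames: (T, D) array, one D-dim feature vector per frame.
    Returns a (T, D) array where each frame's feature is a weighted
    mixture of all frames, so temporal dependencies are mixed in.
    """
    T, D = frames.shape
    scores = frames @ frames.T / np.sqrt(D)  # (T, T) frame-to-frame affinities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ frames                  # context-enriched frame features

# Toy clip: 8 frames with 16-dim features
rng = np.random.default_rng(0)
clip = rng.standard_normal((8, 16))
out = temporal_attention(clip)
print(out.shape)  # (8, 16)
```

This is a minimal sketch under the assumption that "frame-dependencies" are captured by pairwise attention; the paper's generative component and segment-level modeling are not represented here.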
Related papers
- From Sight to Insight: Unleashing Eye-Tracking in Weakly Supervised Video Salient Object Detection [60.11169426478452]
This paper aims to introduce fixation information to assist the detection of salient objects under weak supervision. We propose a Position and Semantic Embedding (PSE) module to provide location and semantic guidance during the feature learning process. An Intra-Inter Mixed Contrastive (MCII) model improves the temporal modeling capabilities under weak supervision.
arXiv Detail & Related papers (2025-06-30T05:01:40Z) - Feature Hallucination for Self-supervised Action Recognition [37.20267786858476]
We propose a deep translational action recognition framework that enhances recognition accuracy by jointly predicting action concepts and auxiliary features from RGB video frames. Our framework achieves state-of-the-art performance on multiple benchmarks, including Kinetics-400, Kinetics-600, and Something-Something V2, demonstrating its effectiveness in capturing fine-grained action dynamics.
arXiv Detail & Related papers (2025-06-25T11:50:23Z) - Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
arXiv Detail & Related papers (2024-10-12T06:22:23Z) - Generative Model-based Feature Knowledge Distillation for Action Recognition [11.31068233536815]
Our paper introduces an innovative knowledge distillation framework that uses a generative model to train a lightweight student model.
The efficacy of our approach is demonstrated through comprehensive experiments on diverse popular datasets.
arXiv Detail & Related papers (2023-12-14T03:55:29Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - FasterVideo: Efficient Online Joint Object Detection And Tracking [0.8680676599607126]
We re-think one of the most successful methods for image object detection, Faster R-CNN, and extend it to the video domain.
Our proposed method reaches a very high computational efficiency necessary for relevant applications.
arXiv Detail & Related papers (2022-04-15T09:25:34Z) - Continuous Human Action Recognition for Human-Machine Interaction: A Review [39.593687054839265]
Recognising actions within an input video is a challenging but necessary task for applications that require real-time human-machine interaction.
We review the feature extraction and learning strategies used in most state-of-the-art methods.
We investigate the application of such models to real-world scenarios and discuss several limitations and key research directions.
arXiv Detail & Related papers (2022-02-26T09:25:44Z) - EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z) - AR-Net: Adaptive Frame Resolution for Efficient Action Recognition [70.62587948892633]
Action recognition is an open and challenging problem in computer vision.
We propose a novel approach, called AR-Net, that selects on-the-fly the optimal resolution for each frame conditioned on the input for efficient action recognition.
arXiv Detail & Related papers (2020-07-31T01:36:04Z) - A Dependency Syntactic Knowledge Augmented Interactive Architecture for End-to-End Aspect-based Sentiment Analysis [73.74885246830611]
We propose a novel dependency syntactic knowledge augmented interactive architecture with multi-task learning for end-to-end ABSA.
This model is capable of fully exploiting the syntactic knowledge (dependency relations and types) by leveraging a well-designed Dependency Relation Embedded Graph Convolutional Network (DreGcn).
Extensive experimental results on three benchmark datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2020-04-04T14:59:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.