Towards an Effective Action-Region Tracking Framework for Fine-grained Video Action Recognition
- URL: http://arxiv.org/abs/2511.21202v1
- Date: Wed, 26 Nov 2025 09:32:06 GMT
- Title: Towards an Effective Action-Region Tracking Framework for Fine-grained Video Action Recognition
- Authors: Baoli Sun, Yihan Wang, Xinzhu Ma, Zhihui Wang, Kun Lu, Zhiyong Wang,
- Abstract summary: Action-Region Tracking (ART) is a novel solution leveraging a query-response mechanism to discover and track the dynamics of distinctive local details. We propose a region-specific semantic activation module that employs discriminative and text-constrained semantics as queries. Experiments on widely used action recognition benchmarks demonstrate its superiority over previous state-of-the-art baselines.
- Score: 35.62986006054654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-grained action recognition (FGAR) aims to identify subtle and distinctive differences among fine-grained action categories. However, current recognition methods often capture coarse-grained motion patterns but struggle to identify subtle details in local regions evolving over time. In this work, we introduce the Action-Region Tracking (ART) framework, a novel solution leveraging a query-response mechanism to discover and track the dynamics of distinctive local details, enabling effective distinction of similar actions. Specifically, we propose a region-specific semantic activation module that employs discriminative and text-constrained semantics as queries to capture the most action-related region responses in each video frame, facilitating interaction across spatial and temporal dimensions with the corresponding video features. The captured region responses are organized into action tracklets, which characterize region-based action dynamics by linking related responses across video frames in a coherent sequence. The text-constrained queries encode nuanced semantic representations derived from textual descriptions of action labels extracted by the language branches of Visual Language Models (VLMs). To optimize the action tracklets, we design a multi-level tracklet contrastive constraint among region responses at spatial and temporal levels, enabling effective discrimination within each frame and correlation between adjacent frames. Additionally, a task-specific fine-tuning mechanism refines textual semantics such that semantic representations encoded by VLMs are preserved while optimized for task preferences. Comprehensive experiments on widely used action recognition benchmarks demonstrate its superiority over previous state-of-the-art baselines.
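The query-response mechanism and the tracklet constraint described above can be illustrated with a short sketch. The PyTorch snippet below is a minimal, hypothetical rendering rather than the authors' implementation: text-constrained queries (assumed to be pre-encoded by a frozen VLM text branch) cross-attend to each frame's spatial tokens to produce region responses, the per-query responses stacked over time act as tracklets, and a toy temporal-level contrastive term pulls each query's response toward its own response in the next frame. All names (RegionQueryResponse, tau) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionQueryResponse(nn.Module):
    """Text-constrained queries cross-attend to per-frame spatial tokens and
    return one region response per query per frame (one tracklet step)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_queries, frame_feats):
        # text_queries: (B, Q, D) -- e.g., encoded action-label descriptions
        # frame_feats:  (B, T, N, D) -- N spatial tokens per frame
        B, T, N, D = frame_feats.shape
        responses = []
        for t in range(T):
            resp, _ = self.attn(text_queries, frame_feats[:, t], frame_feats[:, t])
            responses.append(self.proj(resp))          # (B, Q, D)
        # stacking a query's responses over time yields its action tracklet
        return torch.stack(responses, dim=1)            # (B, T, Q, D)

def temporal_tracklet_contrast(tracklets, tau=0.07):
    """Toy temporal-level term: each query's response should match its own
    response in the next frame and differ from the other queries' responses."""
    B, T, Q, D = tracklets.shape
    z = F.normalize(tracklets, dim=-1)
    loss = 0.0
    for t in range(T - 1):
        logits = torch.einsum('bqd,bkd->bqk', z[:, t], z[:, t + 1]) / tau   # (B, Q, Q)
        target = torch.arange(Q, device=z.device).expand(B, Q)
        loss = loss + F.cross_entropy(logits.reshape(B * Q, Q), target.reshape(-1))
    return loss / max(T - 1, 1)
```

The paper's full multi-level constraint also discriminates region responses within each frame at the spatial level; only the temporal pairing between adjacent frames is sketched here.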
Related papers
- Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning [56.6025512458557]
Motion-language retrieval aims to bridge the semantic gap between natural language and human motion. Existing approaches predominantly focus on aligning entire motion sequences with global textual representations. We propose a novel Pyramidal Shapley-Taylor (PST) learning framework for fine-grained motion-language retrieval.
arXiv Detail & Related papers (2026-01-29T16:00:12Z) - Watch Where You Move: Region-aware Dynamic Aggregation and Excitation for Gait Recognition [55.52723195212868]
GaitRDAE is a framework that automatically searches for motion regions, assigns adaptive temporal scales and applies corresponding attention. Experimental results show that GaitRDAE achieves state-of-the-art performance on several benchmark datasets.
arXiv Detail & Related papers (2025-10-18T15:36:08Z) - Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the state of the art while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z) - A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection [7.202931445597171]
We present a novel network that detects actions in untrimmed videos.
The network encodes the locations of action semantics in video frames utilizing motion-aware 2D positional encoding.
The approach outperforms the state-of-the-art solutions on four datasets.
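For reference, 2D positional encodings of the kind mentioned above are typically built from the standard sinusoidal scheme extended to (y, x) grids; the sketch below shows only that generic component, since the motion-aware modulation is not described in this summary. The function name and channel layout are assumptions.

```python
import torch

def sincos_2d_pos_encoding(h, w, dim):
    """Return an (h*w, dim) table: the first half of the channels encodes the
    y coordinate, the second half encodes x (standard sinusoidal scheme)."""
    assert dim % 4 == 0, "dim must be divisible by 4"
    d = dim // 2
    freqs = torch.exp(torch.arange(0, d, 2).float()
                      * (-torch.log(torch.tensor(10000.0)) / d))
    ys = torch.arange(h).float()[:, None] * freqs[None, :]        # (h, d/2)
    xs = torch.arange(w).float()[:, None] * freqs[None, :]        # (w, d/2)
    pe_y = torch.cat([ys.sin(), ys.cos()], dim=-1)                # (h, d)
    pe_x = torch.cat([xs.sin(), xs.cos()], dim=-1)                # (w, d)
    pe = torch.cat([pe_y[:, None, :].expand(h, w, d),
                    pe_x[None, :, :].expand(h, w, d)], dim=-1)    # (h, w, dim)
    return pe.reshape(h * w, dim)
```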
arXiv Detail & Related papers (2024-05-13T21:47:35Z) - Motion-state Alignment for Video Semantic Segmentation [4.375012768093524]
We propose a novel motion-state alignment framework for video semantic segmentation.
The proposed method picks up dynamic and static semantics in a targeted way.
Experiments on Cityscapes and CamVid datasets show that the proposed approach outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-04-18T08:34:46Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - Weakly-Supervised Temporal Action Localization by Inferring Salient Snippet-Feature [26.7937345622207]
Weakly-supervised temporal action localization aims to locate action regions and identify action categories in untrimmed videos simultaneously.
Pseudo label generation is a promising strategy to solve the challenging problem, but the current methods ignore the natural temporal structure of the video.
We propose a novel weakly-supervised temporal action localization method by inferring salient snippet-feature.
arXiv Detail & Related papers (2023-03-22T06:08:34Z) - Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
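As a rough illustration of what a differentiable dynamic-programming treatment of LCS can look like, the sketch below computes a smoothed longest-common-subsequence score over a frame-similarity matrix, replacing the hard max of the classic recursion with a temperature-scaled log-sum-exp so gradients reach the features. This is a generic relaxation, not the paper's FSD or LCS objectives; gamma and the cosine similarity are assumptions.

```python
import torch
import torch.nn.functional as F

def soft_lcs(x, y, gamma=0.1):
    """Smoothed LCS score between feature sequences x: (T1, D) and y: (T2, D)."""
    sim = F.normalize(x, dim=-1) @ F.normalize(y, dim=-1).t()     # (T1, T2) cosine similarity
    T1, T2 = sim.shape
    # dp[i][j] = smoothed best score over the prefixes x[:i], y[:j]
    dp = [[sim.new_zeros(()) for _ in range(T2 + 1)] for _ in range(T1 + 1)]
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            candidates = torch.stack([
                dp[i - 1][j - 1] + sim[i - 1, j - 1],   # treat the frame pair as a soft "match"
                dp[i - 1][j],                           # skip a frame of x
                dp[i][j - 1],                           # skip a frame of y
            ])
            # log-sum-exp in place of max keeps the recursion differentiable
            dp[i][j] = gamma * torch.logsumexp(candidates / gamma, dim=0)
    return dp[T1][T2]
```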
arXiv Detail & Related papers (2022-03-31T05:13:50Z) - Modeling long-term interactions to enhance action recognition [81.09859029964323]
We propose a new approach to understand actions in egocentric videos that exploits the semantics of object interactions at both frame and temporal levels.
We use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects.
The proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks.
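A minimal sketch of the primary/secondary region idea, assuming RoIAlign pooling over a hand box and several candidate object boxes followed by a simple fusion; the box source and the fusion rule are illustrative, not the cited architecture.

```python
import torch
from torchvision.ops import roi_align

def region_descriptor(feat_map, primary_box, secondary_boxes, out_size=7):
    """feat_map: (1, C, H, W) frame features; primary_box: (1, 4); secondary_boxes:
    (K, 4) with K >= 1, boxes given as (x1, y1, x2, y2) in feature-map coordinates."""
    boxes = [torch.cat([primary_box, secondary_boxes], dim=0)]    # one list entry per image
    pooled = roi_align(feat_map, boxes, output_size=out_size)     # (1 + K, C, 7, 7)
    pooled = pooled.mean(dim=(2, 3))                              # (1 + K, C) average-pool each RoI
    primary, secondary = pooled[:1], pooled[1:]
    # naive fusion: hand features concatenated with the mean of the object-region features
    return torch.cat([primary, secondary.mean(dim=0, keepdim=True)], dim=-1)   # (1, 2C)
```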
arXiv Detail & Related papers (2021-04-23T10:08:15Z) - Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
Weakly-supervised temporal action localization (WS-TAL) methods learn to localize temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z) - Exploiting Visual Semantic Reasoning for Video-Text Retrieval [14.466809435818984]
We propose a Visual Semantic Enhanced Reasoning Network (ViSERN) to exploit reasoning between frame regions.
We perform reasoning by novel random walk rule-based graph convolutional networks to generate region features involved with semantic relations.
With the benefit of reasoning, semantic interactions between regions are considered, while the impact of redundancy is suppressed.
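As a rough picture of graph convolution with a random-walk-normalized adjacency, the sketch below row-normalizes a region-affinity matrix (i.e., D^{-1}(A + I)) and propagates region features through it; the cited paper's specific random-walk rules for building the graph are not reproduced, and all names are illustrative.

```python
import torch
import torch.nn as nn

class RandomWalkGCNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Linear(dim, dim, bias=False)

    def forward(self, region_feats, adj):
        # region_feats: (N, D) features of N frame regions
        # adj: (N, N) non-negative affinities between regions (e.g., feature similarity)
        adj = adj + torch.eye(adj.size(0), device=adj.device)     # add self-loops
        adj = adj / adj.sum(dim=1, keepdim=True)                  # row-normalize: D^-1 (A + I)
        return torch.relu(adj @ self.weight(region_feats))        # propagate and transform
```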
arXiv Detail & Related papers (2020-06-16T02:56:46Z) - Retrieving and Highlighting Action with Spatiotemporal Reference [15.283548146322971]
We present a framework that jointly retrieves and spatiotemporally highlights actions in videos.
Our work takes on the novel task of action highlighting, which visualizes where and when actions occur in an untrimmed video setting.
arXiv Detail & Related papers (2020-05-19T03:12:31Z)