MGCA-Net: Multi-Grained Category-Aware Network for Open-Vocabulary Temporal Action Localization
- URL: http://arxiv.org/abs/2511.13039v1
- Date: Mon, 17 Nov 2025 06:40:02 GMT
- Title: MGCA-Net: Multi-Grained Category-Aware Network for Open-Vocabulary Temporal Action Localization
- Authors: Zhenying Fang, Richang Hong
- Abstract summary: OV-TAL aims to recognize and localize instances of any desired action categories in videos without explicitly curating training data for all categories. Existing methods mostly recognize action categories at a single granularity, which degrades the recognition accuracy of both base and novel action categories. We propose a Multi-Grained Category-Aware Network (MGCA-Net) comprising a localizer, an action presence predictor, a conventional classifier, and a coarse-to-fine classifier.
- Score: 51.56484100374058
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-Vocabulary Temporal Action Localization (OV-TAL) aims to recognize and localize instances of any desired action categories in videos without explicitly curating training data for all categories. Existing methods mostly recognize action categories at a single granularity, which degrades the recognition accuracy of both base and novel action categories. To address this issue, we propose a Multi-Grained Category-Aware Network (MGCA-Net) comprising a localizer, an action presence predictor, a conventional classifier, and a coarse-to-fine classifier. Specifically, the localizer localizes category-agnostic action proposals. For these action proposals, the action presence predictor estimates the probability that they belong to an action instance. At the same time, the conventional classifier predicts the probability of each action proposal over base action categories at the snippet granularity. Novel action categories are recognized by the coarse-to-fine classifier, which first identifies which novel categories are present at the video granularity, yielding a coarse candidate set, and then assigns each action proposal to one of these coarse categories at the proposal granularity. Through coarse-to-fine category awareness for novel actions and the conventional classifier's awareness of base actions, multi-grained category awareness is achieved, effectively enhancing localization performance. Comprehensive evaluations on the THUMOS'14 and ActivityNet-1.3 benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, our MGCA-Net achieves state-of-the-art results under the Zero-Shot Temporal Action Localization setting.
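The flow described in the abstract can be illustrated with a short, self-contained sketch. This is not the authors' implementation; every module design, name, dimension, and threshold below is an assumption, made only to show how category-agnostic proposals, per-proposal actionness, snippet-level base-category scores, and video-then-proposal (coarse-to-fine) novel-category scores could fit together.

```python
# Minimal sketch of the inference flow described in the abstract (NOT the
# authors' code). All module designs, names, shapes, and thresholds are assumptions.
import torch
import torch.nn as nn

class MGCASketch(nn.Module):
    def __init__(self, feat_dim=512, num_base=100, num_novel=50):
        super().__init__()
        # Localizer: for simplicity, treat each snippet position as one candidate
        # proposal and regress its (start, end) boundaries.
        self.localizer = nn.Linear(feat_dim, feat_dim)
        self.boundary_head = nn.Linear(feat_dim, 2)
        # Action presence predictor: probability that a proposal is a real action instance.
        self.presence = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())
        # Conventional classifier: base-category scores at snippet granularity.
        self.base_cls = nn.Linear(feat_dim, num_base)
        # Coarse-to-fine classifier: video-level novel-category presence (coarse),
        # then per-proposal assignment among the detected coarse categories (fine).
        self.video_novel_cls = nn.Linear(feat_dim, num_novel)
        self.proposal_novel_cls = nn.Linear(feat_dim, num_novel)

    def forward(self, snippet_feats):
        # snippet_feats: (T, feat_dim) snippet features of one video
        prop_feats = self.localizer(snippet_feats)             # (T, D) proposal features
        boundaries = self.boundary_head(prop_feats)            # (T, 2) start/end offsets
        presence = self.presence(prop_feats)                   # (T, 1) actionness
        base_scores = self.base_cls(prop_feats).softmax(-1)    # (T, num_base)
        # Coarse stage: which novel categories appear anywhere in the video.
        video_feat = snippet_feats.mean(dim=0)
        coarse_mask = self.video_novel_cls(video_feat).sigmoid() > 0.5   # (num_novel,)
        # Fine stage: restrict per-proposal novel scores to the coarse candidate set.
        novel_scores = self.proposal_novel_cls(prop_feats).softmax(-1)
        novel_scores = novel_scores * coarse_mask.float()
        return boundaries, presence, base_scores, novel_scores

if __name__ == "__main__":
    model = MGCASketch()
    feats = torch.randn(128, 512)   # 128 snippets with 512-d features
    bnd, pres, base_s, novel_s = model(feats)
    print(bnd.shape, pres.shape, base_s.shape, novel_s.shape)
```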
Related papers
- Category-Adaptive Cross-Modal Semantic Refinement and Transfer for Open-Vocabulary Multi-Label Recognition [59.203152078315235]
We propose a novel category-adaptive cross-modal semantic refinement and transfer (C$^2$SRT) framework to explore the semantic correlation. The proposed framework consists of two complementary modules, i.e., the intra-category semantic refinement (ISR) module and the inter-category semantic transfer (IST) module. Experiments on OV-MLR benchmarks clearly demonstrate that the proposed C$^2$SRT framework outperforms current state-of-the-art algorithms.
arXiv Detail & Related papers (2024-12-09T04:00:18Z) - Towards Completeness: A Generalizable Action Proposal Generator for Zero-Shot Temporal Action Localization [31.82121743586165]
The Generalizable Action Proposal generator (GAP) is built on a query-based architecture and trained with a proposal-level objective.
Based on this architecture, we propose an Action-aware Discrimination loss to enhance the category-agnostic dynamic information of actions.
Our experiments show that our GAP achieves state-of-the-art performance on two challenging ZSTAL benchmarks.
arXiv Detail & Related papers (2024-08-25T09:07:06Z) - Classification Matters: Improving Video Action Detection with Class-Specific Attention [61.14469113965433]
Video action detection (VAD) aims to detect actors and classify their actions in a video.
We analyze how prevailing methods form features for classification and find that they prioritize actor regions.
We propose to reduce the bias toward actors and encourage attention to the context that is relevant to each action class.
arXiv Detail & Related papers (2024-07-29T04:43:58Z) - Open-Vocabulary Temporal Action Localization using Multimodal Guidance [67.09635853019005]
OVTAL enables a model to recognize any desired action category in videos without the need to explicitly curate training data for all categories.
This flexibility poses significant challenges, as the model must recognize not only the action categories seen during training but also novel categories specified at inference.
We introduce OVFormer, a novel open-vocabulary framework extending ActionFormer with three key contributions.
arXiv Detail & Related papers (2024-06-21T18:00:05Z) - Dual-Modal Prompting for Sketch-Based Image Retrieval [76.12076969949062]
We propose a dual-modal CLIP (DP-CLIP) network, in which an adaptive prompting strategy is designed.
We employ a set of images within the target category and the textual category label to respectively construct a set of category-adaptive prompt tokens and channel scales.
Our DP-CLIP outperforms the state-of-the-art fine-grained zero-shot method by 7.3% in Acc.@1 on the Sketchy dataset.
arXiv Detail & Related papers (2024-04-29T13:43:49Z) - Multi-modal Prompting for Low-Shot Temporal Action Localization [95.19505874963751]
We consider the problem of temporal action localization under the low-shot (zero-shot and few-shot) scenario.
We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification.
arXiv Detail & Related papers (2023-03-21T10:40:13Z) - Unifying Few- and Zero-Shot Egocentric Action Recognition [3.1368611610608848]
We propose a new set of splits derived from the EPIC-KITCHENS dataset that allow evaluation of open-set classification.
We show that adding a metric-learning loss to the conventional direct-alignment baseline can improve zero-shot classification by as much as 10%.
arXiv Detail & Related papers (2020-05-27T02:23:38Z)