Towards Completeness: A Generalizable Action Proposal Generator for Zero-Shot Temporal Action Localization
- URL: http://arxiv.org/abs/2408.13777v1
- Date: Sun, 25 Aug 2024 09:07:06 GMT
- Title: Towards Completeness: A Generalizable Action Proposal Generator for Zero-Shot Temporal Action Localization
- Authors: Jia-Run Du, Kun-Yu Lin, Jingke Meng, Wei-Shi Zheng
- Abstract summary: Generalizable Action Proposal generator (GAP) is built in a query-based architecture and trained with a proposal-level objective.
Based on this architecture, we propose an Action-aware Discrimination loss to enhance the category-agnostic dynamic information of actions.
Our experiments show that our GAP achieves state-of-the-art performance on two challenging ZSTAL benchmarks.
- Score: 31.82121743586165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To address the zero-shot temporal action localization (ZSTAL) task, existing works develop models that are generalizable to detect and classify actions from unseen categories. They typically develop a category-agnostic action detector and combine it with the Contrastive Language-Image Pre-training (CLIP) model to solve ZSTAL. However, these methods suffer from incomplete action proposals generated for *unseen* categories, since they follow a frame-level prediction paradigm and require hand-crafted post-processing to generate action proposals. To address this problem, in this work, we propose a novel model named Generalizable Action Proposal generator (GAP), which can interface seamlessly with CLIP and generate action proposals in a holistic way. Our GAP is built in a query-based architecture and trained with a proposal-level objective, enabling it to estimate proposal completeness and eliminate the hand-crafted post-processing. Based on this architecture, we propose an Action-aware Discrimination loss to enhance the category-agnostic dynamic information of actions. Besides, we introduce a Static-Dynamic Rectifying module that incorporates the generalizable static information from CLIP to refine the predicted proposals, which improves proposal completeness in a generalizable manner. Our experiments show that our GAP achieves state-of-the-art performance on two challenging ZSTAL benchmarks, i.e., Thumos14 and ActivityNet1.3. Specifically, our model obtains significant performance improvement over previous works on the two benchmarks, i.e., +3.2% and +3.4% average mAP, respectively.
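The abstract contrasts the frame-level paradigm, which needs hand-crafted post-processing to stitch per-frame scores into proposals, with GAP's holistic proposal-level outputs. As a minimal illustration only (not the paper's code; the score values and threshold are hypothetical), the sketch below shows the thresholding-and-merging step such frame-level pipelines rely on, and the temporal IoU commonly used to measure proposal completeness:

```python
def frames_to_proposals(scores, threshold=0.5, fps=1.0):
    """Merge consecutive above-threshold frame scores into (start, end)
    proposals, in seconds. This is the hand-crafted post-processing step
    of the frame-level paradigm; the threshold is a hypothetical
    hyper-parameter, not a value from the paper."""
    proposals, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            proposals.append((start / fps, i / fps))
            start = None
    if start is not None:
        proposals.append((start / fps, len(scores) / fps))
    return proposals


def temporal_iou(a, b):
    """Temporal IoU between two (start, end) segments, the standard
    overlap measure behind proposal-level objectives and mAP."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


# A mid-action dip in actionness splits one ground-truth action
# (spanning seconds 1-6) into two short fragments:
scores = [0.1, 0.8, 0.9, 0.3, 0.85, 0.7, 0.2]
props = frames_to_proposals(scores)          # [(1.0, 3.0), (4.0, 6.0)]
best = max(temporal_iou(p, (1.0, 6.0)) for p in props)  # 0.4
```

The low best-overlap value illustrates the proposal-incompleteness problem that a holistic, proposal-level generator such as GAP is designed to avoid.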
Related papers
- STAT: Towards Generalizable Temporal Action Localization [56.634561073746056]
Weakly-supervised temporal action localization (WTAL) aims to recognize and localize action instances with only video-level labels.
Existing methods suffer from severe performance degradation when transferring to different distributions.
We propose GTAL, which focuses on improving the generalizability of action localization methods.
arXiv Detail & Related papers (2024-04-20T07:56:21Z)
- ADM-Loc: Actionness Distribution Modeling for Point-supervised Temporal Action Localization [31.314383098734922]
This paper addresses the challenge of point-supervised temporal action detection, in which only one frame per action instance is annotated in the training set.
It proposes a novel framework termed ADM-Loc, which stands for Actionness Distribution Modeling for point-supervised action localization.
arXiv Detail & Related papers (2023-11-27T15:24:54Z)
- Investigating the Limitation of CLIP Models: The Worst-Performing Categories [53.360239882501325]
Contrastive Language-Image Pre-training (CLIP) provides a foundation model by integrating natural language into visual concepts.
It is usually expected that satisfactory overall accuracy can be achieved across numerous domains through well-designed textual prompts.
However, we found that their performance in the worst categories is significantly inferior to the overall performance.
arXiv Detail & Related papers (2023-10-05T05:37:33Z)
- Multi-modal Prompting for Low-Shot Temporal Action Localization [95.19505874963751]
We consider the problem of temporal action localization under low-shot (zero-shot & few-shot) scenario.
We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification.
arXiv Detail & Related papers (2023-03-21T10:40:13Z)
- Temporal Action Detection with Global Segmentation Mask Learning [134.26292288193298]
Existing temporal action detection (TAD) methods rely on generating an overwhelmingly large number of proposals per video.
We propose a proposal-free Temporal Action detection model with Global mask (TAGS).
Our core idea is to learn a global segmentation mask of each action instance jointly at the full video length.
arXiv Detail & Related papers (2022-07-14T00:46:51Z)
- Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos [58.83440885457272]
We address the problem of temporal sentence localization in videos (TSLV).
Traditional methods follow a top-down framework which localizes the target segment with pre-defined segment proposals.
We propose an Adaptive Proposal Generation Network (APGN) to maintain the segment-level interaction while speeding up the efficiency.
arXiv Detail & Related papers (2021-09-14T02:02:36Z)
- Temporal Action Localization Using Gated Recurrent Units [6.091096843566857]
We propose a new network based on Gated Recurrent Unit (GRU) and two novel post-processing ideas for TAL task.
Specifically, we propose a new design for the output layer of the GRU resulting in the so-called GRU-Splitted model.
We evaluate the performance of the proposed method compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-08-07T06:25:29Z)
- Complementary Boundary Generator with Scale-Invariant Relation Modeling for Temporal Action Localization: Submission to ActivityNet Challenge 2020 [66.4527310659592]
This report presents an overview of our solution used in the submission to ActivityNet Challenge 2020 Task 1.
We decouple the temporal action localization task into two stages (i.e. proposal generation and classification) and enrich the proposal diversity.
Our proposed scheme achieves state-of-the-art performance on the temporal action localization task with 42.26 average mAP on the challenge testing set.
arXiv Detail & Related papers (2020-07-20T04:35:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.