Multi-modal Prompting for Low-Shot Temporal Action Localization
- URL: http://arxiv.org/abs/2303.11732v1
- Date: Tue, 21 Mar 2023 10:40:13 GMT
- Title: Multi-modal Prompting for Low-Shot Temporal Action Localization
- Authors: Chen Ju, Zeqian Li, Peisen Zhao, Ya Zhang, Xiaopeng Zhang, Qi Tian,
Yanfeng Wang, Weidi Xie
- Abstract summary: We consider the problem of temporal action localization under the low-shot (zero-shot & few-shot) scenario.
We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification.
- Score: 95.19505874963751
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we consider the problem of temporal action localization under the
low-shot (zero-shot & few-shot) scenario, with the goal of detecting and
classifying action instances from arbitrary categories within untrimmed
videos, including categories not seen at training time. We adopt a
Transformer-based two-stage action localization architecture with
class-agnostic action proposals, followed by open-vocabulary classification.
We make the following contributions. First, to complement image-text
foundation models with temporal motion information, we improve class-agnostic
action proposals by explicitly aligning the embeddings of optical flow, RGB,
and text, which has largely been ignored in existing low-shot methods. Second,
to improve open-vocabulary action classification, we construct classifiers
with strong discriminative power, i.e., classifiers that avoid lexical
ambiguities. Specifically, we propose to prompt the pre-trained CLIP text
encoder either with detailed action descriptions (acquired from large-scale
language models) or with visually-conditioned, instance-specific prompt
vectors. Third, we conduct thorough experiments and ablation studies on
THUMOS14 and ActivityNet1.3, demonstrating the superior performance of our
proposed model, which outperforms existing state-of-the-art approaches by a
significant margin.
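To make the open-vocabulary classification step more concrete, here is a minimal sketch (not the authors' released code) of prompting the CLIP text encoder with LLM-style action descriptions and scoring class-agnostic proposal features against the resulting classifiers. It assumes the OpenAI `clip` package; the description texts and the proposal features are placeholders standing in for the language-model outputs and the proposal stage described in the abstract.

```python
# Minimal sketch of open-vocabulary proposal classification via description
# prompts fed to the CLIP text encoder. Assumes the OpenAI `clip` package
# (https://github.com/openai/CLIP); the descriptions and proposal features
# below are illustrative placeholders, not the authors' released code or data.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical LLM-generated descriptions per action category; the paper
# acquires such descriptions from large-scale language models.
descriptions = {
    "high jump": [
        "a person sprints toward a bar and leaps over it backwards",
        "an athlete arches over a horizontal bar and lands on a mat",
    ],
    "diving": [
        "a person jumps off a springboard and rotates before entering the water",
        "an athlete dives head-first into a swimming pool",
    ],
}

# Build one classifier vector per category: encode the description prompts,
# L2-normalise, and average them into a single text embedding.
with torch.no_grad():
    classifiers = []
    for texts in descriptions.values():
        tokens = clip.tokenize(texts).to(device)
        emb = model.encode_text(tokens).float()
        emb = emb / emb.norm(dim=-1, keepdim=True)
        classifiers.append(emb.mean(dim=0))
    classifiers = torch.stack(classifiers)
    classifiers = classifiers / classifiers.norm(dim=-1, keepdim=True)

# Placeholder proposal features: in the paper these would come from the
# class-agnostic proposal stage; random vectors of CLIP's embedding width
# are used here purely to demonstrate the scoring step.
proposal_feats = torch.randn(5, classifiers.shape[-1], device=device)
proposal_feats = proposal_feats / proposal_feats.norm(dim=-1, keepdim=True)

# Cosine similarity gives per-proposal scores over the open vocabulary.
scores = proposal_feats @ classifiers.t()
print(scores.softmax(dim=-1))
```

The visually-conditioned, instance-specific prompt vectors mentioned in the abstract would replace these fixed description prompts with learned vectors conditioned on each proposal's visual feature; that learned branch is beyond the scope of this sketch.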
Related papers
- Open-Vocabulary Temporal Action Localization using Multimodal Guidance [67.09635853019005]
OVTAL enables a model to recognize any desired action category in videos without the need to explicitly curate training data for all categories.
This flexibility poses significant challenges, as the model must recognize not only the action categories seen during training but also novel categories specified at inference.
We introduce OVFormer, a novel open-vocabulary framework extending ActionFormer with three key contributions.
arXiv Detail & Related papers (2024-06-21T18:00:05Z)
- Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization [98.66318678030491]
Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training.
We propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages.
arXiv Detail & Related papers (2023-05-29T02:48:04Z)
- Knowledge Prompting for Few-shot Action Recognition [20.973999078271483]
We propose a simple yet effective method, called knowledge prompting, to prompt a powerful vision-language model for few-shot classification.
We first collect large-scale language descriptions of actions, defined as text proposals, to build an action knowledge base.
We feed these text proposals into the pre-trained vision-language model along with video frames to generate matching scores of the proposals to each frame.
Extensive experiments on six benchmark datasets demonstrate that our method generally achieves state-of-the-art performance while reducing the training overhead to 0.001 of that of existing methods (a minimal sketch of the frame-level matching step appears after this list).
arXiv Detail & Related papers (2022-11-22T06:05:17Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- Spatio-temporal Relation Modeling for Few-shot Action Recognition [100.3999454780478]
We propose a few-shot action recognition framework, STRM, which enhances class-specific feature discriminability while simultaneously learning higher-order temporal representations.
Our approach achieves an absolute gain of 3.5% in classification accuracy, as compared to the best existing method in the literature.
arXiv Detail & Related papers (2021-12-09T18:59:14Z)
- Complementary Boundary Generator with Scale-Invariant Relation Modeling for Temporal Action Localization: Submission to ActivityNet Challenge 2020 [66.4527310659592]
This report presents an overview of our solution used in the submission to ActivityNet Challenge 2020 Task 1.
We decouple the temporal action localization task into two stages (i.e. proposal generation and classification) and enrich the proposal diversity.
Our proposed scheme achieves state-of-the-art performance on the temporal action localization task with 42.26 average mAP on the challenge testing set.
arXiv Detail & Related papers (2020-07-20T04:35:40Z)
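The knowledge-prompting entry above describes feeding a bank of text proposals, together with video frames, into a pre-trained vision-language model to obtain matching scores of the proposals against each frame. A minimal sketch of that frame-text matching, again assuming the OpenAI `clip` package and using placeholder frames and proposals rather than that paper's actual action knowledge base:

```python
# Sketch of frame-level matching between a small bank of text proposals and
# video frames, in the spirit of the knowledge-prompting summary above.
# Assumes the OpenAI `clip` package; the frames and text proposals are
# placeholders rather than the paper's action knowledge base.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical text proposals describing fine-grained sub-actions.
text_proposals = [
    "a person swings a tennis racket",
    "a person bends their knees before jumping",
    "a person raises both arms above the head",
]

# Placeholder frames: in practice these would be decoded from an untrimmed
# video; blank images keep the sketch self-contained and runnable.
frames = [Image.new("RGB", (224, 224)) for _ in range(4)]

with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(text_proposals).to(device)).float()
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    frame_batch = torch.stack([preprocess(f) for f in frames]).to(device)
    frame_emb = model.encode_image(frame_batch).float()
    frame_emb = frame_emb / frame_emb.norm(dim=-1, keepdim=True)

# (num_frames, num_proposals): one row of proposal matching scores per frame.
scores = frame_emb @ text_emb.t()
print(scores)
```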