Open-Vocabulary Temporal Action Localization using Multimodal Guidance
- URL: http://arxiv.org/abs/2406.15556v1
- Date: Fri, 21 Jun 2024 18:00:05 GMT
- Title: Open-Vocabulary Temporal Action Localization using Multimodal Guidance
- Authors: Akshita Gupta, Aditya Arora, Sanath Narayan, Salman Khan, Fahad Shahbaz Khan, Graham W. Taylor
- Abstract summary: OVTAL enables a model to recognize any desired action category in videos without the need to explicitly curate training data for all categories.
This flexibility poses significant challenges, as the model must recognize not only the action categories seen during training but also novel categories specified at inference.
We introduce OVFormer, a novel open-vocabulary framework extending ActionFormer with three key contributions.
- Score: 67.09635853019005
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-Vocabulary Temporal Action Localization (OVTAL) enables a model to recognize any desired action category in videos without the need to explicitly curate training data for all categories. However, this flexibility poses significant challenges, as the model must recognize not only the action categories seen during training but also novel categories specified at inference. Unlike standard temporal action localization, where training and test categories are predetermined, OVTAL requires understanding contextual cues that reveal the semantics of novel categories. To address these challenges, we introduce OVFormer, a novel open-vocabulary framework extending ActionFormer with three key contributions. First, we employ task-specific prompts as input to a large language model to obtain rich class-specific descriptions for action categories. Second, we introduce a cross-attention mechanism to learn the alignment between class representations and frame-level video features, yielding multimodal guided features. Third, we propose a two-stage training strategy that includes training on a larger vocabulary dataset and finetuning on downstream data to generalize to novel categories. OVFormer extends existing TAL methods to open-vocabulary settings. Comprehensive evaluations on the THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of our method. Code and pretrained models will be publicly released.
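The cross-attention-based multimodal guidance described in the abstract can be pictured with a minimal sketch (not the authors' released code): frame-level video features attend over LLM-derived class-description embeddings, and the guided features are then scored against those embeddings for open-vocabulary classification. The module name, dimensions, single attention layer, and cosine-similarity head below are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch, assuming precomputed frame features and LLM-derived class
# description embeddings; all names and dimensions are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalGuidance(nn.Module):
    def __init__(self, video_dim=1024, text_dim=512, embed_dim=512, num_heads=8):
        super().__init__()
        self.frame_proj = nn.Linear(video_dim, embed_dim)  # project frame features
        self.text_proj = nn.Linear(text_dim, embed_dim)    # project class descriptions
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, frame_feats, class_feats):
        # frame_feats: (B, T, video_dim) frame-level features from a video backbone
        # class_feats: (C, text_dim) class-description embeddings from an LLM/text encoder
        q = self.frame_proj(frame_feats)                                          # (B, T, E)
        kv = self.text_proj(class_feats).unsqueeze(0).expand(q.size(0), -1, -1)   # (B, C, E)
        guided, _ = self.cross_attn(q, kv, kv)   # frames attend to class semantics
        guided = guided + q                       # residual keeps the original video cues
        # Per-frame open-vocabulary logits via cosine similarity with class embeddings.
        logits = F.normalize(guided, dim=-1) @ F.normalize(self.text_proj(class_feats), dim=-1).t()
        return guided, logits                     # (B, T, E), (B, T, C)

# Example: 2 videos, 128 frames, 20 candidate categories (seen + novel at inference).
frames = torch.randn(2, 128, 1024)
classes = torch.randn(20, 512)
guided_feats, cls_logits = MultimodalGuidance()(frames, classes)
```

The residual connection is a common choice so that the class-conditioned guidance augments rather than replaces the original video representation; OVFormer's actual fusion and localization heads may differ.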
Related papers
- LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction [63.668635390907575]
Existing methods enhance open-vocabulary object detection by leveraging the robust open-vocabulary recognition capabilities of Vision-Language Models (VLMs).
We propose the Language Model Instruction (LaMI) strategy, which leverages the relationships between visual concepts and applies them within a simple yet effective DETR-like detector.
arXiv Detail & Related papers (2024-07-16T02:58:33Z)
- Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z)
- Exploration of visual prompt in Grounded pre-trained open-set detection [6.560519631555968]
We propose a novel visual prompt method that learns new category knowledge from a few labeled images.
We evaluate the method on the ODinW dataset and show that it outperforms existing prompt learning methods.
arXiv Detail & Related papers (2023-12-14T11:52:35Z)
- Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- Multi-modal Prompting for Low-Shot Temporal Action Localization [95.19505874963751]
We consider the problem of temporal action localization under the low-shot (zero-shot and few-shot) scenario.
We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification.
arXiv Detail & Related papers (2023-03-21T10:40:13Z)
- Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
arXiv Detail & Related papers (2023-03-16T09:51:41Z)
- A Simple Meta-learning Paradigm for Zero-shot Intent Classification with Mixture Attention Mechanism [17.228616743739412]
We propose a simple yet effective meta-learning paradigm for zero-shot intent classification.
To learn better semantic representations for utterances, we introduce a new mixture attention mechanism.
To strengthen the transfer ability of the model from seen classes to unseen classes, we reformulate zero-shot intent classification with a meta-learning strategy.
arXiv Detail & Related papers (2022-06-05T13:37:51Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.