Generating Action-conditioned Prompts for Open-vocabulary Video Action
Recognition
- URL: http://arxiv.org/abs/2312.02226v1
- Date: Mon, 4 Dec 2023 02:31:38 GMT
- Title: Generating Action-conditioned Prompts for Open-vocabulary Video Action
Recognition
- Authors: Chengyou Jia, Minnan Luo, Xiaojun Chang, Zhuohang Dang, Mingfei Han,
Mengmeng Wang, Guang Dai, Sizhe Dang, Jingdong Wang
- Abstract summary: Existing methods typically adapt pretrained image-text models to the video domain.
We argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition.
Our method not only sets new SOTA performance but also possesses excellent interpretability.
- Score: 63.95111791861103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Exploring open-vocabulary video action recognition is a promising venture,
which aims to recognize previously unseen actions within any arbitrary set of
categories. Existing methods typically adapt pretrained image-text models to
the video domain, capitalizing on their inherent strengths in generalization. A
common thread among such methods is the augmentation of visual embeddings with
temporal information to improve the recognition of seen actions. Yet, they settle for standard, less-informative action descriptions and therefore falter when confronted with novel actions. Drawing inspiration from human cognitive
processes, we argue that augmenting text embeddings with human prior knowledge
is pivotal for open-vocabulary video action recognition. To realize this, we
innovatively blend video models with Large Language Models (LLMs) to devise
Action-conditioned Prompts. Specifically, we harness the knowledge in LLMs to
produce a set of descriptive sentences that contain distinctive features for
identifying given actions. Building upon this foundation, we further introduce
a multi-modal action knowledge alignment mechanism to align concepts in video
and textual knowledge encapsulated within the prompts. Extensive experiments on
various video benchmarks, including zero-shot, few-shot, and base-to-novel
generalization settings, demonstrate that our method not only sets new SOTA
performance but also possesses excellent interpretability.
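The abstract describes the approach only at a high level. As a rough, assumption-laden illustration of the core idea, the sketch below shows how LLM-written, action-conditioned descriptor sentences could drive zero-shot scoring with a CLIP-style dual encoder: each class gets several descriptive sentences, each sentence is embedded, and a video is assigned to the class whose descriptors it matches best. The encoders here are random placeholders, the descriptor strings are invented, and the plain average stands in for the paper's multi-modal action knowledge alignment mechanism, which the abstract does not detail.

```python
import numpy as np

# Hypothetical descriptor sentences of the kind an LLM might return when asked
# for distinctive visual cues of each action class. These strings are invented
# for illustration; they are not the prompts used in the paper.
ACTION_DESCRIPTORS = {
    "archery": [
        "a person draws a bowstring back with one arm",
        "an arrow is aimed at a distant target",
    ],
    "juggling": [
        "several objects are repeatedly tossed and caught in the air",
        "hands move quickly beneath airborne balls",
    ],
}

EMBED_DIM = 512
rng = np.random.default_rng(0)

def encode_text(sentence: str) -> np.ndarray:
    """Placeholder for the text encoder of a pretrained image-text model."""
    vec = rng.standard_normal(EMBED_DIM)
    return vec / np.linalg.norm(vec)

def encode_video(frames) -> np.ndarray:
    """Placeholder for the video-side encoder (e.g., pooled frame features)."""
    vec = rng.standard_normal(EMBED_DIM)
    return vec / np.linalg.norm(vec)

def classify(frames, descriptors=ACTION_DESCRIPTORS):
    """Score each action by aggregating similarity to its descriptor prompts."""
    video_emb = encode_video(frames)
    scores = {}
    for action, sentences in descriptors.items():
        sims = [float(encode_text(s) @ video_emb) for s in sentences]
        scores[action] = sum(sims) / len(sims)  # simple average of descriptor evidence
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    predicted, per_class = classify(frames=None)  # the stub encoder ignores the frames
    print(predicted, per_class)
```

Because every class is scored purely from text descriptors at inference time, the same loop extends to any arbitrary label set, which is what makes the open-vocabulary setting tractable; how well it works then hinges on the quality of the descriptors and of the video-text alignment.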
Related papers
- Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting [28.673734895558322]
We introduce a challenging Open-set Video-based Facial Expression Recognition task, aiming to identify both known and new, unseen facial expressions.
Existing approaches use large-scale vision-language models like CLIP to identify unseen classes.
We propose a novel Human Expression-Sensitive Prompting (HESP) mechanism to significantly enhance CLIP's ability to model video-based facial expression details effectively.
arXiv Detail & Related papers (2024-04-26T01:21:08Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for a crucial task in computational social science: persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- Implicit and Explicit Commonsense for Multi-sentence Video Captioning [33.969215964292395]
We propose a novel video captioning Transformer-based model that takes into account both implicit (visuo-lingual and purely linguistic) and explicit (knowledge-base) commonsense knowledge.
We show that these forms of knowledge, in isolation and in combination, enhance the quality of produced captions.
arXiv Detail & Related papers (2023-03-14T00:19:11Z)
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner.
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
- Knowledge Prompting for Few-shot Action Recognition [20.973999078271483]
We propose a simple yet effective method, called knowledge prompting, to prompt a powerful vision-language model for few-shot classification.
We first collect large-scale language descriptions of actions, defined as text proposals, to build an action knowledge base.
We feed these text proposals into the pre-trained vision-language model along with video frames to generate matching scores of the proposals to each frame (a minimal sketch of this frame-level matching appears after this list).
Extensive experiments on six benchmark datasets demonstrate that our method generally achieves state-of-the-art performance while reducing the training overhead to 0.001× that of existing methods.
arXiv Detail & Related papers (2022-11-22T06:05:17Z)
- CLOP: Video-and-Language Pre-Training with Knowledge Regularizations [43.09248976105326]
Video-and-language pre-training has shown promising results for learning generalizable representations.
We denote this form of representation as structural knowledge, which expresses rich semantics at multiple granularities.
We propose a Cross-modaL knOwledge-enhanced Pre-training (CLOP) method with Knowledge Regularizations.
arXiv Detail & Related papers (2022-11-07T05:32:12Z)
- Learning Transferable Spatiotemporal Representations from Natural Script Knowledge [65.40899722211726]
We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts ASR scripts by attending to learned video representations.
These advantages enable our model to contextualize what is happening as humans do and to apply seamlessly to large-scale uncurated real-world video data.
arXiv Detail & Related papers (2022-09-30T07:39:48Z)
- Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features.
arXiv Detail & Related papers (2021-12-08T18:58:16Z)
- Intra- and Inter-Action Understanding via Temporal Action Parsing [118.32912239230272]
We construct a new dataset developed on sport videos with manual annotations of sub-actions, and conduct a study on temporal action parsing on top.
Our study shows that a sport activity usually consists of multiple sub-actions and that the awareness of such temporal structures is beneficial to action recognition.
We also investigate a number of temporal parsing methods, and thereon devise an improved method that is capable of mining sub-actions from training data without knowing their labels.
arXiv Detail & Related papers (2020-05-20T17:45:18Z)
- A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos [126.66212285239624]
We propose a benchmark of structured procedural knowledge extracted from cooking videos.
Our manually annotated open-vocabulary resource includes 356 instructional cooking videos and 15,523 video clip/sentence-level annotations.
arXiv Detail & Related papers (2020-05-02T05:15:20Z)
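As referenced in the Knowledge Prompting entry above, the sketch below illustrates the frame-level matching idea in its simplest form: embed each text proposal and each frame with a shared-space encoder pair, build a frames-by-proposals similarity matrix, and pool it over time into a compact feature that a lightweight few-shot classifier could consume. The encoders are random placeholders, the proposal strings are invented, and the mean pooling is purely illustrative; none of this is that paper's actual implementation.

```python
import numpy as np

EMBED_DIM = 512
rng = np.random.default_rng(0)

# Invented text proposals standing in for a large-scale action knowledge base.
TEXT_PROPOSALS = [
    "a hand grips a racket",
    "a ball bounces on a hard court",
    "a player swings an arm overhead",
]

def encode_text(sentence: str) -> np.ndarray:
    """Placeholder for the vision-language model's text encoder."""
    vec = rng.standard_normal(EMBED_DIM)
    return vec / np.linalg.norm(vec)

def encode_frame(frame) -> np.ndarray:
    """Placeholder for the vision-language model's image encoder on one frame."""
    vec = rng.standard_normal(EMBED_DIM)
    return vec / np.linalg.norm(vec)

def proposal_score_matrix(frames, proposals=TEXT_PROPOSALS) -> np.ndarray:
    """Cosine-similarity matrix of shape (num_frames, num_proposals)."""
    frame_embs = np.stack([encode_frame(f) for f in frames])   # (T, D)
    text_embs = np.stack([encode_text(p) for p in proposals])  # (K, D)
    return frame_embs @ text_embs.T                            # (T, K)

def video_level_feature(frames) -> np.ndarray:
    """Temporally pool the per-frame proposal scores into one (K,)-dim feature."""
    return proposal_score_matrix(frames).mean(axis=0)

if __name__ == "__main__":
    dummy_clip = [None] * 8  # eight placeholder "frames"; the stub encoder ignores them
    print(video_level_feature(dummy_clip).shape)  # -> (3,)
```

A temporal model over the rows of the score matrix could replace the mean pooling; the entry above does not specify how the per-frame scores are aggregated, so that choice is left open here.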