Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection
- URL: http://arxiv.org/abs/2302.00268v1
- Date: Wed, 1 Feb 2023 06:20:54 GMT
- Title: Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection
- Authors: Kaifeng Gao, Long Chen, Hanwang Zhang, Jun Xiao, Qianru Sun
- Abstract summary: We present Relation Prompt (RePro) for Open-vocabulary Video Visual Relation Detection (Open-VidVRD)
RePro addresses the two technical challenges of Open-VidVRD: 1) the prompt tokens should respect the two different semantic roles of subject and object, and 2) the tuning should account for the diverse spatio-temporal motion patterns of the subject-object compositions.
- Score: 67.64272825961395
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Prompt tuning with large-scale pretrained vision-language models empowers
open-vocabulary predictions trained on limited base categories, e.g., object
classification and detection. In this paper, we propose compositional prompt
tuning with motion cues: an extended prompt tuning paradigm for compositional
predictions of video data. In particular, we present Relation Prompt (RePro)
for Open-vocabulary Video Visual Relation Detection (Open-VidVRD), where
conventional prompt tuning is easily biased to certain subject-object
combinations and motion patterns. To this end, RePro addresses the two
technical challenges of Open-VidVRD: 1) the prompt tokens should respect the
two different semantic roles of subject and object, and 2) the tuning should
account for the diverse spatio-temporal motion patterns of the subject-object
compositions. Without bells and whistles, our RePro achieves a new
state-of-the-art performance on two VidVRD benchmarks of not only the base
training object and predicate categories, but also the unseen ones. Extensive
ablations also demonstrate the effectiveness of the proposed compositional and
multi-mode design of prompts. Code is available at
https://github.com/Dawn-LX/OpenVoc-VidVRD.
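The two challenges above suggest a concrete structure: separate learnable prompt tokens for the subject and object roles, plus several prompt groups ("modes") selected by a coarse motion cue of the subject-object trajectory pair. The authors' actual implementation is in the repository linked above; the following PyTorch sketch is only an illustrative re-creation of the idea, and every name in it (MotionCueRelationPrompt, num_modes, the distance-based motion heuristic) is invented here, not taken from the paper.

```python
import torch
import torch.nn as nn

class MotionCueRelationPrompt(nn.Module):
    """Minimal sketch of compositional, multi-mode prompt tuning.

    Hypothetical illustration of the abstract's idea, not the authors'
    code: subject and object get separate learnable prompt tokens, and
    one of several prompt groups ("modes") is picked from a coarse
    motion cue of the subject-object trajectory pair.
    """

    def __init__(self, embed_dim=512, n_ctx=4, num_modes=3):
        super().__init__()
        # One bank of context tokens per motion mode, kept separate for
        # the subject role and the object role.
        self.subj_ctx = nn.Parameter(torch.randn(num_modes, n_ctx, embed_dim) * 0.02)
        self.obj_ctx = nn.Parameter(torch.randn(num_modes, n_ctx, embed_dim) * 0.02)

    def motion_mode(self, subj_traj, obj_traj):
        """Map a pair of box trajectories (T, 4) to a coarse motion mode.

        Toy heuristic: 0 = moving apart, 1 = approaching, 2 = static.
        The real method would use a richer motion-pattern descriptor.
        """
        def centers(traj):
            return torch.stack([(traj[:, 0] + traj[:, 2]) / 2,
                                (traj[:, 1] + traj[:, 3]) / 2], dim=-1)
        dist = (centers(subj_traj) - centers(obj_traj)).norm(dim=-1)
        delta = dist[-1] - dist[0]
        if delta > 1.0:
            return 0
        if delta < -1.0:
            return 1
        return 2

    def forward(self, subj_emb, obj_emb, pred_emb, subj_traj, obj_traj):
        """Compose [subject prompt | subject] [predicate] [object prompt | object]."""
        m = self.motion_mode(subj_traj, obj_traj)
        return torch.cat([
            self.subj_ctx[m], subj_emb.unsqueeze(0),  # subject-role tokens
            pred_emb.unsqueeze(0),                    # predicate word embedding
            self.obj_ctx[m], obj_emb.unsqueeze(0),    # object-role tokens
        ], dim=0)


# Toy usage with random vectors standing in for word embeddings.
prompt = MotionCueRelationPrompt()
d = 512
subj_traj = torch.rand(8, 4) * 10  # (T, 4) boxes for the subject track
obj_traj = torch.rand(8, 4) * 10
seq = prompt(torch.randn(d), torch.randn(d), torch.randn(d), subj_traj, obj_traj)
print(seq.shape)  # (2 * n_ctx + 3, 512)
```

In a full Open-VidVRD pipeline, the composed token sequence would be fed through a frozen vision-language text encoder and matched against visual features of the tracked subject-object pair; here it is only assembled and shape-checked.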
Related papers
- GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation [41.67544072483324]
Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence throughout the entire video.
We propose Text-Aware Prompt Contrastive Learning (TAP-CL) to enhance the association between the position prompts and the referring sentences.
With the proposed TAP-CL, our GroPrompt framework can generate temporal-consistent yet text-aware position prompts.
arXiv Detail & Related papers (2024-06-18T17:54:17Z)
- DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control [48.41743234012456]
DisenStudio is a novel framework that can generate text-guided videos for customized multiple subjects.
DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed spatial-disentangled cross-attention mechanism.
We conduct extensive experiments to demonstrate our proposed DisenStudio significantly outperforms existing methods in various metrics.
arXiv Detail & Related papers (2024-05-21T13:44:55Z)
- Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding [108.79026216923984]
Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query.
This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task.
arXiv Detail & Related papers (2023-12-31T13:53:37Z)
- Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprising two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z)
- Multi-modal Prompting for Low-Shot Temporal Action Localization [95.19505874963751]
We consider the problem of temporal action localization under the low-shot (zero-shot & few-shot) scenario.
We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposals, followed by open-vocabulary classification.
arXiv Detail & Related papers (2023-03-21T10:40:13Z)
- Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model [39.722927180264584]
We propose a novel Dual-modality Prompt Tuning (DPT) paradigm through learning text and visual prompts simultaneously.
To make the final image feature concentrate more on the target visual concept, a Class-Aware Visual Prompt Tuning scheme is proposed. (A minimal sketch of the general dual-prompt idea appears after this list.)
arXiv Detail & Related papers (2022-08-17T15:06:36Z)
- PromptDet: Expand Your Detector Vocabulary with Uncurated Images [47.600059694034]
The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations.
We propose a two-stage open-vocabulary object detector that categorises each box proposal by a classifier generated from the text encoder of a pre-trained visual-language model.
To scale up the learning procedure towards detecting a wider spectrum of objects, we exploit the available online resources, iteratively updating the prompts, and later self-training the proposed detector with pseudo labels generated on a large corpus of noisy, uncurated web images. (A sketch of the underlying text-encoder-as-classifier recipe also appears after this list.)
arXiv Detail & Related papers (2022-03-30T17:50:21Z)
- BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues [95.8297116307127]
We propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos.
Specifically, our approach exploits both spatial- and temporal-level information, and learns dynamic information diffusion between the two feature spaces.
BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark.
arXiv Detail & Related papers (2020-10-20T07:43:00Z)
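For the Dual-modality Prompt Tuning entry above, here is the promised sketch of the general dual-prompt idea: learnable context tokens are prepended on the text side and learnable prompt tokens on the visual side, while both encoders stay frozen. The tiny transformer encoders below are stand-ins for a real pretrained vision-language model, and all class and parameter names are hypothetical, not DPT's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPromptTuning(nn.Module):
    """Minimal sketch of dual-modality prompt tuning: only the prompt
    tokens are trainable; both encoders are frozen stand-ins."""

    def __init__(self, embed_dim=128, n_text_ctx=4, n_vis_ctx=4):
        super().__init__()
        self.text_ctx = nn.Parameter(torch.randn(n_text_ctx, embed_dim) * 0.02)
        self.vis_ctx = nn.Parameter(torch.randn(n_vis_ctx, embed_dim) * 0.02)
        # Tiny frozen encoders; a real setup would reuse a pretrained
        # vision-language model's text and image towers.
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True), 1)
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True), 1)
        for enc in (self.text_encoder, self.image_encoder):
            for p in enc.parameters():
                p.requires_grad_(False)

    def encode_text(self, class_tok):  # class_tok: (C, L, D) token embeddings
        ctx = self.text_ctx.expand(class_tok.size(0), -1, -1)
        return self.text_encoder(torch.cat([ctx, class_tok], dim=1)).mean(dim=1)

    def encode_image(self, patches):   # patches: (B, N, D) patch embeddings
        ctx = self.vis_ctx.expand(patches.size(0), -1, -1)
        return self.image_encoder(torch.cat([ctx, patches], dim=1)).mean(dim=1)

    def forward(self, patches, class_tok):
        img = F.normalize(self.encode_image(patches), dim=-1)
        txt = F.normalize(self.encode_text(class_tok), dim=-1)
        return img @ txt.t()  # (B, C) cosine-similarity logits


# Toy usage: 2 images of 16 patches each, scored against 5 class prompts.
model = DualPromptTuning()
logits = model(torch.randn(2, 16, 128), torch.randn(5, 6, 128))
print(logits.shape)  # (2, 5)
```

An optimizer over this module would only update text_ctx and vis_ctx, which is the point of prompt tuning: the pretrained encoders stay fixed.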
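And for the PromptDet entry, the second promised sketch shows the common text-encoder-as-classifier recipe it builds on: embed one prompt per category name, then score class-agnostic box features by cosine similarity, so the vocabulary can be extended without retraining a detector head. The stub encoder and tokenizer are placeholders, not PromptDet's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def open_vocab_classifier(text_encoder, tokenizer, class_names, box_feats,
                          temperature=0.01):
    """Score class-agnostic box features against an arbitrary vocabulary.

    Sketch of the "classifier generated from the text encoder" idea:
    one prompt per category is embedded, and each box proposal feature
    is classified by cosine similarity. `text_encoder` and `tokenizer`
    are placeholders for parts of a pretrained vision-language model.
    """
    prompts = [f"a photo of a {name}" for name in class_names]
    with torch.no_grad():
        weights = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)  # (C, D)
    box_feats = F.normalize(box_feats, dim=-1)                           # (B, D)
    return box_feats @ weights.t() / temperature                         # (B, C)


# Toy stand-ins so the sketch runs end to end; a real pipeline would use
# an actual pretrained text encoder and box features from a proposal network.
vocab = ["cat", "dog", "zebra"]           # may include unseen categories
stub_encoder = nn.EmbeddingBag(1000, 64)  # mean-pools token embeddings
def stub_tokenizer(prompts):
    return torch.stack([torch.tensor([hash(w) % 1000 for w in p.split()])
                        for p in prompts])

logits = open_vocab_classifier(stub_encoder, stub_tokenizer, vocab,
                               torch.randn(4, 64))
print(logits.shape)  # (4, 3): 4 proposals scored over 3 categories
```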
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.