UniVTG: Towards Unified Video-Language Temporal Grounding
- URL: http://arxiv.org/abs/2307.16715v2
- Date: Fri, 18 Aug 2023 07:56:32 GMT
- Title: UniVTG: Towards Unified Video-Language Temporal Grounding
- Authors: Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick,
Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou
- Abstract summary: Video Temporal Grounding (VTG) aims to ground target clips from videos according to custom language queries.
We propose to Unify the diverse VTG labels and tasks, dubbed UniVTG, along three directions.
Thanks to the unified framework, we are able to unlock temporal grounding pretraining from large-scale diverse labels.
- Score: 52.56732639951834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Temporal Grounding (VTG), which aims to ground target clips from videos
(such as consecutive intervals or disjoint shots) according to custom language
queries (e.g., sentences or words), is key for video browsing on social media.
Most methods in this direction develop task-specific models that are trained
with type-specific labels, such as moment retrieval (time interval) and
highlight detection (worthiness curve), which limits their abilities to
generalize to various VTG tasks and labels. In this paper, we propose to Unify
the diverse VTG labels and tasks, dubbed UniVTG, along three directions:
Firstly, we revisit a wide range of VTG labels and tasks and define a unified
formulation. Based on this, we develop data annotation schemes to create
scalable pseudo supervision. Secondly, we develop an effective and flexible
grounding model capable of addressing each task and making full use of each
label. Lastly, thanks to the unified framework, we are able to unlock temporal
grounding pretraining from large-scale diverse labels and develop stronger
grounding abilities, e.g., zero-shot grounding. Extensive experiments on three
tasks (moment retrieval, highlight detection, and video summarization) across
seven datasets (QVHighlights, Charades-STA, TACoS, Ego4D, YouTube Highlights,
TVSum, and QFVS) demonstrate the effectiveness and flexibility of our proposed
framework. The codes are available at https://github.com/showlab/UniVTG.
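To make the unified formulation concrete: interval-style labels (moment retrieval) and curve-style labels (highlight detection) can both be expressed as clip-level annotations. The sketch below is a minimal Python illustration of that idea only; the class name ClipLabel, its fields, and the conversion helpers are hypothetical assumptions and are not taken from the UniVTG paper or codebase.

# A minimal sketch of one clip-level label space, assuming the video is
# sampled into fixed-length clips. Names and fields below are illustrative
# assumptions, not the paper's actual interface.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class ClipLabel:
    is_foreground: bool                              # clip lies inside a target span
    boundary_offsets: Optional[Tuple[float, float]]  # seconds to span start / end, if foreground
    saliency: float                                  # clip-level "worthiness" score in [0, 1]

def from_moment(interval: Tuple[float, float], clip_times: List[float]) -> List[ClipLabel]:
    """Map a moment-retrieval annotation (one time interval) onto clip labels."""
    start, end = interval
    labels = []
    for t in clip_times:
        inside = start <= t <= end
        offsets = (t - start, end - t) if inside else None
        labels.append(ClipLabel(inside, offsets, 1.0 if inside else 0.0))
    return labels

def from_highlight_curve(scores: List[float], threshold: float = 0.5) -> List[ClipLabel]:
    """Map a highlight-detection annotation (per-clip worthiness curve) onto clip labels."""
    return [ClipLabel(s >= threshold, None, s) for s in scores]

# Example: a 10 s video sampled every 2 s, with a query grounded to the 4-8 s span.
clips = [1.0, 3.0, 5.0, 7.0, 9.0]
print(from_moment((4.0, 8.0), clips))

Under a representation like this, a single grounding head could be trained on interval labels, curve labels, or scalable pseudo labels interchangeably, which is the kind of flexibility the abstract attributes to the unified framework.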
Related papers
- ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models [53.9661582975843]
Video Temporal Grounding aims to ground specific segments within an untrimmed video corresponding to a given natural language query.
Existing VTG methods largely depend on supervised learning and extensive annotated data, which is labor-intensive and prone to human biases.
We present ChatVTG, a novel approach that utilizes Video Dialogue Large Language Models (LLMs) for zero-shot video temporal grounding.
arXiv Detail & Related papers (2024-10-01T08:27:56Z)
- Training-free Video Temporal Grounding using Large-scale Pre-trained Models [41.71055776623368]
Video temporal grounding aims to identify video segments within untrimmed videos that are most relevant to a given natural language query.
Existing video temporal localization models rely on specific datasets for training and have high data collection costs.
We propose a Training-Free Video Temporal Grounding approach that leverages the ability of pre-trained large models.
arXiv Detail & Related papers (2024-08-29T02:25:12Z)
- UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization [83.89550658314741]
Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED), and audio-visual event localization (AVEL).
We present UniAV, a Unified Audio-Visual perception network, to achieve joint learning of TAL, SED and AVEL tasks for the first time.
arXiv Detail & Related papers (2024-04-04T03:28:57Z)
- OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z)
- VLG: General Video Recognition with Web Textual Knowledge [47.3660792813967]
We focus on the general video recognition (GVR) problem of solving different recognition tasks within a unified framework.
By leveraging semantic knowledge from noisy text descriptions crawled from the Internet, we present a unified visual-linguistic framework (VLG).
Our VLG is first pre-trained on video and language datasets to learn a shared feature space, and then devises a flexible bi-modal attention head to collaborate high-level semantic concepts under different settings.
arXiv Detail & Related papers (2022-12-03T15:46:49Z)