VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT
- URL: http://arxiv.org/abs/2403.02076v1
- Date: Mon, 4 Mar 2024 14:22:02 GMT
- Title: VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT
- Authors: Yifang Xu, Yunzhuo Sun, Zien Xie, Benxiang Zhai, and Sidan Du
- Abstract summary: Video temporal grounding (VTG) aims to locate specific temporal segments from an untrimmed video based on a linguistic query.
Most existing VTG models are trained on extensive annotated video-text pairs.
We propose VTG-GPT, a GPT-based method for zero-shot VTG without training or fine-tuning.
- Score: 1.614471032380076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video temporal grounding (VTG) aims to locate specific temporal segments from
an untrimmed video based on a linguistic query. Most existing VTG models are
trained on extensive annotated video-text pairs, a process that not only
introduces human biases from the queries but also incurs significant
computational costs. To tackle these challenges, we propose VTG-GPT, a
GPT-based method for zero-shot VTG without training or fine-tuning. To reduce
prejudice in the original query, we employ Baichuan2 to generate debiased
queries. To lessen redundant information in videos, we apply MiniGPT-v2 to
transform visual content into more precise captions. Finally, we devise the
proposal generator and post-processing to produce accurate segments from
debiased queries and image captions. Extensive experiments demonstrate that
VTG-GPT significantly outperforms SOTA methods in zero-shot settings and
surpasses unsupervised approaches. More notably, it achieves competitive
performance comparable to supervised methods. The code is available on
https://github.com/YoucanBaby/VTG-GPT
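
Because the method requires no training or fine-tuning, the abstract's pipeline amounts to chaining off-the-shelf models with a lightweight scoring step: an LLM debiases the query, a vision-language model captions sampled frames, and a proposal generator plus post-processing turn caption-query matches into segments. The sketch below is a minimal, hypothetical rendering of that flow; the wrapper objects (llm, captioner, scorer), the 0.5 threshold, the number of query variants, and every function name are illustrative assumptions and not the API of the released repository linked above.

```python
# Hypothetical sketch of the tuning-free VTG-GPT-style pipeline described in
# the abstract. All names and parameters are illustrative placeholders.
from typing import List, Tuple


def debias_queries(query: str, llm, num_variants: int = 3) -> List[str]:
    """Ask an instruction-following LLM (the paper uses Baichuan2) to
    rephrase the original query into several debiased variants."""
    prompt = f"Rewrite this query neutrally, removing subjective bias: {query}"
    return [llm.generate(prompt) for _ in range(num_variants)]


def caption_frames(frames, captioner) -> List[str]:
    """Turn sampled video frames into text captions
    (the paper uses MiniGPT-v2 for this step)."""
    return [captioner.describe(frame) for frame in frames]


def generate_proposals(
    captions: List[str],
    queries: List[str],
    scorer,
    fps: float,
    threshold: float = 0.5,
) -> List[Tuple[float, float, float]]:
    """Score each caption against the debiased queries and merge runs of
    high-scoring frames into (start_sec, end_sec, score) proposals."""
    scores = [max(scorer.similarity(c, q) for q in queries) for c in captions]
    proposals, start = [], None
    for i, s in enumerate(scores + [0.0]):  # sentinel closes a trailing run
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            proposals.append((start / fps, i / fps, max(scores[start:i])))
            start = None
    return proposals


def post_process(proposals, top_k: int = 5):
    """Keep only the highest-scoring proposals as the final segments."""
    return sorted(proposals, key=lambda p: p[2], reverse=True)[:top_k]
```

In this reading, the proposal generator is essentially caption-query matching over time and post-processing is a simple top-k filter; the repository linked above contains the authors' actual implementation.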
Related papers
- Learning to Refuse: Refusal-Aware Reinforcement Fine-Tuning for Hard-Irrelevant Queries in Video Temporal Grounding [21.39095611185205]
Video Temporal Grounding (VTG) aims to localize a temporal segment in a video corresponding to a natural language query.
We propose Refusal-Aware Reinforcement Fine-Tuning (RA-RFT) to effectively refuse hard-irrelevant queries in VTG.
arXiv Detail & Related papers (2025-11-28T12:57:36Z)
- VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations [59.40631942092535]
Video temporal grounding (VTG) aims to locate precise segments in videos based on language queries.
Recent Multimodal Large Language Models (MLLMs) have shown promise in tackling VTG through reinforcement learning (RL).
We propose VideoTG-R1, a novel curriculum RL framework with reflected boundary annotations, enabling data-efficient training.
arXiv Detail & Related papers (2025-10-27T14:55:38Z)
- TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding [83.96715649130435]
We introduce TimeExpert, a Mixture-of-Experts (MoE)-based Video-LLM that effectively decomposes VTG tasks.
Our design choices enable precise handling of each subtask, leading to improved event modeling across diverse VTG applications.
arXiv Detail & Related papers (2025-08-03T10:03:58Z)
- Video-GPT via Next Clip Diffusion [14.832916520268105]
GPT has shown remarkable success in natural language processing.
We treat video as a new language for visual world modeling.
We introduce a novel next clip diffusion paradigm for pretraining Video-GPT.
arXiv Detail & Related papers (2025-05-18T16:22:58Z)
- Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization [63.37161241355025]
Video-MSG is a training-free method for T2V generation based on Multimodal planning and Structured noise initialization.
It guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising.
Video-MSG does not need fine-tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models.
arXiv Detail & Related papers (2025-04-11T15:41:43Z)
- Number it: Temporal Grounding Videos like Flipping Manga [45.50403831692172]
Number-Prompt (NumPro) is a method that empowers Vid-LLMs to bridge visual comprehension with temporal grounding.
Treating a video as a sequence of numbered frame images, NumPro transforms VTG into an intuitive process: flipping through manga panels in sequence.
Experiments demonstrate that NumPro significantly boosts VTG performance of top-tier Vid-LLMs without additional computational cost.
arXiv Detail & Related papers (2024-11-15T16:32:34Z)
- ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models [53.9661582975843]
Video Temporal Grounding aims to ground specific segments within an untrimmed video corresponding to a given natural language query.
Existing VTG methods largely depend on supervised learning and extensive annotated data, which is labor-intensive and prone to human biases.
We present ChatVTG, a novel approach that utilizes Video Dialogue Large Language Models (LLMs) for zero-shot video temporal grounding.
arXiv Detail & Related papers (2024-10-01T08:27:56Z)
- AutoTVG: A New Vision-language Pre-training Paradigm for Temporal Video Grounding [90.21119832796136]
Temporal Video Grounding aims to localize a moment from an untrimmed video given the language description.
To avoid the drawbacks of the traditional paradigm, we propose AutoTVG, a new vision-language pre-training paradigm for TVG.
arXiv Detail & Related papers (2024-06-11T09:31:37Z)
- VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding [7.907951246007355]
Video Temporal Grounding (VTG) focuses on accurately identifying event timestamps within a particular video based on a linguistic query.
Video Large Language Models (video LLMs) have made significant progress in understanding video content, but they often face challenges in accurately pinpointing timestamps within videos.
We propose a specially designed video LLM model for VTG tasks, VTG-LLM, which effectively integrates timestamp knowledge into visual tokens.
arXiv Detail & Related papers (2024-05-22T06:31:42Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with dedicatedly designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval [73.82017200889906]
Text-video retrieval is a critical multi-modal task to find the most relevant video for a text query.
We propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention.
In contrast to previous prompt tuning methods, we employ the shared latent space to generate local-level text and frame prompts.
arXiv Detail & Related papers (2024-01-19T09:58:06Z)
- UniVTG: Towards Unified Video-Language Temporal Grounding [52.56732639951834]
Video Temporal Grounding (VTG) aims to ground target clips from videos according to custom language queries.
We propose to Unify the diverse VTG labels and tasks, dubbed UniVTG, along three directions.
Thanks to the unified framework, we are able to unlock temporal grounding pretraining from large-scale diverse labels.
arXiv Detail & Related papers (2023-07-31T14:34:49Z)
- Video Moment Retrieval from Text Queries via Single Frame Annotation [65.92224946075693]
Video moment retrieval aims at finding the start and end timestamps of a moment described by a given natural language query.
Fully supervised methods need complete temporal boundary annotations to achieve promising results.
We propose a new paradigm called "glance annotation".
arXiv Detail & Related papers (2022-04-20T11:59:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.