TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM
- URL: http://arxiv.org/abs/2503.13377v1
- Date: Mon, 17 Mar 2025 17:04:20 GMT
- Title: TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM
- Authors: Ye Wang, Boshen Xu, Zihao Yue, Zihan Xiao, Ziheng Wang, Liang Zhang, Dingyi Yang, Wenxuan Wang, Qin Jin
- Abstract summary: We introduce TimeZero, a reasoning-guided LVLM designed for the temporal video grounding (TVG) task. TimeZero tackles this challenge by extending the inference process, enabling the model to reason about video-language relationships solely through reinforcement learning. We conduct experiments on two benchmarks, where TimeZero achieves state-of-the-art performance on Charades-STA.
- Score: 63.126150646467295
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce TimeZero, a reasoning-guided LVLM designed for the temporal video grounding (TVG) task. This task requires precisely localizing relevant video segments within long videos based on a given language query. TimeZero tackles this challenge by extending the inference process, enabling the model to reason about video-language relationships solely through reinforcement learning. To evaluate the effectiveness of TimeZero, we conduct experiments on two benchmarks, where TimeZero achieves state-of-the-art performance on Charades-STA. Code is available at https://github.com/www-Ye/TimeZero.
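To make the TVG objective concrete, the sketch below computes the temporal IoU between a predicted (start, end) span and the ground-truth segment, the quantity a reinforcement-learning reward for this task would typically be built from. This is a minimal illustration under that assumption; the function names and span format are hypothetical, not the released TimeZero code.

```python
# Minimal sketch (assumption, not the authors' implementation): a temporal-IoU
# reward of the kind an RL-trained grounding model could be optimized against.

def temporal_iou(pred_span, gt_span):
    """IoU between a predicted (start, end) segment and the ground truth, in seconds."""
    pred_start, pred_end = pred_span
    gt_start, gt_end = gt_span
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = max(pred_end, gt_end) - min(pred_start, gt_start)
    return intersection / union if union > 0 else 0.0


def grounding_reward(pred_span, gt_span):
    """Scalar reward for one rollout: higher temporal overlap with the target segment."""
    return temporal_iou(pred_span, gt_span)


# Example: prediction [12.0 s, 30.5 s] against ground truth [10.0 s, 28.0 s].
print(round(grounding_reward((12.0, 30.5), (10.0, 28.0)), 3))  # 0.78
```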
Related papers
- TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos [50.04992164981131]
Temporal localization in untrimmed videos is crucial for video understanding but remains challenging. This task encompasses several subtasks, including temporal action localization, temporal video grounding, moment retrieval, and generic event boundary detection. We propose TimeLoc, a unified end-to-end framework for timestamp localization that can handle multiple tasks.
arXiv Detail & Related papers (2025-03-09T09:11:26Z) - Enhancing Temporal Modeling of Video LLMs via Time Gating [38.86742466948778]
Video Large Language Models (Video LLMs) have achieved impressive performance on video-and-language tasks, such as video question answering.
Most existing Video LLMs neglect temporal information in video data, which hinders temporally-aware video understanding.
We propose a Time Gating Video LLM (TG-Vid) designed to enhance temporal modeling through a novel Time Gating module (TG).
arXiv Detail & Related papers (2024-10-08T06:21:29Z) - LITA: Language Instructed Temporal-Localization Assistant [71.68815100776278]
We introduce time tokens that encode timestamps relative to the video length to better represent time in videos (a minimal sketch of such an encoding appears after this list).
We also introduce SlowFast tokens in the architecture to capture temporal information at fine temporal resolution.
We show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs.
arXiv Detail & Related papers (2024-03-27T22:50:48Z) - VTimeLLM: Empower LLM to Grasp Video Moments [43.51980030572101]
Large language models (LLMs) have shown remarkable text understanding capabilities.
However, existing Video LLMs can only provide a coarse description of the entire video.
We propose VTimeLLM, a novel Video LLM for fine-grained video moment understanding.
arXiv Detail & Related papers (2023-11-30T10:49:56Z) - Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos [60.86880787242561]
Video temporal grounding aims to pinpoint a video segment that matches the query description.
We propose an end-to-end framework for fast temporal grounding, which is able to model an hours-long video with one-time network execution.
Our method significantly outperforms the state of the art, achieving 14.6× / 102.8× higher efficiency, respectively.
arXiv Detail & Related papers (2023-03-15T03:54:43Z) - Language-free Training for Zero-shot Video Grounding [50.701372436100684]
Video grounding aims to localize the time interval that matches a language query by understanding the text and video simultaneously.
One of the most challenging issues is the extremely time- and cost-consuming collection of annotations.
We present a simple yet novel training framework for video grounding in the zero-shot setting.
arXiv Detail & Related papers (2022-10-24T06:55:29Z) - Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding [148.46348699343991]
Temporal language grounding aims to localize a video segment in an untrimmed video based on a natural language description.
Most existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through an MIL-based framework.
We propose a novel candidate-free framework, the Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG.
arXiv Detail & Related papers (2022-10-21T13:10:27Z)
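To ground the time-token idea from the LITA entry above, here is a minimal sketch, assuming timestamps are quantized into a fixed number of tokens positioned relative to the video length; the token count and helper names are illustrative assumptions rather than LITA's released code.

```python
# Minimal sketch (assumption, not LITA's code): quantizing absolute timestamps
# into relative "time tokens" so that time is expressed as a fraction of the video.

NUM_TIME_TOKENS = 100  # assumed number of time tokens in the vocabulary


def timestamp_to_token(t_seconds: float, video_length: float) -> int:
    """Map an absolute timestamp to a time-token index in [0, NUM_TIME_TOKENS - 1]."""
    rel = min(max(t_seconds / video_length, 0.0), 1.0)  # position relative to video length
    return min(int(rel * NUM_TIME_TOKENS), NUM_TIME_TOKENS - 1)


def token_to_timestamp(token: int, video_length: float) -> float:
    """Decode a time-token index back to an approximate timestamp (bin center)."""
    return (token + 0.5) / NUM_TIME_TOKENS * video_length


# Example: second 45 of a 180-second video maps to token 25; decoding gives ~45.9 s.
print(timestamp_to_token(45.0, 180.0))            # 25
print(round(token_to_timestamp(25, 180.0), 1))    # 45.9
```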