TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding
- URL: http://arxiv.org/abs/2508.01699v1
- Date: Sun, 03 Aug 2025 10:03:58 GMT
- Title: TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding
- Authors: Zuhao Yang, Yingchen Yu, Yunqing Zhao, Shijian Lu, Song Bai
- Abstract summary: We introduce TimeExpert, a Mixture-of-Experts (MoE)-based Video-LLM that effectively decomposes VTG tasks. Our design choices enable precise handling of each subtask, leading to improved event modeling across diverse VTG applications.
- Score: 83.96715649130435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Temporal Grounding (VTG) aims to precisely identify video event segments in response to textual queries. The outputs of VTG tasks manifest as sequences of events, each defined by precise timestamps, saliency scores, and textual descriptions. Despite recent advances, a fundamental limitation persists in existing Video Large Language Models (Video-LLMs): they process all task tokens through identical and static pathways, failing to recognize that temporal localization, saliency assessment, and textual generation represent fundamentally distinct tasks requiring specialized processing. To address this, we introduce TimeExpert, a Mixture-of-Experts (MoE)-based Video-LLM that effectively decomposes VTG tasks by dynamically routing task-specific tokens (e.g., timestamps, saliency scores) to specialized experts, with increased computational efficiency. Our design choices enable precise handling of each subtask, leading to improved event modeling across diverse VTG applications. Extensive experiments demonstrate that TimeExpert consistently achieves state-of-the-art performance on various VTG tasks such as Dense Video Captioning, Moment Retrieval, and Video Highlight Detection.
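The abstract describes the key mechanism, dynamic routing of task-specific tokens to specialized experts, in prose only. Below is a minimal sketch of that idea: a learned gate assigns each token to one of three expert FFNs (e.g., timestamp, saliency, text). The module name, the top-1 gate, and the three-way split are illustrative assumptions, not TimeExpert's actual implementation.

```python
# Minimal sketch of task-aware top-1 MoE routing in the spirit of TimeExpert.
# VTGMoELayer and the three-expert split are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VTGMoELayer(nn.Module):
    def __init__(self, d_model: int = 512, n_experts: int = 3):
        super().__init__()
        # One expert FFN per assumed token type: timestamp, saliency, text.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)  # learned token-to-expert router

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); each token is dispatched to its top-1 expert,
        # so only one expert FFN runs per token -- the efficiency claim above.
        probs = F.softmax(self.gate(x), dim=-1)   # (B, S, n_experts)
        weight, expert_ids = probs.max(dim=-1)    # top-1 gate value and expert index
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_ids == e                # tokens routed to expert e
            if mask.any():
                out[mask] = expert(x[mask])
        # Scaling by the gate value keeps the router differentiable.
        return out * weight.unsqueeze(-1)

x = torch.randn(2, 16, 512)    # interleaved timestamp/saliency/text tokens
print(VTGMoELayer()(x).shape)  # torch.Size([2, 16, 512])
```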
Related papers
- VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding [48.745013691038295]
VideoExpert is a general-purpose MLLM suitable for several temporal-sensitive video tasks. The Temporal Expert is responsible for modeling time sequences and performing temporal grounding, while the Spatial Expert focuses on content detail analysis and instruction following. By offloading temporal grounding from content generation, VideoExpert prevents text-pattern biases in timestamp predictions.
arXiv Detail & Related papers (2025-04-10T07:33:39Z)
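The offloading idea in the VideoExpert summary can be made concrete with a small sketch: timestamps come from a dedicated regression head rather than from digit-token generation, so they cannot inherit text-generation biases. The class and head names below are illustrative assumptions, not VideoExpert's published design.

```python
# Sketch: decouple temporal grounding from content generation (assumed design).
import torch
import torch.nn as nn

class DualExpertHeads(nn.Module):
    def __init__(self, d_model: int = 512, vocab_size: int = 32000):
        super().__init__()
        self.lm_head = nn.Linear(d_model, vocab_size)  # content path: text tokens
        self.temporal_head = nn.Sequential(            # grounding path: regress a span
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Linear(d_model, 2), nn.Sigmoid()
        )

    def forward(self, h: torch.Tensor):
        # h: (batch, d_model) hidden state for a grounding query.
        span = self.temporal_head(h)    # normalized (start, end) in [0, 1]
        logits = self.lm_head(h)        # description tokens, generated separately
        return span, logits

span, _ = DualExpertHeads()(torch.randn(1, 512))
print(span)  # multiply by the video duration to get seconds
```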
- TRACE: Temporal Grounding Video LLM via Causal Event Modeling [6.596327795743185]
Video Temporal Grounding (VTG) is a crucial capability for video understanding models and plays a vital role in downstream tasks such as video browsing and editing. Current video LLMs rely exclusively on natural language generation and lack the ability to model the clear structure inherent in videos. This paper introduces a causal event modeling framework that represents video LLM outputs as sequences of events and predicts the current event from previous events, video inputs, and textual instructions. We propose a novel task-interleaved video LLM called TRACE to implement the causal event modeling framework effectively in practice.
arXiv Detail & Related papers (2024-10-08T02:46:30Z)
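TRACE's causal factorization can be written directly as p(e_k | e_<k, video, instruction), generating events autoregressively. The sketch below assumes an event structure of timestamps, saliency, and caption; decode_event is a stub standing in for one model forward pass.

```python
# Sketch of causal event modeling: predict event e_k from e_<k, video, instruction.
# The Event fields and decode_event() are assumptions based on the abstract.
from dataclasses import dataclass

@dataclass
class Event:
    start: float      # segment start, seconds
    end: float        # segment end, seconds
    saliency: float   # relevance score for the query
    caption: str      # textual description of the event

def decode_event(video_feats, instruction: str, history: list[Event]) -> Event | None:
    """One model forward pass: decode e_k given e_<k (stub for illustration)."""
    return None  # a real model returns the next Event, or None at end-of-sequence

def generate_events(video_feats, instruction: str, max_events: int = 16) -> list[Event]:
    events: list[Event] = []
    for _ in range(max_events):
        nxt = decode_event(video_feats, instruction, events)  # conditioned on e_<k
        if nxt is None:          # end-of-sequence: no further events
            break
        events.append(nxt)
    return events
```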
- EA-VTR: Event-Aware Video-Text Retrieval [97.30850809266725]
The Event-Aware Video-Text Retrieval (EA-VTR) model achieves strong video-text retrieval through superior awareness of video events. EA-VTR efficiently encodes frame-level and video-level visual representations simultaneously, enabling cross-modal alignment of both detailed event content and complex event temporal structure.
arXiv Detail & Related papers (2024-07-10T09:09:58Z)
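The EA-VTR summary does not spell out the training objective, but dual-granularity alignment of this kind is commonly trained with an InfoNCE-style contrastive loss applied at both the frame and video level; the sketch below shows that generic pattern under that assumption.

```python
# Sketch: dual-granularity video-text contrastive alignment (assumed objective).
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # a, b: (batch, dim) normalized embeddings; row i of a matches row i of b.
    logits = a @ b.t() / tau
    return F.cross_entropy(logits, torch.arange(a.size(0)))

def dual_granularity_loss(frame_emb, video_emb, text_emb):
    # frame_emb: (B, T, D) per-frame; video_emb, text_emb: (B, D) pooled.
    frame_pooled = F.normalize(frame_emb.mean(dim=1), dim=-1)  # event-detail view
    video_emb = F.normalize(video_emb, dim=-1)                 # global temporal view
    text_emb = F.normalize(text_emb, dim=-1)
    return info_nce(frame_pooled, text_emb) + info_nce(video_emb, text_emb)

loss = dual_granularity_loss(
    torch.randn(4, 8, 256), torch.randn(4, 256), torch.randn(4, 256))
print(loss)
```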
- VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding [10.548950058205833]
Video Temporal Grounding (VTG) strives to accurately pinpoint event timestamps in a given video using linguistic queries. Video Large Language Models (video LLMs) can handle multiple tasks concurrently in a zero-shot manner. We introduce VTG-LLM, a model designed to enhance video LLMs' timestamp localization abilities.
arXiv Detail & Related papers (2024-05-22T06:31:42Z)
- UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization [83.89550658314741]
Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED), and audio-visual event localization (AVEL). We present UniAV, a Unified Audio-Visual perception network, to achieve joint learning of the TAL, SED, and AVEL tasks for the first time.
arXiv Detail & Related papers (2024-04-04T03:28:57Z)
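Joint learning of the three localization tasks is most naturally realized as a shared audio-visual backbone with one head per task; the sketch below assumes that structure (the fusion and head shapes are illustrative, not UniAV's exact design).

```python
# Sketch: one audio-visual backbone, three task heads (TAL / SED / AVEL).
import torch
import torch.nn as nn

class UnifiedAVPerception(nn.Module):
    def __init__(self, d: int = 256, n_actions: int = 20,
                 n_sounds: int = 10, n_av_events: int = 28):
        super().__init__()
        self.fuse = nn.Linear(2 * d, d)             # naive audio-visual fusion (assumed)
        self.tal_head = nn.Linear(d, n_actions)     # temporal action localization
        self.sed_head = nn.Linear(d, n_sounds)      # sound event detection
        self.avel_head = nn.Linear(d, n_av_events)  # audio-visual event localization

    def forward(self, vis: torch.Tensor, aud: torch.Tensor):
        # vis, aud: (batch, time, d) temporally aligned per-snippet features.
        h = torch.relu(self.fuse(torch.cat([vis, aud], dim=-1)))
        # Per-snippet class scores; thresholding/grouping yields event segments.
        return self.tal_head(h), self.sed_head(h), self.avel_head(h)

tal, sed, avel = UnifiedAVPerception()(torch.randn(2, 32, 256), torch.randn(2, 32, 256))
print(tal.shape, sed.shape, avel.shape)
```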
- UniVTG: Towards Unified Video-Language Temporal Grounding [52.56732639951834]
Video Temporal Grounding (VTG) aims to ground target clips from videos according to custom language queries. We propose UniVTG, which unifies the diverse VTG labels and tasks along three directions. Thanks to the unified framework, we are able to unlock temporal grounding pretraining from large-scale, diverse labels.
arXiv Detail & Related papers (2023-07-31T14:34:49Z)
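One way to picture "unifying diverse VTG labels" is a single per-clip record that moment retrieval, highlight detection, and summarization annotations can all be mapped onto. The schema below is an illustrative guess at such a record, not UniVTG's published formulation.

```python
# Sketch: a unified per-clip VTG label that diverse task annotations map onto.
# Field names are illustrative assumptions; see the UniVTG paper for its formulation.
from dataclasses import dataclass

@dataclass
class ClipLabel:
    is_foreground: bool   # does this clip fall inside the queried moment?
    offset_start: float   # signed distance (s) from clip center to moment start
    offset_end: float     # signed distance (s) from clip center to moment end
    saliency: float       # query-relevance score, e.g., for highlight detection

# Moment-retrieval intervals fill the flag and offsets; highlight-detection
# scores fill saliency -- so heterogeneous labels land in one schema.
print(ClipLabel(is_foreground=True, offset_start=-1.5, offset_end=2.0, saliency=0.8))
```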
- MINOTAUR: Multi-task Video Grounding From Multimodal Queries [70.08973664126873]
We present a single, unified model for tackling query-based video understanding in long-form videos. In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark.
arXiv Detail & Related papers (2023-02-16T04:00:03Z)