Related papers: VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

URL: http://arxiv.org/abs/2405.13382v2
Date: Mon, 1 Jul 2024 06:14:04 GMT
Title: VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding
Authors: Yongxin Guo, Jingyu Liu, Mingda Li, Xiaoying Tang, Xi Chen, Bo Zhao,
Abstract summary: Video Temporal Grounding (VTG) focuses on accurately identifying event timestamps within a particular video based on a linguistic query. Video Large Language Models (video LLMs) have made significant progress in understanding video content, but they often face challenges in accurately pinpointing timestamps within videos. We propose a specially designed video LLM model for VTG tasks, VTG-LLM, which effectively integrates timestamp knowledge into visual tokens.
Score: 7.907951246007355
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video Temporal Grounding (VTG) focuses on accurately identifying event timestamps within a particular video based on a linguistic query, playing a vital role in downstream tasks such as video browsing and editing. While Video Large Language Models (video LLMs) have made significant progress in understanding video content, they often face challenges in accurately pinpointing timestamps within videos, which limits their performance on VTG tasks. Therefore, to improve video LLMs' ability to effectively locate timestamps, we argue that two critical aspects need to be enhanced. First, it is essential to have high-quality instructional tuning datasets that encompass mainstream VTG tasks. Second, directly incorporating timestamp knowledge into video LLMs is crucial, as it enables models to efficiently comprehend timestamp information. To address these needs, we first introduce VTG-IT-120K, a high-quality and comprehensive instruction tuning dataset that covers VTG tasks such as moment retrieval, dense video captioning, video summarization, and video highlight detection. Furthermore, we propose a specially designed video LLM model for VTG tasks, VTG-LLM, which (1) effectively integrates timestamp knowledge into visual tokens; (2) incorporates absolute-time tokens that specifically handle timestamp knowledge, thereby avoiding concept shifts; and (3) introduces a lightweight, high-performance slot-based token compression method to facilitate the sampling of more video frames. Comprehensive experiments showcase the superior performance of VTG-LLM in comparison to other video LLM methods across various VTG tasks. Our code and datasets are available at \url{https://github.com/gyxxyg/VTG-LLM}.

Related papers

TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs [81.78017865436816]
We present TimeLens, a systematic investigation into building MLLMs with strong video temporal grounding ability.<n>We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks.<n>We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset.
arXiv Detail & Related papers (2025-12-16T18:59:58Z)
A Survey on Video Temporal Grounding with Multimodal Large Language Model [107.24431595873808]
Recent advancement in temporal grounding (VTG) has significantly enhanced fine-grained video understanding.<n>With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods.<n>Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce.
arXiv Detail & Related papers (2025-08-07T08:52:11Z)
TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding [83.96715649130435]
We introduce TimeExpert, a Mixture-of-Experts (MoE)-based Video-LLM that effectively decomposes VTG tasks.<n>Our design choices enable precise handling of each subtask, leading to improved event modeling across diverse VTG applications.
arXiv Detail & Related papers (2025-08-03T10:03:58Z)
Universal Video Temporal Grounding with Generative Multi-modal Large Language Models [59.781211641591405]
This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries.<n>We propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs)<n>Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries.
arXiv Detail & Related papers (2025-06-23T17:53:18Z)
VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding [48.745013691038295]
VideoExpert is a general-purpose MLLM suitable for several temporal-sensitive video tasks. The Temporal Expert is responsible for modeling time sequences and performing temporal grounding. The Spatial Expert focuses on content detail analysis and instruction following. By offloading temporal grounding from content generation, VideoExpert prevents text pattern biases in timestamp predictions.
arXiv Detail & Related papers (2025-04-10T07:33:39Z)
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension [83.00346826110041]
Video-RAG is a training-free and cost-effective pipeline that employs visually-aligned auxiliary texts to help facilitate cross-modality alignment. Our model demonstrates superior performance over proprietary models like Gemini-1.5-Pro and GPT-4o when utilized with a 72B model.
arXiv Detail & Related papers (2024-11-20T07:44:34Z)
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos [58.765796160750504]
VideoGLaMM is a new model for fine-grained pixel-level grounding in videos based on user-provided textual inputs. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. Experimental results show that our model consistently outperforms existing approaches across all three tasks.
arXiv Detail & Related papers (2024-11-07T17:59:27Z)
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning [42.928144657587325]
This paper proposes TimeSuite, a collection of new designs to adapt the existing short-form video MLLMs for long video understanding. TimeSuite provides a successful solution to enhance the long video understanding capability of short-form MLLM. In addition, we introduce the TimePro, a comprehensive grounding-centric instruction dataset composed of 9 tasks and 349k high-quality grounded annotations.
arXiv Detail & Related papers (2024-10-25T17:19:55Z)
Enhancing Temporal Modeling of Video LLMs via Time Gating [38.86742466948778]
Video Large Language Models (Video LLMs) have achieved impressive performance on video-and-language tasks, such as video question answering. Most existing Video LLMs neglect temporal information in video data, leading to struggles with temporal-aware video understanding. We propose a Time Gating Video LLM (TG-Vid) designed to enhance temporal modeling through a novel Time Gating module (TG)
arXiv Detail & Related papers (2024-10-08T06:21:29Z)
TRACE: Temporal Grounding Video LLM via Causal Event Modeling [6.596327795743185]
Video Temporal Grounding (VTG) is a crucial capability for video understanding models and plays a vital role in downstream tasks such as video browsing and editing. Current video LLMs rely exclusively on natural language generation, lacking the ability to model the clear structure inherent in videos. This paper introduces causal event modeling framework, which represents videos as sequences of events, and predict the current event using previous events, video inputs, and textural instructions. We propose a novel task-interleaved video LLM called TRACE to effectively implement the causal event modeling framework in practice.
arXiv Detail & Related papers (2024-10-08T02:46:30Z)
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models [53.235170710385006]
We introduce Grounded-VideoLLM, a novel Video-LLM adept at perceiving and reasoning over specific video moments in a fine-grained manner. We sharpen our model by incorporating (1) an additional temporal stream to encode the relationships between frames and (2) discrete temporal tokens enriched with specific time knowledge. In experiments, Grounded-VideoLLM excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA.
arXiv Detail & Related papers (2024-10-04T10:04:37Z)
ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models [53.9661582975843]
Video Temporal Grounding aims to ground specific segments within an untrimmed video corresponding to a given natural language query. Existing VTG methods largely depend on supervised learning and extensive annotated data, which is labor-intensive and prone to human biases. We present ChatVTG, a novel approach that utilizes Video Dialogue Large Language Models (LLMs) for zero-shot video temporal grounding.
arXiv Detail & Related papers (2024-10-01T08:27:56Z)
LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding. We encode video representations that incorporate both local and global information. Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
VTimeLLM: Empower LLM to Grasp Video Moments [43.51980030572101]
Large language models (LLMs) have shown remarkable text understanding capabilities. Video LLMs can only provide a coarse description of the entire video. We propose VTimeLLM, a novel Video LLM for fine-grained video moment understanding.
arXiv Detail & Related papers (2023-11-30T10:49:56Z)
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning [62.51232333352754]
VideoDirectorGPT is a novel framework for consistent multi-scene video generation. Our proposed framework substantially improves layout and movement control in both single- and multi-scene video generation.
arXiv Detail & Related papers (2023-09-26T17:36:26Z)
UniVTG: Towards Unified Video-Language Temporal Grounding [52.56732639951834]
Video Temporal Grounding (VTG) aims to ground target clips from videos according to custom language queries. We propose to Unify the diverse VTG labels and tasks, dubbed UniVTG, along three directions. Thanks to the unified framework, we are able to unlock temporal grounding pretraining from large-scale diverse labels.
arXiv Detail & Related papers (2023-07-31T14:34:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.