Universal Video Temporal Grounding with Generative Multi-modal Large Language Models
- URL: http://arxiv.org/abs/2506.18883v1
- Date: Mon, 23 Jun 2025 17:53:18 GMT
- Title: Universal Video Temporal Grounding with Generative Multi-modal Large Language Models
- Authors: Zeqian Li, Shangzhe Di, Zhonghua Zhai, Weilin Huang, Yanfeng Wang, Weidi Xie
- Abstract summary: This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries. We propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs). Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries.
- Score: 59.781211641591405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries (e.g., questions or descriptions). Unlike existing methods that are often limited to specific video domains or durations, we propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs). Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries. The key contributions include: (i) We consider steering strong MLLMs for temporal grounding in videos. To enable precise timestamp outputs, we incorporate temporal information by interleaving timestamp tokens with video tokens. (ii) By training the model to handle videos with different input granularities through adaptive frame scaling, our approach achieves robust temporal grounding for both short and long videos. (iii) Comprehensive experiments show that UniTime outperforms state-of-the-art approaches in both zero-shot and dataset-specific finetuned settings across five public temporal grounding benchmarks. (iv) When employed as a preliminary moment retriever for long-form video question-answering (VideoQA), UniTime significantly improves VideoQA accuracy, highlighting its value for complex video understanding tasks.
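The abstract's contribution (i), interleaving timestamp tokens with video tokens so the MLLM can emit precise timestamps, can be illustrated with a minimal sketch. All names here (the token format, the helper function, the placeholder visual tokens) are illustrative assumptions, not the authors' implementation:

```python
# Hedged sketch of interleaving timestamp tokens with per-frame visual
# tokens, in the spirit of UniTime's contribution (i). The "<Xs>" token
# format and the placeholder visual tokens are assumptions for illustration.

def interleave_timestamps(frame_tokens, timestamps):
    """Prepend a textual timestamp token before each frame's visual tokens.

    frame_tokens: list of per-frame token lists (e.g. patch tokens per frame)
    timestamps:   frame times in seconds, same length as frame_tokens
    """
    sequence = []
    for t, tokens in zip(timestamps, frame_tokens):
        sequence.append(f"<{t:.1f}s>")  # timestamp token for this frame
        sequence.extend(tokens)         # the frame's visual tokens follow
    return sequence

# Example: 3 frames sampled at 0.0s, 1.0s, 2.0s, two visual tokens each
seq = interleave_timestamps(
    [["v0a", "v0b"], ["v1a", "v1b"], ["v2a", "v2b"]],
    [0.0, 1.0, 2.0],
)
```

With timestamps in the token stream, the model can ground a query by generating timestamp tokens directly, rather than regressing continuous values.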
Related papers
- Tempo-R0: A Video-MLLM for Temporal Video Grounding through Efficient Temporal Sensing Reinforcement Learning [6.9627404612894335]
Temporal Video Grounding (TVG) requires pinpointing relevant temporal segments from a video based on a language query. We propose Tempo-R0: a Video Multimodal Large Language Model (Video-MLLM) for the temporal video grounding task. Our method achieves a notable advantage of around 3.5% over SOTA solutions on the original QVHighlights test benchmark.
arXiv Detail & Related papers (2025-07-07T06:51:40Z)
- VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing in video sequences. A novel temporal mask fusion module employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z)
- Vidi: Large Multimodal Models for Video Understanding and Editing [33.56852569192024]
We introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing scenarios. The first release focuses on temporal retrieval: identifying the time ranges within the input videos that correspond to a given text query. We also present the VUE-TR benchmark, which introduces five key advancements.
arXiv Detail & Related papers (2025-04-22T08:04:45Z)
- Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models [53.235170710385006]
We introduce Grounded-VideoLLM, a novel Video-LLM adept at perceiving and reasoning over specific video moments in a fine-grained manner.
We sharpen our model by incorporating (1) an additional temporal stream to encode the relationships between frames and (2) discrete temporal tokens enriched with specific time knowledge.
In experiments, Grounded-VideoLLM excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA.
arXiv Detail & Related papers (2024-10-04T10:04:37Z)
- SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses [58.488812405557]
Video grounding aims to localize specific natural language queries in an untrimmed video.
We present a large-scale video grounding dataset named SynopGround.
We introduce a more complex setting of video grounding dubbed Multi-Paragraph Video Grounding (MPVG).
arXiv Detail & Related papers (2024-08-03T05:35:13Z)
- Chrono: A Simple Blueprint for Representing Time in MLLMs [34.036784478999245]
We investigate the challenge of contextual and temporal comprehension in video-language models by exploring the task of temporal localization in videos. We introduce Chrono, a universal sequence blueprint that can be applied to an image-text pretrained MLLM. We achieve a new SOTA in moment retrieval on the most widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions, and in grounded video question answering on NeXT-GQA.
arXiv Detail & Related papers (2024-06-26T06:59:09Z)
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
- TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding [20.037781644877388]
TimeChat is a time-sensitive multimodal large language model specifically designed for long video understanding.
Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame, and (2) a sliding video Q-Former that produces a video token sequence of varying lengths.
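The sliding video Q-Former described above can be sketched as a windowed compression of frame features into a token sequence whose length scales with video duration. This is a minimal stand-in, not TimeChat's architecture: the function name, window sizes, and the mean-pooling used in place of learned cross-attention queries are all assumptions:

```python
# Hedged sketch of sliding-window video-token compression, in the spirit of
# TimeChat's sliding video Q-Former. Mean pooling stands in for the learned
# query cross-attention; all names and parameters are illustrative.

def sliding_window_tokens(frame_features, window=4, queries_per_window=2):
    """Compress each window of frame features into a fixed number of tokens,
    so the total token count grows with video length rather than being fixed."""
    tokens = []
    for start in range(0, len(frame_features), window):
        chunk = frame_features[start:start + window]
        # Stand-in for Q-Former cross-attention: pool the window, then emit
        # queries_per_window outputs for it.
        pooled = sum(chunk) / len(chunk)
        tokens.extend([pooled] * queries_per_window)
    return tokens

feats = [float(i) for i in range(8)]  # 8 scalar "frame features" -> 2 windows
toks = sliding_window_tokens(feats)   # 2 windows x 2 queries = 4 tokens
```

A longer video yields more windows and hence a longer token sequence, which is the property the variable-length design targets.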
arXiv Detail & Related papers (2023-12-04T17:09:52Z)
- Revisiting the "Video" in Video-Language Understanding [56.15777956496518]
We propose the atemporal probe (ATP), a new model for video-language analysis.
We characterize the limitations and potential of current video-language benchmarks.
We show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
arXiv Detail & Related papers (2022-06-03T17:57:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.