DisTime: Distribution-based Time Representation for Video Large Language Models
- URL: http://arxiv.org/abs/2505.24329v2
- Date: Thu, 31 Jul 2025 03:03:18 GMT
- Title: DisTime: Distribution-based Time Representation for Video Large Language Models
- Authors: Yingsen Zeng, Zepeng Huang, Yujie Zhong, Chengjian Feng, Jie Hu, Lin Ma, Yang Liu
- Abstract summary: DisTime is a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space. DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks.
- Score: 23.176698643825123
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space and incorporates a Distribution-based Time Decoder that generates temporal probability distributions, effectively mitigating boundary ambiguities and maintaining temporal continuity. Additionally, the Distribution-based Time Encoder re-encodes timestamps to provide time markers for Video-LLMs. To overcome temporal granularity limitations in existing datasets, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. This leads to the creation of InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, surpassing ActivityNet-Caption by 55 times. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance in Video QA tasks. Code and data are released at https://github.com/josephzpng/DisTime.
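A minimal sketch may help make the mechanism the abstract describes concrete. The snippet below shows one way a distribution-based time decoder/encoder pair could be wired up in Python/PyTorch: the hidden state of a learnable time token is projected to a probability distribution over normalized timestamps, a continuous start/end pair is read out as the expectation of that distribution, and the encoder maps timestamps back into embeddings that act as time markers. The class names, bin count, hidden size, and soft-argmax readout are illustrative assumptions, not the released DisTime implementation (see the repository linked above for the authors' code).

```python
# Minimal sketch (not the authors' code): one possible distribution-based
# time decoder/encoder pair. Sizes and readout are illustrative assumptions.
import torch
import torch.nn as nn


class DistTimeDecoder(nn.Module):
    """Maps the hidden state of a learnable time token to probability
    distributions over normalized timestamps, then reads out continuous
    start/end times as the expectation of those distributions."""

    def __init__(self, hidden_dim: int = 4096, num_bins: int = 100):
        super().__init__()
        self.num_bins = num_bins
        # Two heads: one distribution for the start time, one for the end time.
        self.to_logits = nn.Linear(hidden_dim, 2 * num_bins)
        # Bin centers on [0, 1] (time normalized by video duration).
        self.register_buffer("centers", torch.linspace(0.0, 1.0, num_bins))

    def forward(self, time_token_state: torch.Tensor) -> torch.Tensor:
        # time_token_state: (batch, hidden_dim)
        logits = self.to_logits(time_token_state).view(-1, 2, self.num_bins)
        probs = logits.softmax(dim=-1)              # (batch, 2, num_bins)
        # Expectation over bins -> continuous, differentiable timestamps.
        return (probs * self.centers).sum(dim=-1)   # (batch, 2) in [0, 1]


class DistTimeEncoder(nn.Module):
    """Re-encodes a normalized (start, end) pair into an embedding that can
    be spliced into the LLM input sequence as a time marker."""

    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(2, hidden_dim), nn.GELU(),
                                  nn.Linear(hidden_dim, hidden_dim))

    def forward(self, start_end: torch.Tensor) -> torch.Tensor:
        # start_end: (batch, 2), normalized to [0, 1]
        return self.proj(start_end)


decoder = DistTimeDecoder()
encoder = DistTimeEncoder()
span = decoder(torch.randn(1, 4096))   # e.g. tensor([[0.31, 0.58]])
marker = encoder(span)                 # (1, 4096) time-marker embedding
```

Reading out time as the expectation of a distribution keeps predictions continuous and differentiable while still expressing boundary uncertainty, which is the property the abstract attributes to the Distribution-based Time Decoder.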
Related papers
- Universal Video Temporal Grounding with Generative Multi-modal Large Language Models [59.781211641591405]
This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries. We propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs). Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries.
arXiv Detail & Related papers (2025-06-23T17:53:18Z) - VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding [48.745013691038295]
VideoExpert is a general-purpose MLLM suitable for several temporal-sensitive video tasks. The Temporal Expert is responsible for modeling time sequences and performing temporal grounding. The Spatial Expert focuses on content detail analysis and instruction following. By offloading temporal grounding from content generation, VideoExpert prevents text pattern biases in timestamp predictions.
arXiv Detail & Related papers (2025-04-10T07:33:39Z) - TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs [55.23558461306722]
Video large language models have achieved remarkable performance in tasks such as video question answering. Our dataset focuses on enhancing temporal comprehension across five key dimensions. We introduce a multi-task prompt fine-tuning approach that seamlessly integrates temporal-sensitive tasks into existing instruction datasets.
arXiv Detail & Related papers (2025-03-13T03:05:11Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - TimeRefine: Temporal Grounding with Time Refining Video LLM [75.99665302872901]
Video temporal grounding aims to localize relevant temporal boundaries in a video given a textual prompt. We reformulate the temporal grounding task as a temporal refining task. We incorporate an auxiliary prediction head that penalizes the model more if a predicted segment deviates further from the ground truth.
arXiv Detail & Related papers (2024-12-12T18:59:11Z) - Video LLMs for Temporal Reasoning in Long Videos [7.2900856926028155]
TemporalVLM is a video large language model capable of effective temporal reasoning and fine-grained understanding in long videos. Our approach includes a visual encoder for mapping a long-term input video into features which are time-aware and contain both local and global cues. To facilitate the evaluation of TemporalVLM, we present a large-scale long video dataset of industry assembly processes, namely IndustryASM.
arXiv Detail & Related papers (2024-12-04T00:50:33Z) - TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability [26.376975842846235]
We introduce TimeMarker, a versatile Video-LLM designed for high-quality dialogue based on video content, emphasizing temporal localization. TimeMarker integrates Temporal Separator Tokens to enhance temporal awareness, accurately marking specific moments within videos. It employs the AnyLength mechanism for dynamic frame sampling and adaptive token merging, enabling effective handling of both short and long videos.
arXiv Detail & Related papers (2024-11-27T10:45:40Z) - Temporal Reasoning Transfer from Text to Video [51.68487044397409]
Video Large Language Models (Video LLMs) struggle with tracking temporal changes and reasoning about temporal relationships.
We introduce the Textual Temporal reasoning Transfer (T3) to transfer temporal reasoning abilities from text to video domains.
The LongVA-7B model achieves competitive performance on comprehensive video benchmarks.
arXiv Detail & Related papers (2024-10-08T16:10:29Z) - LITA: Language Instructed Temporal-Localization Assistant [71.68815100776278]
We introduce time tokens that encode timestamps relative to the video length to better represent time in videos.
We also introduce SlowFast tokens in the architecture to capture temporal information at fine temporal resolution.
We show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs.
arXiv Detail & Related papers (2024-03-27T22:50:48Z)
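Several of the related papers above represent time as discrete tokens; the LITA entry, for instance, encodes timestamps relative to the video length. A minimal, hedged illustration of that idea is sketched below; the `<t_k>` token format and the bin count are assumptions made for illustration, not LITA's actual vocabulary.

```python
# Illustrative sketch only: one way "time tokens relative to video length"
# (as in the LITA entry above) can be realized. The <t_k> format and the
# 100-bin vocabulary are assumptions, not LITA's exact token set.
def to_time_token(t_sec: float, duration_sec: float, num_tokens: int = 100) -> str:
    """Quantize an absolute timestamp into a discrete relative time token."""
    k = min(int(t_sec / duration_sec * num_tokens), num_tokens - 1)
    return f"<t_{k}>"

def from_time_token(token: str, duration_sec: float, num_tokens: int = 100) -> float:
    """Map a time token back to the center of its bin, in seconds."""
    k = int(token.strip("<>").split("_")[1])
    return (k + 0.5) / num_tokens * duration_sec

# A 90 s moment in a 300 s video becomes <t_30>; decoding returns ~91.5 s.
print(to_time_token(90, 300), round(from_time_token("<t_30>", 300), 1))
```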