DATE: Dynamic Absolute Time Enhancement for Long Video Understanding
- URL: http://arxiv.org/abs/2509.09263v1
- Date: Thu, 11 Sep 2025 08:49:22 GMT
- Title: DATE: Dynamic Absolute Time Enhancement for Long Video Understanding
- Authors: Chao Yuan, Yang Yang, Yehui Yang, Zach Cheng
- Abstract summary: Long video understanding remains a fundamental challenge for multimodal large language models (MLLMs). We propose Dynamic Absolute Time Enhancement (DATE) that enhances temporal awareness in MLLMs. We introduce a two-stage algorithm to ensure both semantic relevance and temporal coverage.
- Score: 8.720269393713451
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long video understanding remains a fundamental challenge for multimodal large language models (MLLMs), particularly in tasks requiring precise temporal reasoning and event localization. Existing approaches typically adopt uniform frame sampling and rely on implicit position encodings to model temporal order. However, these methods struggle with long-range dependencies, leading to critical information loss and degraded temporal comprehension. In this paper, we propose Dynamic Absolute Time Enhancement (DATE), which enhances temporal awareness in MLLMs through a Timestamp Injection Mechanism (TIM) and a semantically guided Temporal-Aware Similarity Sampling (TASS) strategy. Specifically, we interleave video frame embeddings with textual timestamp tokens to construct a continuous temporal reference system. We further reformulate the video sampling problem as a vision-language retrieval task and introduce a two-stage algorithm to ensure both semantic relevance and temporal coverage: first enriching each query into a descriptive caption to better align with the vision features, then sampling key events with a similarity-driven, temporally regularized greedy strategy. Our method achieves remarkable improvements in absolute time understanding and key event localization, resulting in state-of-the-art performance among 7B and 72B models on hour-long video benchmarks. Notably, our 7B model even exceeds many 72B models on some benchmarks.
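The two mechanisms above are concrete enough to sketch. Below is a minimal, illustrative Python rendering of both ideas: interleaving textual timestamp tokens with frame embeddings (TIM) and a similarity-driven, temporally regularized greedy sampler (TASS). All names (`interleave_timestamps`, `sample_frames`, `lambda_reg`), the scoring rule, and the penalty form are assumptions for illustration, not the paper's implementation.
```python
import numpy as np

def interleave_timestamps(frame_embeddings, timestamps, embed_text):
    """TIM-style sketch: interleave each frame embedding with a textual
    timestamp token so the LLM sees an explicit absolute-time reference.
    `embed_text` is a hypothetical text-embedding callable."""
    sequence = []
    for emb, t in zip(frame_embeddings, timestamps):
        sequence.append(embed_text(f"<{t:.1f}s>"))  # timestamp token
        sequence.append(emb)                        # frame embedding
    return sequence

def sample_frames(similarities, timestamps, k, lambda_reg=0.5):
    """TASS-style sketch: greedily pick k frames (k <= len(similarities))
    with high query-frame similarity, penalizing frames too close to
    already-chosen ones so the selection also covers the timeline."""
    chosen = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i, (sim, t) in enumerate(zip(similarities, timestamps)):
            if i in chosen:
                continue
            # Bonus shrinks when t is near an already-chosen timestamp.
            gap = min((abs(t - timestamps[j]) for j in chosen), default=np.inf)
            score = sim + lambda_reg * min(gap, 1.0)
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return sorted(chosen)
```
Here `lambda_reg` trades off semantic relevance (raw similarity) against temporal coverage (distance to frames already chosen); the paper's actual formulation may differ.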
Related papers
- CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models [66.56549019393042]
Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context.
arXiv Detail & Related papers (2026-01-08T10:03:07Z)
- MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding [40.37010049965347]
Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions. We propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability.
arXiv Detail & Related papers (2025-10-10T11:18:21Z)
- Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs [54.502280390499756]
We propose TimeWarp, which creates a targeted synthetic temporal dataset for fine-tuning that encourages the model to focus on the given input video. We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks.
arXiv Detail & Related papers (2025-10-04T21:48:40Z)
- Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding [18.027290155746112]
Temporal Search is a training-free framework that enables MLLMs to iteratively explore temporal regions for improved long video understanding. It is based on a key observation: the model's generation confidence across different temporal intervals is highly correlated with prediction accuracy. It refines the model's focus by iteratively shifting attention to more fine-grained temporal intervals, improving its understanding of long videos.
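This entry describes a loop concrete enough to sketch: score candidate intervals by the model's generation confidence, then zoom into the most confident one. A minimal illustration under assumptions follows; `answer_confidence` is a hypothetical callable standing in for an MLLM query, and the binary halving schedule is invented, not the paper's procedure.
```python
def temporal_search(video_len_s, answer_confidence, iters=4):
    """Confidence-guided zoom-in sketch: repeatedly split the current
    interval and keep the half where the model answers most confidently.
    `answer_confidence(start, end)` is a hypothetical callable returning
    the MLLM's generation confidence when shown only frames in [start, end]."""
    start, end = 0.0, video_len_s
    for _ in range(iters):
        mid = (start + end) / 2
        # Score both halves; zoom into the more confident one.
        if answer_confidence(start, mid) >= answer_confidence(mid, end):
            end = mid
        else:
            start = mid
    return start, end
```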
arXiv Detail & Related papers (2025-06-28T15:24:05Z)
- VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing in video sequences. A novel temporal mask fusion module employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z)
- DisTime: Distribution-based Time Representation for Video Large Language Models [23.176698643825123]
DisTime is a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space. DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks.
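As a rough illustration of what a learnable token inducing a continuous temporal embedding space might look like, here is a minimal PyTorch sketch; the module name, dimensions, and MLP modulation are assumptions, not DisTime's actual architecture.
```python
import torch
import torch.nn as nn

class TimeToken(nn.Module):
    """Sketch of a learnable continuous time embedding: a shared learnable
    token is modulated by a small MLP over the normalized timestamp, so
    nearby times get nearby embeddings (a continuous temporal space)."""
    def __init__(self, dim=1024):
        super().__init__()
        self.token = nn.Parameter(torch.randn(dim) * 0.02)  # learnable base token
        self.mlp = nn.Sequential(nn.Linear(1, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, t_normalized):  # t in [0, 1], shape (batch, 1)
        return self.token + self.mlp(t_normalized)
```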
arXiv Detail & Related papers (2025-05-30T08:10:18Z)
- Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment [0.0]
We propose UMaT, a framework that unifies visual and auditory inputs as structured text for large language models. It significantly improves state-of-the-art Long Video Question Answering accuracy.
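A toy sketch of the "everything as text" idea: merging visual captions and speech transcript segments into one time-ordered textual context. The tag format and function name are invented for illustration and are not UMaT's actual representation.
```python
def unify_as_text(frame_captions, asr_segments):
    """Merge visual captions and speech transcript segments into a single
    time-ordered textual context for an LLM. Inputs are lists of
    (start_seconds, text) tuples; the tag format is an assumption."""
    events = [(t, f"[VISION @ {t:.0f}s] {txt}") for t, txt in frame_captions]
    events += [(t, f"[AUDIO @ {t:.0f}s] {txt}") for t, txt in asr_segments]
    return "\n".join(line for _, line in sorted(events))

# Example:
# print(unify_as_text([(12, "a person opens a door")],
#                     [(13, '"come in, please"')]))
```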
arXiv Detail & Related papers (2025-03-12T05:28:24Z)
- STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
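The architectural placement described here (a temporal module between the per-frame image encoder and the LLM) can be sketched minimally in PyTorch; the plain temporal self-attention used below is a stand-in, not necessarily STORM's actual module, and the dimensions and names are assumptions.
```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Sketch of a temporal module between a per-frame image encoder and
    the LLM: self-attention across the time axis mixes information between
    frames before tokens are handed to the language model."""
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens):  # (batch, num_frames, dim)
        mixed, _ = self.attn(frame_tokens, frame_tokens, frame_tokens)
        return self.norm(frame_tokens + mixed)  # residual connection
```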
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- Temporal Preference Optimization for Long-Form Video Understanding [63.196246578583136]
Temporal Preference Optimization (TPO) is a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs. TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark.
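TPO is described here only at the level of a preference-based post-training framework; as a generic illustration, the sketch below shows a standard DPO-style preference loss that such post-training could use, with temporally grounded responses preferred over ungrounded ones. This is a stock formulation, not TPO's published objective.
```python
import torch
import torch.nn.functional as F

def preference_loss(logp_preferred, logp_rejected,
                    ref_logp_preferred, ref_logp_rejected, beta=0.1):
    """Generic DPO-style objective: push the policy to prefer temporally
    grounded responses over ungrounded ones, relative to a frozen reference
    model. All inputs are summed token log-probabilities (tensors)."""
    margin = ((logp_preferred - ref_logp_preferred)
              - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(beta * margin).mean()
```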
arXiv Detail & Related papers (2025-01-23T18:58:03Z)
- Temporal Contrastive Learning for Video Temporal Reasoning in Large Vision-Language Models [44.99833362998488]
Temporal Semantic Alignment via Dynamic Prompting (TSADP) is a novel framework that enhances temporal reasoning capabilities. We evaluate TSADP on the VidSitu dataset, augmented with enriched temporal annotations. Our analysis highlights the robustness, efficiency, and practical utility of TSADP, making it a step forward in the field of video-language understanding.
arXiv Detail & Related papers (2024-12-16T02:37:58Z)
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z)