Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs
- URL: http://arxiv.org/abs/2510.03955v1
- Date: Sat, 04 Oct 2025 21:48:40 GMT
- Title: Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs
- Authors: Sameep Vani, Shreyas Jena, Maitreya Patel, Chitta Baral, Somak Aditya, Yezhou Yang
- Abstract summary: We propose TimeWarp to create a targeted synthetic temporal dataset to fine-tune the model's responses to encourage it to focus on the given input video. We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks.
- Score: 54.502280390499756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Video Large Language Models (Video-LLMs) have demonstrated remarkable performance across general video understanding benchmarks, particularly in video captioning and descriptive tasks, they consistently underperform on tasks that require fine-grained temporal understanding. This limitation arises due to the lack of visual complexity and temporal nuance in current fine-tuning datasets, leading these models to rely heavily on language-based reasoning rather than truly understanding video dynamics. In this work, we propose TimeWarp, a systematic method to create a targeted synthetic temporal dataset to fine-tune the model's responses to encourage it to focus on the given input video. We introduce a large-scale preference dataset, created using TimeWarp, that captures intricate temporal dynamics often overlooked, grounding the model's responses to visual and temporal information. We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks, highlighting the effectiveness of our proposed datasets in advancing temporal understanding in Video-LLMs, resulting in an absolute improvement in performance across seven benchmarks. Code is available at https://github.com/sameepv21/timewarp.
Related papers
- DATE: Dynamic Absolute Time Enhancement for Long Video Understanding [8.720269393713451]
Long video understanding remains a fundamental challenge for multimodal large language models (MLLMs). We propose Dynamic Absolute Time Enhancement (DATE) that enhances temporal awareness in MLLMs. We introduce a two-stage algorithm to ensure both semantic relevance and temporal coverage.
arXiv Detail & Related papers (2025-09-11T08:49:22Z)
- Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data [100.5266292850922]
Strefer is a synthetic data generation framework designed to equip Video Large Models with referring and reasoning capabilities. Strefer produces diverse instruction-generation data using a data engine that pseudo-annotates temporally dense, fine-grained video metadata. Our approach enhances the ability of Video LLMs to interpret spatial and temporal references, fostering more versatile, space-time-aware reasoning essential for real-world AI companions.
arXiv Detail & Related papers (2025-09-03T17:33:20Z)
- Universal Video Temporal Grounding with Generative Multi-modal Large Language Models [59.781211641591405]
This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries. We propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs). Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries.
arXiv Detail & Related papers (2025-06-23T17:53:18Z)
- DisTime: Distribution-based Time Representation for Video Large Language Models [23.176698643825123]
DisTime is a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space. DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks.
arXiv Detail & Related papers (2025-05-30T08:10:18Z)
- STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- Temporal Preference Optimization for Long-Form Video Understanding [63.196246578583136]
Temporal Preference Optimization (TPO) is a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs. TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark.
arXiv Detail & Related papers (2025-01-23T18:58:03Z)
- Temporal Contrastive Learning for Video Temporal Reasoning in Large Vision-Language Models [44.99833362998488]
Temporal Semantic Alignment via Dynamic Prompting (TSADP) is a novel framework that enhances temporal reasoning capabilities. We evaluate TSADP on the VidSitu dataset, augmented with enriched temporal annotations. Our analysis highlights the robustness, efficiency, and practical utility of TSADP, making it a step forward in the field of video-language understanding.
arXiv Detail & Related papers (2024-12-16T02:37:58Z)
- Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z)
- Revisiting the "Video" in Video-Language Understanding [56.15777956496518]
We propose the atemporal probe (ATP), a new model for video-language analysis.
We characterize the limitations and potential of current video-language benchmarks.
We show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
arXiv Detail & Related papers (2022-06-03T17:57:33Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about specified spatial or temporal augmentation, and in doing so also achieve state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.