Related papers: Temporal Preference Optimization for Long-Form Video Understanding

Temporal Preference Optimization for Long-Form Video Understanding

URL: http://arxiv.org/abs/2501.13919v2
Date: Thu, 30 Jan 2025 17:35:08 GMT
Title: Temporal Preference Optimization for Long-Form Video Understanding
Authors: Rui Li, Xiaohan Wang, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy,
Abstract summary: Temporal Preference Optimization (TPO) is a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs.<n>TPO significantly enhances temporal understanding while reducing reliance on manually annotated data.<n>LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark.
Score: 28.623353303256653
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a self-training approach that enables models to differentiate between well-grounded and less accurate temporal responses by leveraging curated preference datasets at two granularities: localized temporal grounding, which focuses on specific video segments, and comprehensive temporal grounding, which captures extended temporal dependencies across entire video sequences. By optimizing on these preference datasets, TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. Extensive experiments on three long-form video understanding benchmarks--LongVideoBench, MLVU, and Video-MME--demonstrate the effectiveness of TPO across two state-of-the-art video-LMMs. Notably, LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark, underscoring the potential of TPO as a scalable and efficient solution for advancing temporal reasoning in long-form video understanding. Project page: https://ruili33.github.io/tpo_website.

Related papers

Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models [28.68367581677484]
We introduce a novel end-to-end schema for long-form video understanding, which includes an information-density-based adaptive video sampler (AVS) and an autoencoder-basedtemporal video compressor (SVC) integrated with a multimodal large language model (MLLM)<n>Our proposed system offers two major advantages: it adaptively captures essential information from video sequences of varying durations, and it achieves high compression rates while preserving crucial discriminative information.
arXiv Detail & Related papers (2026-02-19T22:04:27Z)
Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs [54.502280390499756]
We propose TimeWarp to create a targeted synthetic temporal dataset to fine-tune the model's responses to encourage it to focus on the given input video.<n>We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks.
arXiv Detail & Related papers (2025-10-04T21:48:40Z)
TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding [26.463523465270097]
Multi-language Large Language Models (MLLMs) have demonstrated significant progress in vision-based tasks, yet they still face challenges when processing long-duration video inputs.<n>We propose Temporal Policy Sampling Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning.<n>Our TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks, and shows transferable ability across different cutting-edge Video-MLLMs.
arXiv Detail & Related papers (2025-08-06T12:03:36Z)
Tempo-R0: A Video-MLLM for Temporal Video Grounding through Efficient Temporal Sensing Reinforcement Learning [6.9627404612894335]
Temporal Video Grounding (TVG) requires pinpointing relevant temporal segments from video based on language query.<n>We propose Tempo-R0: a Video Multimodal Large Language Model (Video-MLLM) for the temporal video grounding task.<n>Our method accomplishes a notable advantage over SOTA solutions by around 3.5% on the original QVHighlights testbench.
arXiv Detail & Related papers (2025-07-07T06:51:40Z)
Universal Video Temporal Grounding with Generative Multi-modal Large Language Models [59.781211641591405]
This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries.<n>We propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs)<n>Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries.
arXiv Detail & Related papers (2025-06-23T17:53:18Z)
Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency [56.475612147721264]
We propose a dual-reward formulation that supervises both semantic and temporal reasoning through discrete and continuous reward signals.<n>We evaluate our approach across eight representative video understanding tasks, including VideoQA, Temporal Video Grounding, and Grounded VideoQA.<n>Results underscore the importance of reward design and data selection in advancing reasoning-centric video understanding with MLLMs.
arXiv Detail & Related papers (2025-06-02T17:28:26Z)
VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models [80.92928946973026]
We introduce VistaDPO, a framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels. Experiments on benchmarks such as Video Hallucination, Video QA, and Captioning performance tasks demonstrate that VistaDPO significantly improves the performance of existing LVMs.
arXiv Detail & Related papers (2025-04-17T17:39:41Z)
Exploiting Temporal State Space Sharing for Video Semantic Segmentation [53.8810901249897]
Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes. Traditional methods often segment videos frame-by-frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements. We introduce a Temporal Video State Space Sharing architecture to leverage Mamba state space models for temporal feature sharing. Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory-heavy feature pool.
arXiv Detail & Related papers (2025-03-26T01:47:42Z)
TEMPLE:Temporal Preference Learning of Video LLMs via Difficulty Scheduling and Pre-SFT Alignment [48.94844127553743]
TEMPLE is a systematic framework that enhances temporal reasoning capabilities of Video Large Language Models. Our approach consistently improves Video LLM performance across multiple benchmarks with a relatively small set of self-generated DPO data. Our findings highlight our TEMPLE as a scalable and efficient complement to SFT-based methods, paving the way for developing reliable Video LLMs.
arXiv Detail & Related papers (2025-03-21T08:00:29Z)
Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM [54.2320450886902]
Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs.<n>Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware.<n>We introduce Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free and Preference-Aligned prompts tailored to specific video diffusion model.
arXiv Detail & Related papers (2024-12-19T18:32:21Z)
FlashVTG: Feature Layering and Adaptive Score Handling Network for Video Temporal Grounding [25.21011724370177]
Text-guided Video Temporal Grounding (VTG) aims to localize relevant segments in untrimmed videos based on textual descriptions.<n>We introduce FlashVTG, a framework featuring a Temporal Feature Layering (TFL) module and an Adaptive Score Refinement (ASR) module.<n>FlashVTG achieves state-of-the-art performance on four widely adopted datasets in both Moment Retrieval (MR) and Highlight Detection (HD)
arXiv Detail & Related papers (2024-12-18T02:23:33Z)
Video LLMs for Temporal Reasoning in Long Videos [7.2900856926028155]
TemporalVLM is a video large language model capable of effective temporal reasoning and fine-grained understanding in long videos.<n>Our approach includes a visual encoder for mapping a long-term input video into features which are time-aware and contain both local and global cues.
arXiv Detail & Related papers (2024-12-04T00:50:33Z)
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models [53.235170710385006]
We introduce Grounded-VideoLLM, a novel Video-LLM adept at perceiving and reasoning over specific video moments in a fine-grained manner. We sharpen our model by incorporating (1) an additional temporal stream to encode the relationships between frames and (2) discrete temporal tokens enriched with specific time knowledge. In experiments, Grounded-VideoLLM excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA.
arXiv Detail & Related papers (2024-10-04T10:04:37Z)
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward [118.65089648651308]
This paper introduces a novel framework that utilizes detailed video captions as a proxy of video content. We show that applying this tailored reward through DPO significantly improves the performance of video LMMs on video Question Answering (QA) tasks.
arXiv Detail & Related papers (2024-04-01T17:28:16Z)
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale, image--text models to video via shallow temporal fusion.<n>We expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video--language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
Revisiting the "Video" in Video-Language Understanding [56.15777956496518]
We propose the atemporal probe (ATP), a new model for video-language analysis. We characterize the limitations and potential of current video-language benchmarks. We show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
arXiv Detail & Related papers (2022-06-03T17:57:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.