Related papers: LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization

URL: http://arxiv.org/abs/2602.02341v1
Date: Mon, 02 Feb 2026 17:03:37 GMT
Title: LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization
Authors: Zhenpeng Huang, Jiaqi Li, Zihan Jia, Xinhao Li, Desen Meng, Lingxue Song, Xi Chen, Liang Li, Limin Wang
Abstract summary: LongVPO is a two-stage Direct Preference Optimization framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations. With only 16K synthetic examples and no costly human labels, LongVPO outperforms state-of-the-art open-source models on multiple long-video benchmarks.
Score: 20.692871849527815
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present LongVPO, a novel two-stage Direct Preference Optimization framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations. In Stage 1, we synthesize preference triples by anchoring questions to individual short clips, interleaving them with distractors, and applying visual-similarity and question-specificity filtering to mitigate positional bias and ensure unambiguous supervision. We also approximate the reference model's scoring over long contexts by evaluating only the anchor clip, reducing computational overhead. In Stage 2, we employ a recursive captioning pipeline on long videos to generate scene-level metadata, then use a large language model to craft multi-segment reasoning queries and dispreferred responses, aligning the model's preferences through multi-segment reasoning tasks. With only 16K synthetic examples and no costly human labels, LongVPO outperforms the state-of-the-art open-source models on multiple long-video benchmarks, while maintaining strong short-video performance (e.g., on MVBench), offering a scalable paradigm for efficient long-form video understanding.
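
The Stage 1 idea in the abstract (anchor clip interleaved with distractors, reference model scored on the anchor clip only) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the callable scorers policy_logp and ref_logp, and the value of beta are all hypothetical, and only the per-pair DPO loss itself is the standard formula.

```python
import math
import random


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probs."""
    x = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(x)), written in a numerically stable form
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def interleave(anchor_clip, distractors, rng):
    """Place the anchor clip at a random position among distractor clips."""
    pos = rng.randrange(len(distractors) + 1)
    return distractors[:pos] + [anchor_clip] + distractors[pos:], pos


def stage1_loss(policy_logp, ref_logp, anchor_clip, distractors,
                question, chosen, rejected, rng, beta=0.1):
    """Hypothetical Stage 1 objective for one (question, chosen, rejected) triple.

    policy_logp / ref_logp are assumed callables:
        (clips, question, response) -> log-probability of the response.
    """
    # The policy sees the full synthetic long context.
    long_context, _ = interleave(anchor_clip, distractors, rng)
    pi_c = policy_logp(long_context, question, chosen)
    pi_r = policy_logp(long_context, question, rejected)
    # The reference model is evaluated on the anchor clip alone,
    # which avoids a long-context forward pass for the reference.
    ref_c = ref_logp([anchor_clip], question, chosen)
    ref_r = ref_logp([anchor_clip], question, rejected)
    return dpo_loss(pi_c, pi_r, ref_c, ref_r, beta)
```

Randomizing the anchor position is one plausible way to reduce positional bias. The visual-similarity and question-specificity filters mentioned in the abstract are not modeled here.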

Related papers

Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models [28.68367581677484]
We introduce a novel end-to-end schema for long-form video understanding, which includes an information-density-based adaptive video sampler (AVS) and an autoencoder-based temporal video compressor (SVC) integrated with a multimodal large language model (MLLM). Our proposed system offers two major advantages: it adaptively captures essential information from video sequences of varying durations, and it achieves high compression rates while preserving crucial discriminative information.
arXiv Detail & Related papers (2026-02-19T22:04:27Z)
From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding [43.82717677801915]
Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision language tasks. Their practical use is limited by the "needle in a haystack" problem: the massive number of visual tokens produced from raw video frames exhausts the model's context window. We show that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding.
arXiv Detail & Related papers (2025-10-02T17:43:01Z)
Temporal Preference Optimization for Long-Form Video Understanding [63.196246578583136]
Temporal Preference Optimization (TPO) is a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs. TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark.
arXiv Detail & Related papers (2025-01-23T18:58:03Z)
SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Long Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content. We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context. Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
arXiv Detail & Related papers (2024-11-25T08:04:47Z)
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos. We leverage DINOv2 features to remove redundant frames that exhibit high similarity. We perform spatial token reduction across frames based on their temporal dependencies.
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset. We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them. Our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z)
Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding [57.917616284917756]
Real-world videos are often several minutes long with semantically consistent segments of variable length. A common approach to processing long videos is to apply a short-form video model over uniformly sampled clips of fixed temporal length. This approach neglects the underlying nature of long videos, since fixed-length clips are often redundant or uninformative.
arXiv Detail & Related papers (2023-09-20T18:13:32Z)
Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration. This enables the learning of long-range dependencies beyond a single clip. Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
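
Several entries above (for example LongVU) reduce video tokens by dropping frames whose features are highly similar. A minimal sketch of that general idea follows, assuming per-frame feature vectors have already been extracted (the LongVU summary mentions DINOv2 features). The greedy comparison against the last kept frame and the threshold value are illustrative assumptions, not the procedure of any listed paper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_redundant_frames(features, threshold=0.9):
    """Return indices of frames to keep.

    A frame is kept only if its similarity to the most recently kept
    frame is below `threshold` (a hypothetical cutoff).
    """
    kept = []
    for i, feat in enumerate(features):
        if not kept or cosine(features[kept[-1]], feat) < threshold:
            kept.append(i)
    return kept
```

With features [[1, 0], [1, 0.01], [0, 1], [0, 1]] and the default threshold, the function keeps indices 0 and 2, since frames 1 and 3 are near-duplicates of the frames before them.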

This list is automatically generated from the titles and abstracts of the papers on this site.