SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed Captioning
- URL: http://arxiv.org/abs/2506.00835v1
- Date: Sun, 01 Jun 2025 04:51:49 GMT
- Title: SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed Captioning
- Authors: Jisheng Dang, Yizhou Zhang, Hao Ye, Teng Wang, Siming Chen, Huicheng Zheng, Yulan Guo, Jianhuang Lai, Bin Hu
- Abstract summary: We leverage preference learning to enhance the performance of vision-language models in fine-grained video captioning. We propose a novel optimization method offering significant advantages over DPO and its variants. Results demonstrate that SynPO consistently outperforms DPO variants while achieving a 20% improvement in training efficiency.
- Score: 69.34975070207763
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-grained video captioning aims to generate detailed, temporally coherent descriptions of video content. However, existing methods struggle to capture subtle video dynamics and rich detailed information. In this paper, we leverage preference learning to enhance the performance of vision-language models in fine-grained video captioning, while mitigating several limitations inherent to direct preference optimization (DPO). First, we propose a pipeline for constructing preference pairs that leverages the intrinsic properties of VLMs along with partial assistance from large language models, achieving an optimal balance between cost and data quality. Second, we propose Synergistic Preference Optimization (SynPO), a novel optimization method offering significant advantages over DPO and its variants. SynPO prevents negative preferences from dominating the optimization, explicitly preserves the model's language capability to avoid deviation of the optimization objective, and improves training efficiency by eliminating the need for the reference model. We extensively evaluate SynPO not only on video captioning benchmarks (e.g., VDC, VDD, VATEX) but also across well-established NLP tasks, including general language understanding and preference evaluation, using diverse pretrained models. Results demonstrate that SynPO consistently outperforms DPO variants while achieving a 20% improvement in training efficiency. Code is available at https://github.com/longmalongma/SynPO
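For context, the abstract contrasts SynPO with direct preference optimization (DPO), whose standard objective depends on a frozen reference model. The sketch below restates the well-known DPO loss and, purely as an illustration, a hypothetical reference-free variant with an added language-modeling term; the exact SynPO objective is given in the paper and the linked repository, so the second formula is an assumption about the general shape of such a loss, not the authors' formulation.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

$$
\mathcal{L}_{\mathrm{ref\text{-}free}} = -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\pi_\theta(y_w\mid x) - \beta\log\pi_\theta(y_l\mid x)\right)\right] + \lambda\,\mathbb{E}\left[-\log\pi_\theta(y_w\mid x)\right]
$$

Dropping $\pi_{\mathrm{ref}}$ removes the reference-model forward pass, which is consistent with the reported training-efficiency gain, and the $\lambda$-weighted negative log-likelihood term on the preferred caption is one common way to anchor a policy to its language-modeling objective; how SynPO actually keeps negative preferences from dominating and preserves language capability is specified in the paper.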
Related papers
- AVC-DPO: Aligned Video Captioning via Direct Preference Optimization [50.08618093204503]
Video multimodal large language models (video MLLMs) have achieved substantial progress in video captioning tasks. We propose Aligned Video Captioning via Direct Preference Optimization (AVC-DPO), a post-training framework designed to enhance captioning capabilities in video MLLMs through preference alignment. Our approach achieved first place on the Video Detailed Captioning benchmark in the LOVE@PRCV'25 Workshop Track 1A: Video Detailed Captioning Challenge.
arXiv Detail & Related papers (2025-07-02T08:51:45Z) - DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models [60.716734545171114]
We introduce DenseDPO, a method that addresses shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal.
arXiv Detail & Related papers (2025-06-04T03:06:08Z) - VPO: Aligning Text-to-Video Generation Models with Prompt Optimization [80.86205966195593]
Video generation models are typically trained on text-to-video pairs with highly detailed and carefully crafted descriptions. We introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. Our experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to baseline methods.
arXiv Detail & Related papers (2025-03-26T12:28:20Z) - TEMPLE: Temporal Preference Learning of Video LLMs via Difficulty Scheduling and Pre-SFT Alignment [48.94844127553743]
TEMPLE is a systematic framework that enhances the temporal reasoning capabilities of Video Large Language Models. Our approach consistently improves Video LLM performance across multiple benchmarks with a relatively small set of self-generated DPO data. Our findings highlight TEMPLE as a scalable and efficient complement to SFT-based methods, paving the way for developing reliable Video LLMs.
arXiv Detail & Related papers (2025-03-21T08:00:29Z) - IPO: Iterative Preference Optimization for Text-to-Video Generation [10.625127393884462]
We introduce an Iterative Preference Optimization strategy to enhance generated video quality by incorporating human feedback. IPO exploits a critic model to justify video generations for pairwise ranking, as in Direct Preference Optimization, or point-wise scoring. In addition, IPO incorporates the critic model with the multi-modality large language model, enabling it to automatically assign preference labels without the need for retraining or relabeling.
arXiv Detail & Related papers (2025-02-04T08:14:34Z) - VideoDPO: Omni-Preference Alignment for Video Diffusion Generation [48.36302380755874]
Direct Preference Optimization (DPO) has demonstrated significant improvements in language and image generation. We propose a VideoDPO pipeline by making several key adjustments. Our experiments demonstrate substantial improvements in both visual quality and semantic alignment.
arXiv Detail & Related papers (2024-12-18T18:59:49Z) - VideoSAVi: Self-Aligned Video Language Models without Human Supervision [0.6854849895338531]
VideoSAVi is a self-training pipeline that enables Video-LLMs to reason over video content without external supervision. VideoSAVi achieves state-of-the-art performance on MVBench (74.0%) and delivers significant improvements. Our model-agnostic approach is computationally efficient, requiring only 32 frames.
arXiv Detail & Related papers (2024-12-01T00:33:05Z) - Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)