DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models
- URL: http://arxiv.org/abs/2506.03517v1
- Date: Wed, 04 Jun 2025 03:06:08 GMT
- Title: DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models
- Authors: Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, Aliaksandr Siarohin,
- Abstract summary: We introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal.
- Score: 60.716734545171114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Direct Preference Optimization (DPO) has recently been applied as a post-training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from independent noise. However, this approach prohibits fine-grained comparisons, and we point out that it biases the annotators towards low-motion clips as they often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. This results in aligned pairs with similar motion structures while differing in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal. With only one-third of the labeled data, DenseDPO greatly improves motion generation over vanilla DPO, while matching it in text alignment, visual quality, and temporal consistency. Finally, we show that DenseDPO unlocks automatic preference annotation using off-the-shelf Vision Language Models (VLMs): GPT accurately predicts segment-level preferences similar to task-specifically fine-tuned video reward models, and DenseDPO trained on these labels achieves performance close to using human labels.
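The abstract's two core ideas, constructing aligned video pairs by partially corrupting and re-denoising a ground-truth clip, and applying the DPO objective per temporal segment rather than per clip, can be illustrated with a short sketch. The snippet below is a minimal, hedged PyTorch sketch rather than the paper's implementation: it assumes a diffusers-style scheduler interface, a Diffusion-DPO-style denoising objective, and frame-contiguous segments; the helper names (`noise_then_denoise`, `segment_dpo_loss`) and the exact pooling and loss form are illustrative assumptions.

```python
# Illustrative sketch only; APIs and loss form are assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def noise_then_denoise(model, scheduler, video, t_corrupt, generator):
    """SDEdit-style pair construction (assumed): partially noise a ground-truth clip,
    then denoise it with the current model, so the sample keeps the original motion
    structure while differing in local details."""
    noise = torch.randn(video.shape, generator=generator, device=video.device)
    noisy = scheduler.add_noise(video, noise, t_corrupt)        # corrupt up to step t_corrupt
    sample = noisy
    for t in scheduler.timesteps[scheduler.timesteps <= t_corrupt]:
        eps_pred = model(sample, t)
        sample = scheduler.step(eps_pred, t, sample).prev_sample  # standard reverse step
    return sample

def segment_dpo_loss(model, ref_model, noisy_a, noisy_b, t, eps, seg_pref, beta=5000.0):
    """Segment-level Diffusion-DPO objective (assumed form): compute the per-frame
    denoising error for both aligned clips, pool it over short temporal segments,
    and apply the DPO loss only on segments with an expressed preference.

    noisy_a / noisy_b: (B, C, T, H, W) clips noised with the same eps at timestep t
    seg_pref:          (B, S) in {+1 (a preferred), -1 (b preferred), 0 (tie / skip)}
    """
    def per_frame_err(net, x):
        return ((net(x, t) - eps) ** 2).mean(dim=(1, 3, 4))     # (B, T)

    err_a, err_b = per_frame_err(model, noisy_a), per_frame_err(model, noisy_b)
    with torch.no_grad():
        ref_a, ref_b = per_frame_err(ref_model, noisy_a), per_frame_err(ref_model, noisy_b)

    B, T = err_a.shape
    S = seg_pref.shape[1]
    seg = lambda e: e.view(B, S, T // S).mean(dim=-1)           # pool frames into S segments
    # Positive margin: clip "a" is the better denoising target relative to the reference.
    margin = (seg(ref_a) - seg(err_a)) - (seg(ref_b) - seg(err_b))  # (B, S)
    logits = beta * seg_pref * margin
    mask = (seg_pref != 0).float()
    return -(F.logsigmoid(logits) * mask).sum() / mask.sum().clamp(min=1.0)
```

In this sketch, `seg_pref` would come from annotators or an off-the-shelf VLM labeling each aligned segment pair, and segments without a preference simply drop out of the loss, which is how the denser segment-level signal enters training.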
Related papers
- AVC-DPO: Aligned Video Captioning via Direct Preference Optimization [50.08618093204503]
Video multimodal large language models (video MLLMs) have achieved substantial progress in video captioning tasks. We propose Aligned Video Captioning via Direct Preference Optimization (AVC-DPO), a post-training framework designed to enhance captioning capabilities in video MLLMs through preference alignment. We achieved first place on the Video Detailed Captioning benchmark in the LOVE@PRCV'25 Workshop Track 1A: Video Detailed Captioning Challenge.
arXiv Detail & Related papers (2025-07-02T08:51:45Z) - SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed Captioning [69.34975070207763]
We leverage preference learning to enhance the performance of vision-language models in fine-grained video captioning. We propose SynPO, a novel optimization method offering significant advantages over DPO and its variants. Results demonstrate that SynPO consistently outperforms DPO variants while achieving a 20% improvement in training efficiency.
arXiv Detail & Related papers (2025-06-01T04:51:49Z) - VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models [80.92928946973026]
We introduce VistaDPO, a framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels. Experiments on video hallucination, video QA, and captioning benchmarks demonstrate that VistaDPO significantly improves the performance of existing LVMs.
arXiv Detail & Related papers (2025-04-17T17:39:41Z) - Discriminator-Free Direct Preference Optimization for Video Diffusion [25.304451979598863]
We propose a discriminator-free video DPO framework that uses original real videos as win cases and edited versions as lose cases. We theoretically prove the framework's effectiveness even when real videos and model-generated videos follow different distributions.
arXiv Detail & Related papers (2025-04-11T13:55:48Z) - PaMi-VDPO: Mitigating Video Hallucinations by Prompt-Aware Multi-Instance Video Preference Learning [50.81779197183613]
Direct Preference Optimization (DPO) helps reduce hallucinations in Video Multimodal Large Language Models (VLLMs). We propose Video Direct Preference Optimization (VDPO), an online preference learning framework that eliminates the need for preference annotation. We introduce Prompt-aware Multi-instance Learning VDPO (PaMi-VDPO), which selects augmentations based on prompt context.
arXiv Detail & Related papers (2025-04-08T08:41:41Z) - Dual Caption Preference Optimization for Diffusion Models [51.223275938663235]
We propose Dual Caption Preference Optimization (DCPO), a novel approach that utilizes two distinct captions to mitigate irrelevant prompts. Our experiments show that DCPO significantly improves image quality and relevance to prompts, outperforming Stable Diffusion (SD) 2.1, SFT_Chosen, Diffusion-DPO, and MaPO across multiple metrics.
arXiv Detail & Related papers (2025-02-09T20:34:43Z) - Temporal Preference Optimization for Long-Form Video Understanding [28.623353303256653]
Temporal Preference Optimization (TPO) is a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs. TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark.
arXiv Detail & Related papers (2025-01-23T18:58:03Z) - VideoDPO: Omni-Preference Alignment for Video Diffusion Generation [48.36302380755874]
Direct Preference Optimization (DPO) has demonstrated significant improvements in language and image generation. We propose a VideoDPO pipeline by making several key adjustments. Our experiments demonstrate substantial improvements in both visual quality and semantic alignment.
arXiv Detail & Related papers (2024-12-18T18:59:49Z) - Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization [19.327911862822262]
We present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimized using multi-round direct preference optimization (mrDPO). Experiments show that mrDPO significantly enhances video-SALMONN 2's captioning accuracy, reducing global and local error rates by 40% and 20%, respectively.
arXiv Detail & Related papers (2024-10-09T08:44:47Z)