Related papers: Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

URL: http://arxiv.org/abs/2601.04068v2
Date: Thu, 08 Jan 2026 02:51:26 GMT
Title: Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
Authors: Zitong Huang, Kaidong Zhang, Yukang Ding, Chao Gao, Rui Ding, Ying Chen, Wangmeng Zuo
Abstract summary: LocalDPO builds a novel framework for aligning video diffusion models with human preferences. We show that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches.
Score: 65.16788152626499
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline that generates preference pairs with a single inference per prompt, eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment.
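
Illustrative sketch (not from the paper): the abstract describes two mechanical pieces, a random spatio-temporal mask and a DPO loss restricted to the masked region. The Python below is a minimal sketch of how such pieces could fit together, assuming a Diffusion-DPO-style objective computed on denoising errors and a single box-shaped mask. The function names (random_box_mask, masked_mse, region_dpo_loss), the box shape, and the beta parameter are hypothetical choices for illustration; the paper's actual mask sampling and loss may differ.

```python
import math
import random


def random_box_mask(T, H, W, rng):
    """Return a T x H x W nested-list mask with one random non-empty box set to 1.0.

    Assumption: a single axis-aligned box; the paper's mask sampling may differ.
    """
    t0 = rng.randrange(T)
    t1 = rng.randrange(t0 + 1, T + 1)
    y0 = rng.randrange(H)
    y1 = rng.randrange(y0 + 1, H + 1)
    x0 = rng.randrange(W)
    x1 = rng.randrange(x0 + 1, W + 1)
    return [
        [
            [1.0 if (t0 <= t < t1 and y0 <= y < y1 and x0 <= x < x1) else 0.0
             for x in range(W)]
            for y in range(H)
        ]
        for t in range(T)
    ]


def masked_mse(pred, target, mask):
    """Mean squared error computed only over elements where mask == 1."""
    num = 0.0
    den = 0.0
    for p_frame, t_frame, m_frame in zip(pred, target, mask):
        for p_row, t_row, m_row in zip(p_frame, t_frame, m_frame):
            for p, t, m in zip(p_row, t_row, m_row):
                num += m * (p - t) ** 2
                den += m
    return num / max(den, 1.0)


def region_dpo_loss(err_w, err_w_ref, err_l, err_l_ref, beta=1.0):
    """Diffusion-DPO-style loss on masked denoising errors (hypothetical form).

    err_w / err_l: masked errors of the trained model on the positive (real)
    and negative (locally restored) video; *_ref: same for the frozen reference.
    Computes -log(sigmoid(-beta * margin)) as a numerically stable softplus.
    """
    margin = (err_w - err_w_ref) - (err_l - err_l_ref)
    z = beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

In this sketch, the loss equals log 2 when the trained model matches the reference, and drops below it when the model fits the real video better than the reference inside the masked region while fitting the restored negative worse.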

Related papers

Reg-DPO: SFT-Regularized Direct Preference Optimization with GT-Pair for Improving Video Generation [19.119239411510936]
We introduce a GT-Pair that builds high-quality preference pairs by using real videos as positives and model-generated videos as negatives. We also present Reg-DPO, which incorporates the SFT loss as a regularization term into the DPO loss to enhance training stability and generation fidelity.
arXiv Detail & Related papers (2025-11-03T11:04:22Z)
DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models [92.36630583208647]
We introduce DenseDPO, a method that addresses shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal.
arXiv Detail & Related papers (2025-06-04T03:06:08Z)
Smoothed Preference Optimization via ReNoise Inversion for Aligning Diffusion Models with Varied Human Preferences [13.588231827053923]
Direct Preference Optimization (DPO) aligns text-to-image (T2I) generation models with human preferences using pairwise preference data. We propose SmPO-Diffusion, a novel method for modeling preference distributions to improve the DPO objective. Our approach effectively mitigates issues of excessive optimization and objective misalignment present in existing methods.
arXiv Detail & Related papers (2025-06-03T09:47:22Z)
SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed Captioning [69.34975070207763]
We leverage preference learning to enhance the performance of vision-language models in fine-grained video captioning. We propose a novel optimization method offering significant advantages over DPO and its variants. Results demonstrate that SynPO consistently outperforms DPO variants while achieving 20% improvement in training efficiency.
arXiv Detail & Related papers (2025-06-01T04:51:49Z)
Self-NPO: Negative Preference Optimization of Diffusion Models by Simply Learning from Itself without Explicit Preference Annotations [60.143658714894336]
Diffusion models have demonstrated remarkable success in various visual generation tasks, including image, video, and 3D content generation. Preference optimization (PO) is a prominent and growing area of research that aims to align these models with human preferences. We introduce Self-NPO, a Negative Preference Optimization approach that learns exclusively from the model itself.
arXiv Detail & Related papers (2025-05-17T01:03:46Z)
Discriminator-Free Direct Preference Optimization for Video Diffusion [25.304451979598863]
We propose a discriminator-free video DPO framework that uses original real videos as win cases and edited versions as lose cases. We theoretically prove the framework's effectiveness even when real videos and model-generated videos follow different distributions.
arXiv Detail & Related papers (2025-04-11T13:55:48Z)
Temporal Preference Optimization for Long-Form Video Understanding [63.196246578583136]
Temporal Preference Optimization (TPO) is a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs. TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark.
arXiv Detail & Related papers (2025-01-23T18:58:03Z)
Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM [54.2320450886902]
Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs. Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware. We introduce Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free and Preference-Aligned prompts tailored to a specific video diffusion model.
arXiv Detail & Related papers (2024-12-19T18:32:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.