Reg-DPO: SFT-Regularized Direct Preference Optimization with GT-Pair for Improving Video Generation
- URL: http://arxiv.org/abs/2511.01450v3
- Date: Mon, 10 Nov 2025 03:10:25 GMT
- Title: Reg-DPO: SFT-Regularized Direct Preference Optimization with GT-Pair for Improving Video Generation
- Authors: Jie Du, Xinyu Gong, Qingshan Tan, Wen Li, Yangming Cheng, Weitao Wang, Chenlu Zhan, Suhui Wu, Hao Zhang, Jun Zhang
- Abstract summary: We introduce GT-Pair, which builds high-quality preference pairs by using real videos as positives and model-generated videos as negatives. We also present Reg-DPO, which incorporates the SFT loss as a regularization term into the DPO loss to enhance training stability and generation fidelity.
- Score: 19.119239411510936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have identified Direct Preference Optimization (DPO) as an efficient and reward-free approach to improving video generation quality. However, existing methods largely follow image-domain paradigms and are mainly developed on small-scale models (approximately 2B parameters), limiting their ability to address the unique challenges of video tasks, such as costly data construction, unstable training, and heavy memory consumption. To overcome these limitations, we introduce GT-Pair, which automatically builds high-quality preference pairs by using real videos as positives and model-generated videos as negatives, eliminating the need for any external annotation. We further present Reg-DPO, which incorporates the SFT loss as a regularization term into the DPO loss to enhance training stability and generation fidelity. Additionally, by combining the FSDP framework with multiple memory optimization techniques, our approach achieves nearly three times higher training capacity than using FSDP alone. Extensive experiments on both I2V and T2V tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches, delivering superior video generation quality.
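The abstract describes the two components only at a high level. Below is a minimal PyTorch-style sketch of how GT-Pair construction and the SFT-regularized DPO objective might fit together; the function names, the `beta` and `lambda_reg` hyperparameters, and the Diffusion-DPO-style form of the loss are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def build_gt_pair(caption, real_video, generator):
    """GT-Pair sketch: the real video is the positive (preferred) sample and
    the model's own generation is the negative, so no external annotation is
    needed. `generator` is a hypothetical sampling call, not the paper's API."""
    return {"prompt": caption,
            "chosen": real_video,            # positive: ground-truth footage
            "rejected": generator(caption)}  # negative: generated rollout

def reg_dpo_loss(chosen_err, chosen_err_ref, rejected_err, rejected_err_ref,
                 beta=500.0, lambda_reg=1.0):
    """Reg-DPO sketch: a diffusion-style DPO term plus the SFT (denoising)
    loss on the positive sample as a regularizer.

    Each `*_err` tensor holds per-sample denoising errors (e.g. MSE between
    predicted and true noise) under the trainable policy (`*_err`) or the
    frozen reference model (`*_err_ref`)."""
    # Implicit reward margins: how much the policy improves on each sample
    # relative to the frozen reference model.
    chosen_margin = chosen_err_ref - chosen_err
    rejected_margin = rejected_err_ref - rejected_err

    # DPO term: push the policy to denoise real (chosen) videos better than
    # its own generations (rejected), relative to the reference.
    dpo = -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

    # SFT regularizer: plain denoising loss on the real video, added for
    # training stability and generation fidelity per the abstract.
    sft = chosen_err.mean()
    return dpo + lambda_reg * sft

# Shape/gradient check with dummy per-sample errors for a batch of 4 pairs:
c = torch.rand(4, requires_grad=True)
r = torch.rand(4, requires_grad=True)
reg_dpo_loss(c, torch.rand(4), r, torch.rand(4)).backward()
```

The design point the abstract emphasizes is that the SFT term anchors the policy to the real-video distribution, counteracting the instability and fidelity loss that the DPO term alone can cause on large video models.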
Related papers
- TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment [28.18756041538092]
We present TAGRPO, a robust framework for I2V models inspired by contrastive learning. Our approach is grounded in the observation that rollout videos generated from identical initial noise provide superior guidance for optimization.
arXiv Detail & Related papers (2026-01-09T11:15:27Z)
- Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models [65.16788152626499]
LocalDPO is a novel framework for aligning video diffusion models with human preferences. We show that LocalDPO consistently improves video fidelity, temporal coherence, and human preference scores over other post-training approaches.
arXiv Detail & Related papers (2026-01-07T16:32:17Z)
- DAPE: Dual-Stage Parameter-Efficient Fine-Tuning for Consistent Video Editing with Diffusion Models [1.972901110298768]
We propose DAPE, a high-quality yet cost-effective two-stage parameter-efficient fine-tuning framework for video editing. In the first stage, we design an efficient norm-tuning method to enhance temporal consistency in generated videos. The second stage introduces a vision-friendly adapter to improve visual quality.
arXiv Detail & Related papers (2025-05-11T17:08:50Z)
- DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos. These innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z)
- VPO: Aligning Text-to-Video Generation Models with Prompt Optimization [105.1387607806783]
Video generation models are typically trained on text-to-video pairs with highly detailed and carefully crafted descriptions. We introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. Our experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to baseline methods.
arXiv Detail & Related papers (2025-03-26T12:28:20Z)
- TEMPLE: Temporal Preference Learning of Video LLMs via Difficulty Scheduling and Pre-SFT Alignment [48.94844127553743]
TEMPLE is a systematic framework that enhances the temporal reasoning capabilities of Video Large Language Models. Our approach consistently improves Video LLM performance across multiple benchmarks with a relatively small set of self-generated DPO data. Our findings highlight TEMPLE as a scalable and efficient complement to SFT-based methods, paving the way for developing reliable Video LLMs.
arXiv Detail & Related papers (2025-03-21T08:00:29Z)
- IPO: Iterative Preference Optimization for Text-to-Video Generation [10.625127393884462]
We introduce an Iterative Preference Optimization strategy to enhance generated video quality by incorporating human feedback. IPO exploits a critic model to judge video generations, either for pairwise ranking as in Direct Preference Optimization or for point-wise scoring. In addition, IPO implements the critic model with a multi-modal large language model, which enables it to automatically assign preference labels without the need for retraining or relabeling.
arXiv Detail & Related papers (2025-02-04T08:14:34Z)
- HuViDPO: Enhancing Video Generation through Direct Preference Optimization for Human-Centric Alignment [13.320911720001277]
We introduce the strategy of Direct Preference Optimization (DPO) into text-to-video (T2V) tasks. Existing T2V generation methods lack a well-formed pipeline with an exact loss function to guide the alignment of generated videos with human preferences.
arXiv Detail & Related papers (2025-02-02T16:55:42Z)
- Temporal Preference Optimization for Long-Form Video Understanding [63.196246578583136]
Temporal Preference Optimization (TPO) is a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs. TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark.
arXiv Detail & Related papers (2025-01-23T18:58:03Z)
- Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM [54.2320450886902]
Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs. Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unawareness. We introduce Prompt-A-Video, which excels in crafting video-centric, labor-free, and preference-aligned prompts tailored to a specific video diffusion model.
arXiv Detail & Related papers (2024-12-19T18:32:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.