Diffusion-DRF: Differentiable Reward Flow for Video Diffusion Fine-Tuning
- URL: http://arxiv.org/abs/2601.04153v1
- Date: Wed, 07 Jan 2026 18:05:08 GMT
- Title: Diffusion-DRF: Differentiable Reward Flow for Video Diffusion Fine-Tuning
- Authors: Yifan Wang, Yanyu Li, Sergey Tulyakov, Yun Fu, Anil Kag
- Abstract summary: Diffusion-DRF is a differentiable reward flow for fine-tuning video diffusion models. It backpropagates VLM feedback through the diffusion denoising chain. It improves video quality and semantic alignment while mitigating reward hacking and collapse.
- Score: 72.16213872139748
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Direct Preference Optimization (DPO) has recently improved Text-to-Video (T2V) generation by enhancing visual fidelity and text alignment. However, current methods rely on non-differentiable preference signals from human annotations or learned reward models. This reliance makes training label-intensive, bias-prone, and easy to game, which often triggers reward hacking and unstable training. We propose Diffusion-DRF, a differentiable reward flow for fine-tuning video diffusion models using a frozen, off-the-shelf Vision-Language Model (VLM) as a training-free critic. Diffusion-DRF directly backpropagates VLM feedback through the diffusion denoising chain, converting logit-level responses into token-aware gradients for optimization. We propose an automated, aspect-structured prompting pipeline to obtain reliable multi-dimensional VLM feedback, while gradient checkpointing enables efficient updates through the final denoising steps. Diffusion-DRF improves video quality and semantic alignment while mitigating reward hacking and collapse -- without additional reward models or preference datasets. It is model-agnostic and readily generalizes to other diffusion-based generative tasks.
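The mechanism described in the abstract is concrete enough to sketch. Below is a minimal, hypothetical PyTorch rendering of the idea, not the authors' implementation: the final K denoising steps run under gradient checkpointing, the decoded frames go to a frozen VLM, and the gap between its "yes" and "no" logits for each aspect question serves as a differentiable reward. Every name here (`unet`, `vlm`, `decode`, `yes_id`, and so on) is an assumption for illustration.

```python
# Hypothetical sketch of a differentiable reward flow in PyTorch; not the
# paper's code. `unet`, `vlm`, `decode`, and the token ids are placeholders.
import torch
from torch.utils.checkpoint import checkpoint

def denoise_last_k(unet, x_t, timesteps):
    """Run the final K denoising steps under gradient checkpointing, so
    activations are recomputed in the backward pass rather than stored."""
    x = x_t
    for t in timesteps:
        eps = checkpoint(unet, x, t, use_reentrant=False)
        x = x - eps  # stand-in for the sampler's actual update rule
    return x

def vlm_reward(vlm, frames, question_ids, yes_id, no_id):
    """Turn the frozen VLM's logit-level response into a scalar reward:
    the log-odds that it answers 'yes' to one aspect question."""
    logits = vlm(frames, question_ids)  # (batch, vocab)
    return (logits[:, yes_id] - logits[:, no_id]).mean()

def drf_step(unet, vlm, decode, x_t, timesteps, questions, yes_id, no_id, opt):
    """One fine-tuning step: denoise, decode, score, and backpropagate the
    VLM feedback through the denoising chain into the diffusion weights."""
    frames = decode(denoise_last_k(unet, x_t, timesteps))
    # Aspect-structured feedback: average rewards over several questions
    # (e.g. fidelity, motion, text alignment), each posed to the same VLM.
    reward = torch.stack(
        [vlm_reward(vlm, frames, q, yes_id, no_id) for q in questions]
    ).mean()
    loss = -reward
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Keeping the VLM frozen (e.g. via `vlm.requires_grad_(False)`) makes it a training-free critic: gradients flow through its activations into the decoded frames and on into the denoiser's weights, but the critic itself never updates.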
Related papers
- Sequence-Adaptive Video Prediction in Continuous Streams using Diffusion Noise Optimization [63.37868191173104]
We propose an approach that continuously adapts a pre-trained diffusion model to a video stream. We term the approach Sequence Adaptive Video Prediction with Diffusion Noise Optimization (SAVi-DNO). Empirical results demonstrate improved performance based on FVD, SSIM, and PSNR metrics on long videos of Ego4D and OpenDV-YouTube.
arXiv Detail & Related papers (2025-11-23T02:58:10Z) - Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies [62.653984010274485]
Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. Prevailing VLAs either generate actions auto-regressively in a fixed left-to-right order or attach separate diffusion heads outside the backbone. We present Discrete Diffusion VLA, a unified-transformer policy that models discretized action chunks with discrete diffusion.
arXiv Detail & Related papers (2025-08-27T17:39:11Z) - NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows [75.70583906344815]
Diffusion models have been widely adopted as action decoders due to their ability to model complex, multimodal action distributions. We present NinA, a fast and expressive alternative to diffusion-based decoders for Vision-Language-Action (VLA) models.
arXiv Detail & Related papers (2025-08-23T00:02:15Z) - Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets [65.42834731617226]
We propose a reinforcement learning method for diffusion model finetuning, dubbed Nabla-GFlowNet. We show that our proposed method achieves fast yet diversity- and prior-preserving finetuning of Stable Diffusion, a large-scale text-conditioned image diffusion model.
arXiv Detail & Related papers (2024-12-10T18:59:58Z) - ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer [95.80384464922147]
ACDiT is a blockwise Conditional Diffusion Transformer. It offers a flexible interpolation between token-wise autoregression and full-sequence diffusion. We show that ACDiT performs best among all autoregressive baselines on image and video generation tasks.
arXiv Detail & Related papers (2024-12-10T18:13:20Z) - Exploring Iterative Refinement with Diffusion Models for Video Grounding [17.435735275438923]
Video grounding aims to localize the target moment in an untrimmed video corresponding to a given sentence query.
We propose DiffusionVG, a novel framework with diffusion models that formulates video grounding as a conditional generation task.
arXiv Detail & Related papers (2023-10-26T07:04:44Z) - Refined Semantic Enhancement towards Frequency Diffusion for Video Captioning [29.617527535279574]
Video captioning aims to generate natural language sentences that describe the given video accurately.
Existing methods obtain favorable generation by exploring richer visual representations in the encoding phase or by improving the decoding ability.
We introduce a novel Refined Semantic enhancement method towards Frequency Diffusion (RSFD), a captioning model that constantly perceives the linguistic representation of the infrequent tokens.
arXiv Detail & Related papers (2022-11-28T05:45:17Z)