Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
- URL: http://arxiv.org/abs/2512.04678v1
- Date: Thu, 04 Dec 2025 11:12:13 GMT
- Title: Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
- Authors: Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, Min Zhang
- Abstract summary: We introduce Reward Forcing, a novel framework for efficient streaming video generation. EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial-frame copying. Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics, as rated by a vision-language model.
- Score: 69.57572900337176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding-window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens that are initialized from the initial frames and continuously updated, via an exponential moving average, with tokens evicted as they exit the sliding window. Without additional computational cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial-frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics, as rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. Quantitative and qualitative experiments show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.
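The two mechanisms in the abstract are concrete enough to sketch in code. Below is a minimal PyTorch sketch of the EMA-Sink buffer as described: a fixed-size set of tokens initialized from the initial frames and updated, by exponential moving average, with tokens evicted from the sliding attention window. All names, shapes, and the decay constant are assumptions, not the paper's implementation; in particular, it assumes evicted tokens are already pooled to the sink size.

```python
import torch

class EMASink:
    """Fixed-size sink-token buffer (hypothetical implementation).

    Initialized from initial-frame tokens; each time tokens exit the
    sliding window, they are fused into the buffer by EMA, so the sink
    carries both long-term context and recent dynamics.
    """

    def __init__(self, init_tokens: torch.Tensor, decay: float = 0.99):
        self.tokens = init_tokens.clone()  # (num_sink, dim), from initial frames
        self.decay = decay                 # EMA decay; the value is an assumption

    @torch.no_grad()
    def update(self, evicted: torch.Tensor) -> None:
        """Fuse evicted tokens, assumed pre-pooled to (num_sink, dim)."""
        self.tokens.mul_(self.decay).add_(evicted, alpha=1.0 - self.decay)

    def kv_context(self, window_tokens: torch.Tensor) -> torch.Tensor:
        """Prepend sink tokens to the current window's key/value tokens."""
        return torch.cat([self.tokens, window_tokens], dim=0)
```

Because the update is a single fused multiply-add over a fixed-size buffer, it adds no meaningful per-step cost, consistent with the abstract's "without additional computational cost" claim.

For Re-DMD, the abstract only states that samples with greater VLM-rated dynamics are prioritized in the distribution-matching objective. One plausible reading is a per-sample loss reweighting; the softmax form below is purely an assumption:

```python
import torch

def re_dmd_weights(rewards: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Map per-sample dynamics rewards (e.g., VLM scores) to loss weights
    that bias distribution matching toward high-reward samples.
    Rescaled so the mean weight stays 1 and the overall loss scale is unchanged."""
    w = torch.softmax(rewards / temperature, dim=0)
    return w * rewards.numel()

# Hypothetical use inside a distillation step:
#   loss = (re_dmd_weights(vlm_scores) * per_sample_dmd_loss).mean()
```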
Related papers
- Transition Matching Distillation for Fast Video Generation [63.1049790376783]
We present Transition Matching Distillation (TMD), a novel framework for distilling video diffusion models into efficient few-step generators. TMD matches the multi-step denoising trajectory of a diffusion model with a few-step probability transition process. TMD provides a flexible and strong trade-off between generation speed and visual quality.
arXiv Detail & Related papers (2026-01-14T21:30:03Z)
- Adaptive Begin-of-Video Tokens for Autoregressive Video Diffusion Models [11.913945404405865]
Most video diffusion models (VDMs) generate videos in an autoregressive manner, producing each subsequent frame conditioned on previous ones. We propose Adaptive Begin-of-Video Tokens (ada-BOV) for autoregressive VDMs.
arXiv Detail & Related papers (2025-11-15T08:29:14Z)
- StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation [65.90400162290057]
Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. Live online streaming, by contrast, operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter.
arXiv Detail & Related papers (2025-11-10T18:51:28Z)
- Taming generative video models for zero-shot optical flow extraction [28.176290134216995]
Self-supervised video models trained only for future frame prediction can be prompted, without fine-tuning, to output flow. Inspired by the Counterfactual World Model (CWM) paradigm, we extend this idea to generative video models. KL-tracing is a novel test-time procedure that injects a localized perturbation into the first frame, rolls out the model one step, and computes the Kullback-Leibler divergence between the perturbed and unperturbed predictive distributions.
arXiv Detail & Related papers (2025-07-11T23:59:38Z)
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [67.94300151774085]
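The KL-tracing procedure is simple enough to sketch. A minimal version, assuming a model that returns per-location predictive logits for the next frame (the interface and shapes are assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def kl_trace(model, first_frame: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """One step of KL-tracing as summarized above: perturb the first frame,
    roll the model forward one step, and compare predictive distributions.
    Returns a per-location KL map; where it is large, the perturbation
    'moved', which can be read out as flow."""
    logits_clean = model(first_frame)         # (..., vocab) next-frame logits
    logits_pert = model(first_frame + delta)  # same model, perturbed input
    log_p = F.log_softmax(logits_pert, dim=-1)
    log_q = F.log_softmax(logits_clean, dim=-1)
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1)  # KL(perturbed || clean)
```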
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must, at inference, generate sequences conditioned on their own imperfect outputs.
arXiv Detail & Related papers (2025-06-09T17:59:55Z)
- Playing with Transformer at 30+ FPS via Next-Frame Diffusion [40.04104312955399]
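The train-test gap is easiest to see in code. A minimal sketch of a self-forcing rollout, where training conditions on the model's own generations rather than ground truth (the `model(frames) -> next_frame` interface is an assumption):

```python
import torch

def self_forcing_rollout(model, context: torch.Tensor, horizon: int) -> torch.Tensor:
    """Generate `horizon` frames, each conditioned on the model's own
    previous outputs, so the training-time distribution of contexts
    matches what the model sees at inference."""
    frames = [context]
    for _ in range(horizon):
        next_frame = model(torch.cat(frames, dim=1))  # condition on own outputs
        frames.append(next_frame)
    return torch.cat(frames[1:], dim=1)  # train losses on these generated frames
```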
Next-Frame Diffusion (NFD) is an autoregressive diffusion transformer that incorporates block-wise causal attention. We show that NFD beats autoregressive baselines in terms of both visual quality and sampling efficiency. We achieve autoregressive video generation at over 30 frames per second (FPS) on an A100 GPU using a 310M model.
arXiv Detail & Related papers (2025-06-02T07:16:01Z)
- From Slow Bidirectional to Fast Autoregressive Video Diffusion Models [48.35054927704544]
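Block-wise causal attention has a simple mask structure: bidirectional within a frame, causal across frames. A minimal sketch (frame and token counts are illustrative):

```python
import torch

def blockwise_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean mask where True = attention allowed. Tokens attend freely
    within their own frame (block) and causally to all earlier frames."""
    frame_id = torch.arange(num_frames * tokens_per_frame) // tokens_per_frame
    return frame_id[:, None] >= frame_id[None, :]  # key frame must not be in the future
```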
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly. Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models.
arXiv Detail & Related papers (2024-12-10T18:59:50Z)
- Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z)