A Systematic Post-Train Framework for Video Generation
Abstract Overview
This paper proposes a unified post-training framework for video diffusion models organized into four stages: supervised fine-tuning (SFT) to establish stable instruction-following behavior, GRPO-based reinforcement learning from human feedback (RLHF) to improve perceptual quality and temporal coherence, prompt enhancement (PE) via an LLM trained with the same reward signals to refine user inputs, and autoregressive distillation (AD) using a self-forcing objective for more efficient inference. The framework targets common deployment issues including prompt sensitivity, temporal inconsistency, local artifacts, and high sampling cost. Human evaluation using a Good-Same-Bad (GSB) protocol is conducted across visual quality, motion quality, and text alignment on an internal video generation model.
Novelty
The main novelty is the systematic integration of four post-training components—SFT, GRPO-based RLHF adapted for flow-matching video diffusion, reward-driven prompt enhancement, and autoregressive distillation—into a single unified pipeline rather than addressing these objectives in isolation. The work also applies isotemporal grouping with single-timestep ODE-to-SDE transitions and temporal gradient rectification to make GRPO tractable for video generation, and uses the same reward-driven framework to train both the generator and a prompt enhancer.
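The group-relative part of GRPO can be illustrated with a minimal sketch. It covers only the standard step of normalizing rewards within a group of samples drawn for the same prompt. It does not model the paper's isotemporal grouping, single-timestep ODE-to-SDE transitions, or temporal gradient rectification, whose details are not given in this summary. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group (samples for the same prompt).

    Each advantage is (reward - group mean) / (group std + eps).
    The population standard deviation and eps are illustrative choices,
    not values taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four videos generated from one prompt.
example_rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(example_rewards)
print(advantages)
```

Samples scoring above the group mean receive positive advantages and those below receive negative ones, so the policy update needs no separately learned value function.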
Results
On the authors' internal model, the RLHF stage achieves a 31% improvement in the overall GSB metric, with the largest gains in visual quality and motion quality and more modest gains in text alignment, which the authors attribute to limitations of the current text alignment reward model. Adding the prompt enhancer yields a further 20% overall GSB improvement, driven primarily by visual and motion quality gains while preserving text alignment.
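This summary does not state how the overall GSB figure is computed. One common convention, assumed here, is the share of "Good" votes minus the share of "Bad" votes over all pairwise comparisons. The sketch below uses that convention; the function name and the vote data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def gsb_score(votes):
    """Return (good - bad) / total for a list of 'G', 'S', 'B' votes.

    Assumed convention: 'G' means the new model is preferred, 'S' means
    no preference, and 'B' means the baseline is preferred.
    """
    if not votes:
        raise ValueError("votes must not be empty")
    counts = Counter(votes)
    unknown = set(counts) - {"G", "S", "B"}
    if unknown:
        raise ValueError(f"unexpected vote labels: {sorted(unknown)}")
    return (counts["G"] - counts["B"]) / len(votes)


# Made-up votes for illustration only: 6 Good, 3 Same, 1 Bad.
votes = ["G"] * 6 + ["S"] * 3 + ["B"] * 1
print(gsb_score(votes))  # 0.5
```

Under this convention a positive score means raters preferred the new model more often than the baseline, and "Same" votes pull the score toward zero.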
Key Points
- The framework organizes video post-training into four stages—SFT, GRPO-based RLHF with isotemporal grouping and temporal gradient rectification, prompt enhancement, and autoregressive distillation—each addressing distinct deployment gaps.
- Human evaluation shows the strongest GSB gains in visual quality and motion quality, while text-alignment improvement is more modest, which the authors attribute to limited accuracy of the current text-video alignment reward model.
- Prompt enhancement complements generator-side RLHF by optimizing user inputs under similar reward signals (text-video alignment, video aesthetics, and structural constraints) without modifying the generative backbone.
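The last point names three reward signals for the prompt enhancer but does not say how they are combined. As a rough sketch only, the code below assumes a weighted sum of alignment and aesthetics scores, gated by a pass/fail structural check, and then picks the best of several candidate rewrites. The function names, weights, gating rule, and scores are all hypothetical and do not describe the authors' method.

```python
def prompt_reward(alignment, aesthetics, structure_ok, w_align=0.5, w_aes=0.5):
    """Combine reward signals for one enhanced prompt (hypothetical rule).

    alignment and aesthetics are assumed to be scores in [0, 1].
    structure_ok is an assumed boolean check on the rewritten prompt,
    for example a length or format constraint. Failing it yields 0.
    """
    if not structure_ok:
        return 0.0
    return w_align * alignment + w_aes * aesthetics


def best_candidate(candidates):
    """Return the candidate rewrite with the highest combined reward.

    Each candidate is a tuple: (text, alignment, aesthetics, structure_ok).
    """
    return max(candidates, key=lambda c: prompt_reward(c[1], c[2], c[3]))


# Made-up candidates and scores for illustration only.
candidates = [
    ("a cat walking", 0.9, 0.4, True),
    ("a fluffy cat walking through a sunlit garden", 0.8, 0.8, True),
    ("cat " * 200, 0.9, 0.9, False),  # fails the structural check
]
print(best_candidate(candidates)[0])
```

In this sketch the reward depends only on the enhanced prompt and its generated video, which matches the point that prompt enhancement leaves the generative backbone unchanged.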
References
- arXiv: https://arxiv.org/abs/2604.25427v1
- Fugu-MT: https://fugumt.com/fugumt/paper_check/2604.25427v1
- Hugging Face Papers: https://huggingface.co/papers/2604.25427