FuguReport

A Systematic Post-Train Framework for Video Generation

Authors: Zeyue Xue, Siming Fu, Jie Huang, Shuai Lu, Haoran Li, Yijun Liu, Yuming Li, Xiaoxuan He, Mengzhao Chen, Haoyang Huang, Nan Duan, Ping Luo
Affiliations: JD.com / The University of Hong Kong / Zhejiang University / Tsinghua University / Peking University
Categories: Method / Model Fine-Tuning / Post-training framework for alignment; Application / Video Generation / High-resolution semantically rich content; Evaluation / Deployment Efficiency / Gap between pretrained performance and deployment
License: CC BY 4.0

Abstract Overview

This paper proposes a unified post-training framework for video diffusion models, organized into four stages: supervised fine-tuning (SFT) to establish stable instruction-following behavior; GRPO-based reinforcement learning from human feedback (RLHF) to improve perceptual quality and temporal coherence; prompt enhancement (PE), in which an LLM trained with the same reward signals refines user inputs; and autoregressive distillation (AD) with a self-forcing objective to make inference more efficient. The framework targets common deployment issues, including prompt sensitivity, temporal inconsistency, local artifacts, and high sampling cost. The authors evaluate it on an internal video generation model through human evaluation, using a Good-Same-Bad (GSB) protocol that covers visual quality, motion quality, and text alignment.

Novelty

The main novelty is the systematic integration of four post-training components (SFT, GRPO-based RLHF adapted for flow-matching video diffusion, reward-driven prompt enhancement, and autoregressive distillation) into a single pipeline, rather than treating each objective in isolation. To make GRPO tractable for video generation, the work applies isotemporal grouping with single-timestep ODE-to-SDE transitions and temporal gradient rectification. It also uses the same reward-driven framework to train both the generator and the prompt enhancer.
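
For readers unfamiliar with GRPO, the group-relative advantage at its core can be sketched as follows. This is a minimal illustration of the generic GRPO normalization only: each sample's reward is compared against the other samples drawn for the same prompt. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made here for illustration; the paper's isotemporal grouping, ODE-to-SDE transitions, and temporal gradient rectification are not reproduced.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Hypothetical helper: standard GRPO-style normalization, not the
    authors' implementation. `eps` guards against a zero std when all
    samples in the group receive the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: three videos sampled for one prompt, scored by a reward model.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Samples that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged, so no separate value network is needed.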

Results

On the authors' internal model, the RLHF stage achieves a 31% improvement in the overall GSB metric, with the largest gains in visual quality and motion quality and more modest gains in text alignment, which the authors attribute to limitations of the current text alignment reward model. Adding the prompt enhancer yields a further 20% overall GSB improvement, driven primarily by visual and motion quality gains while preserving text alignment.
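
The summary does not state how the overall GSB metric is computed. One common convention, assumed here purely for illustration, is the share of "Good" votes minus the share of "Bad" votes. The function name and the vote counts below are hypothetical and are not taken from the paper.

```python
def gsb_score(good, same, bad):
    """Net preference under a Good-Same-Bad comparison.

    Assumed convention: (good - bad) / (good + same + bad).
    The paper may define its GSB metric differently.
    """
    total = good + same + bad
    if total == 0:
        raise ValueError("at least one comparison is required")
    return (good - bad) / total


# Hypothetical counts from 100 pairwise comparisons (not the paper's data).
print(gsb_score(good=45, same=35, bad=20))
```

Under this convention a positive score means raters preferred the new model more often than the baseline, and "Same" votes dilute the score without changing its sign.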

Key Points

  1. The framework organizes video post-training into four stages—SFT, GRPO-based RLHF with isotemporal grouping and temporal gradient rectification, prompt enhancement, and autoregressive distillation—each addressing distinct deployment gaps.
  2. Human evaluation shows the strongest GSB gains in visual quality and motion quality, while text-alignment improvement is more modest, which the authors attribute to limited accuracy of the current text-video alignment reward model.
  3. Prompt enhancement complements generator-side RLHF by optimizing user inputs under similar reward signals (text-video alignment, video aesthetics, and structural constraints) without modifying the generative backbone.
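
As a rough illustration of point 3, the three reward signals for the prompt enhancer could be combined as below. This is only a sketch under stated assumptions: the summary names the signals but not how they are combined, so the weighted sum, the hard structural gate, the weights, and the function name are all hypothetical.

```python
def prompt_reward(alignment, aesthetics, passes_structure,
                  w_align=0.5, w_aes=0.5):
    """Score one rewritten prompt from its generated video.

    Hypothetical combination, not the authors' formula:
    - alignment: text-video alignment score in [0, 1]
    - aesthetics: video aesthetics score in [0, 1]
    - passes_structure: whether the rewrite meets structural constraints
      (for example, length or format rules); failing rewrites get zero.
    """
    if not passes_structure:
        return 0.0
    return w_align * alignment + w_aes * aesthetics


# A rewrite that violates the structural constraints earns no reward.
print(prompt_reward(0.8, 0.6, passes_structure=False))
```

Because the reward depends only on the generated video and the rewritten prompt, the enhancer can be trained while the generative backbone stays frozen.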
