Fugu-MT Paper Translation (Abstract): A Systematic Post-Train Framework for Video Generation

Paper overview: A Systematic Post-Train Framework for Video Generation

arxiv url: http://arxiv.org/abs/2604.25427v1
Date: Tue, 28 Apr 2026 09:34:51 GMT
Status: Translation complete
Last updated in system: 2026-04-29 16:49:17.80119
Title: A Systematic Post-Train Framework for Video Generation
Title (reference translation): A Systematic Post-Train Framework for Video Generation
Authors: Zeyue Xue, Siming Fu, Jie Huang, Shuai Lu, Haoran Li, Yijun Liu, Yuming Li, Xiaoxuan He, Mengzhao Chen, Haoyang Huang, Nan Duan, Ping Luo
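Abstract summary: Large-scale video diffusion models have shown a remarkable ability to generate high-resolution, semantically rich content. A large gap nevertheless remains between pretraining performance and real-world deployment requirements because of critical issues such as prompt sensitivity, temporal inconsistency, and prohibitive inference costs. This work proposes a comprehensive post-training framework that systematically aligns pretrained models with user intent through four synergistic stages.
Reference score (independently computed attention metric): 76.26555417456773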
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While large-scale video diffusion models have demonstrated impressive capabilities in generating high-resolution and semantically rich content, a significant gap remains between their pretraining performance and real-world deployment requirements due to critical issues such as prompt sensitivity, temporal inconsistency, and prohibitive inference costs. To bridge this gap, we propose a comprehensive post-training framework that systematically aligns pretrained models with user intentions through four synergistic stages: we first employ Supervised Fine-Tuning (SFT) to transform the base model into a stable instruction-following policy, followed by a Reinforcement Learning from Human Feedback (RLHF) stage that utilizes a novel Group Relative Policy Optimization (GRPO) method tailored for video diffusion to enhance perceptual quality and temporal coherence; subsequently, we integrate Prompt Enhancement via a specialized language model to refine user inputs, and finally address system efficiency through Inference Optimization. Together, these components provide a systematic approach to improving visual quality, temporal coherence, and instruction following, while preserving the controllability learned during pretraining. The result is a practical blueprint for building scalable post-training pipelines that are stable, adaptable, and effective in real-world deployment. Extensive experiments demonstrate that this unified pipeline effectively mitigates common artifacts and significantly improves controllability and visual aesthetics while adhering to strict sampling cost constraints.
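Abstract (reference translation): Large-scale video diffusion models have shown an impressive ability to generate high-resolution, semantically rich content, yet a large gap remains between their pretraining performance and real-world deployment requirements because of critical issues such as prompt sensitivity, temporal inconsistency, and prohibitive inference costs. To close this gap, we propose a comprehensive post-training framework that systematically aligns pretrained models with user intent. The framework first uses Supervised Fine-Tuning (SFT) to turn the base model into a stable instruction-following policy. It then applies a Reinforcement Learning from Human Feedback (RLHF) stage that uses a Group Relative Policy Optimization (GRPO) method tailored to video diffusion to improve perceptual quality and temporal coherence. Next, it integrates Prompt Enhancement, using a specialized language model to refine user inputs, and finally addresses system efficiency through Inference Optimization. Together, these components provide a systematic approach to improving visual quality, temporal coherence, and instruction following while preserving the controllability learned during pretraining. The result is a practical blueprint for building scalable post-training pipelines that are stable, adaptable, and effective in real-world deployment. Extensive experiments show that this unified pipeline effectively mitigates common artifacts and significantly improves controllability and visual aesthetics while adhering to strict sampling cost constraints.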
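
The abstract names GRPO but does not describe the video-specific variant. As a minimal sketch of the generic group-relative idea only, and not of the paper's method, the snippet below normalizes reward scores within a group of samples generated for the same prompt. The function name, the `eps` parameter, and the reward values are hypothetical and chosen for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Generic group-relative advantage: (reward - group mean) / group std.
    This is an illustrative assumption, not the paper's video-specific variant.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four videos sampled from one prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Samples that score above the group mean receive positive advantages and those below receive negative ones, so no separately learned value baseline is needed in this formulation.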

Related papers