Fugu-MT Paper Translation (Summary): BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models

Paper summary: BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models

arxiv url: http://arxiv.org/abs/2509.06040v4
Date: Tue, 16 Sep 2025 13:50:17 GMT
Status: Translation complete
Last updated in system: 2025-09-17 13:40:22.844474
Title: BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
Title (reference translation): BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
Authors: Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, Shanghang Zhang
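Abstract summary: BranchGRPO is a method that restructures the rollout process into a branching tree. On HPDv2.1 image alignment, BranchGRPO improves alignment scores by up to 16% over DanceGRPO. A hybrid variant, BranchGRPO-Mix, accelerates training to 4.7x faster than DanceGRPO.
Reference score (independently computed attention metric): 57.304411396229035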
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent progress in aligning image and video generative models with Group Relative Policy Optimization (GRPO) has improved human preference alignment, but existing variants remain inefficient due to sequential rollouts and large numbers of sampling steps, and they suffer from unreliable credit assignment: sparse terminal rewards are uniformly propagated across timesteps, failing to capture the varying criticality of decisions during denoising. In this paper, we present BranchGRPO, a method that restructures the rollout process into a branching tree, where shared prefixes amortize computation and pruning removes low-value paths and redundant depths. BranchGRPO introduces three contributions: (1) a branching scheme that amortizes rollout cost through shared prefixes while preserving exploration diversity; (2) a reward fusion and depth-wise advantage estimator that transforms sparse terminal rewards into dense step-level signals; and (3) pruning strategies that cut gradient computation but leave forward rollouts and exploration unaffected. On HPDv2.1 image alignment, BranchGRPO improves alignment scores by up to 16% over DanceGRPO, while reducing per-iteration training time by nearly 55%. A hybrid variant, BranchGRPO-Mix, further accelerates training to 4.7x faster than DanceGRPO without degrading alignment. On WanX video generation, it further achieves higher Video-Align scores with sharper and temporally consistent frames compared to DanceGRPO. Code is available at https://fredreic1849.github.io/BranchGRPO-Webpage/.
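Abstract (reference translation): Recent advances in aligning image and video generation models with GRPO (Group Relative Policy Optimization) have improved alignment with human preferences, but existing variants remain inefficient because of sequential rollouts, large numbers of sampling steps, and unreliable credit assignment. This paper describes BranchGRPO, a method that restructures the rollout process into a branching tree, in which shared prefixes amortize computation and pruning removes low-value paths and redundant depths. BranchGRPO introduces (1) a branching scheme that amortizes rollout cost through shared prefixes while maintaining exploration diversity, (2) a reward fusion and depth-wise advantage estimator that converts sparse terminal rewards into dense step-level signals, and (3) pruning strategies that reduce gradient computation without affecting forward rollouts or exploration. On HPDv2.1 image alignment, BranchGRPO improves alignment scores by up to 16% over DanceGRPO while reducing per-iteration training time by nearly 55%. The hybrid variant, BranchGRPO-Mix, accelerates training to 4.7x faster than DanceGRPO without degrading alignment. On WanX video generation, it obtains higher Video-Align scores with sharper and temporally consistent frames than DanceGRPO. Code is available at https://fredreic1849.github.io/BranchGRPO-Webpage/.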
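
The branching idea in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper or its code: the scalar "denoising" update, the `toy_reward` function, the choice of split steps, fusing rewards by averaging child values, and standardizing values per depth. Pruning is omitted. The sketch only shows how shared prefixes reduce the number of step evaluations and how leaf rewards can be turned into per-depth signals.

```python
import random
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class Node:
    state: float                      # toy scalar stand-in for a latent
    depth: int
    children: list = field(default_factory=list)
    value: float = 0.0                # reward fused upward from the leaves
    advantage: float = 0.0            # value standardized within its depth


def rollout_tree(num_steps, split_steps, branch_factor, rng):
    """Roll out a tree; nodes branch only at the steps listed in split_steps."""
    root = Node(state=rng.gauss(0.0, 1.0), depth=0)
    frontier = [root]
    for step in range(num_steps):
        next_frontier = []
        for node in frontier:
            k = branch_factor if step in split_steps else 1
            for _ in range(k):
                # Hypothetical stochastic update standing in for one denoising step.
                new_state = 0.5 * node.state + rng.gauss(0.0, 0.3)
                child = Node(state=new_state, depth=step + 1)
                node.children.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return root, frontier


def toy_reward(leaf):
    """Hypothetical terminal reward: prefer final states close to zero."""
    return -abs(leaf.state)


def fuse_rewards(node):
    """Assumed fusion rule: an inner node's value is the mean of its children's values."""
    if not node.children:
        node.value = toy_reward(node)
    else:
        node.value = mean([fuse_rewards(c) for c in node.children])
    return node.value


def depthwise_advantages(root, eps=1e-8):
    """Standardize fused values among all nodes that share a depth."""
    level = [root]
    while level:
        values = [n.value for n in level]
        mu, sigma = mean(values), pstdev(values)
        for n in level:
            n.advantage = (n.value - mu) / (sigma + eps)
        level = [c for n in level for c in n.children]


def count_steps(node):
    """Number of step evaluations in the tree (every node except the root)."""
    return sum(1 + count_steps(c) for c in node.children)


rng = random.Random(0)
root, leaves = rollout_tree(num_steps=4, split_steps={0, 2}, branch_factor=2, rng=rng)
fuse_rewards(root)
depthwise_advantages(root)

tree_steps = count_steps(root)   # steps evaluated with shared prefixes
flat_steps = len(leaves) * 4     # steps for the same number of independent rollouts
print(f"leaves={len(leaves)} tree_steps={tree_steps} flat_steps={flat_steps}")
```

With two binary splits over four steps, the tree yields 4 leaves from 12 step evaluations, whereas 4 independent rollouts would need 16. Every node, not just each leaf, ends up with its own advantage value, which is the sense in which a terminal reward becomes a step-level signal in this sketch.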

Related papers