BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
- URL: http://arxiv.org/abs/2509.06040v5
- Date: Mon, 29 Sep 2025 13:20:45 GMT
- Title: BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
- Authors: Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, Shanghang Zhang
- Abstract summary: We present BranchGRPO, a method that restructures the rollout process into a branching tree. On HPDv2.1 image alignment, BranchGRPO improves alignment scores by up to 16% over DanceGRPO. A hybrid variant, BranchGRPO-Mix, further accelerates training to 4.7x faster than DanceGRPO without degrading alignment.
- Score: 57.304411396229035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in aligning image and video generative models with Group Relative Policy Optimization (GRPO) has improved human preference alignment, but existing variants remain inefficient due to sequential rollouts and large numbers of sampling steps, and suffer from unreliable credit assignment: sparse terminal rewards are uniformly propagated across timesteps, failing to capture the varying criticality of decisions during denoising. In this paper, we present BranchGRPO, a method that restructures the rollout process into a branching tree, where shared prefixes amortize computation and pruning removes low-value paths and redundant depths. BranchGRPO introduces three contributions: (1) a branching scheme that amortizes rollout cost through shared prefixes while preserving exploration diversity; (2) a reward fusion and depth-wise advantage estimator that transforms sparse terminal rewards into dense step-level signals; and (3) pruning strategies that cut gradient computation but leave forward rollouts and exploration unaffected. On HPDv2.1 image alignment, BranchGRPO improves alignment scores by up to 16% over DanceGRPO, while reducing per-iteration training time by nearly 55%. A hybrid variant, BranchGRPO-Mix, further accelerates training to 4.7x faster than DanceGRPO without degrading alignment. On WanX video generation, it further achieves higher Video-Align scores with sharper and more temporally consistent frames than DanceGRPO. Code is available at https://fredreic1849.github.io/BranchGRPO-Webpage/.
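As a concrete illustration of the reward fusion and depth-wise advantage estimation described in the abstract, here is a minimal Python sketch. It is not the authors' implementation: the `Node` structure, the `fuse_rewards` and `depthwise_advantages` names, the mean-based fusion rule, and the per-depth group normalization are all assumptions made for illustration.

```python
# Hypothetical sketch of BranchGRPO-style reward fusion and depth-wise
# advantage estimation over a branching rollout tree; names and the exact
# fusion/normalization rules are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    depth: int                      # denoising step index of this node
    children: List["Node"] = field(default_factory=list)
    reward: Optional[float] = None  # sparse terminal reward, leaves only
    fused: float = 0.0              # fused reward after the bottom-up pass
    advantage: float = 0.0          # dense step-level advantage


def fuse_rewards(node: Node) -> float:
    """Bottom-up pass: an internal node's fused reward is the mean of its
    children's fused rewards, so sparse leaf rewards become a signal at
    every depth of the tree."""
    if not node.children:
        node.fused = node.reward if node.reward is not None else 0.0
    else:
        node.fused = sum(fuse_rewards(c) for c in node.children) / len(node.children)
    return node.fused


def depthwise_advantages(root: Node) -> None:
    """Group-normalize fused rewards within each depth, yielding an
    advantage for every branching decision rather than only at the end."""
    by_depth: Dict[int, List[Node]] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        by_depth.setdefault(node.depth, []).append(node)
        stack.extend(node.children)
    for nodes in by_depth.values():
        vals = [n.fused for n in nodes]
        mean = sum(vals) / len(vals)
        std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
        for n in nodes:
            n.advantage = (n.fused - mean) / (std + 1e-8)


# Tiny example: a root that branches into two subtrees of two leaves each.
leaves = [Node(depth=2, reward=r) for r in (0.9, 0.4, 0.7, 0.2)]
mids = [Node(depth=1, children=leaves[:2]), Node(depth=1, children=leaves[2:])]
root = Node(depth=0, children=mids)
fuse_rewards(root)
depthwise_advantages(root)
print([round(n.advantage, 2) for n in leaves])  # per-leaf step-level signal
```

Because siblings at the same depth share their prefix, the forward cost of the rollouts above this point is paid once per branch point rather than once per leaf, which is where the amortization claimed in the abstract would come from.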
Related papers
- iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z)
- Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment [61.80228667422234]
VGPO redefines value estimation across both temporal and group dimensions. It transforms the sparse terminal reward into dense, process-aware value estimates. It replaces standard group normalization with a novel process enhanced by absolute values to maintain a stable optimization signal.
arXiv Detail & Related papers (2025-12-13T16:31:26Z)
- TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models [14.130608036489336]
Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce TreeGRPO, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree.
arXiv Detail & Related papers (2025-12-09T01:17:34Z)
- Multi-GRPO: Multi-Group Advantage Estimation for Text-to-Image Generation with Tree-Based Trajectories and Multiple Rewards [18.89733659101796]
Group Relative Policy Optimization has shown promising potential for aligning text-to-image (T2I) models. Existing GRPO-based methods suffer from two critical limitations. We propose Multi-GRPO, a multi-group advantage estimation framework.
arXiv Detail & Related papers (2025-11-30T05:44:35Z)
- Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards [48.321707628011005]
Lookahead Tree-Based Rollouts (LATR) is a novel rollout strategy designed to explicitly promote trajectory-level diversity. LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2%.
arXiv Detail & Related papers (2025-10-28T11:12:02Z)
- GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping [63.33669214116784]
GRPO-Guard is a simple yet effective enhancement to existing GRPO frameworks. It restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates. It substantially mitigates implicit over-optimization without relying on heavy KL regularization.
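For reference, the PPO-style clipped objective that these GRPO variants build on constrains updates through the importance ratio; a standard textbook form (not GRPO-Guard's specific regulated variant) is:

$$
L^{\mathrm{clip}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.
$$

When $r_t$ drifts from 1 in a step-inconsistent way, clipping no longer constrains harmful updates uniformly across timesteps, which is the imbalance GRPO-Guard targets.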
arXiv Detail & Related papers (2025-10-25T14:51:17Z)
- Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning [45.51804571136028]
Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework that addresses these limitations by decomposing each step into three stages. SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training.
arXiv Detail & Related papers (2025-10-05T07:22:54Z)
- TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling [65.46347858249295]
TreePO is a self-guided rollout algorithm that views sequence generation as a tree-structured search process. TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity.
arXiv Detail & Related papers (2025-08-24T16:52:37Z)
- Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning [48.426139299991604]
We present SPRO, a novel framework that enables process-aware RL through two key innovations. SPRO outperforms vanilla GRPO with 3.4x higher training efficiency and a 17.5% test accuracy improvement. Notably, SPRO incurs no additional computational overhead compared to outcome-supervised RL methods such as GRPO, which benefits industrial implementation.
arXiv Detail & Related papers (2025-07-02T10:05:14Z)
- RGE-GS: Reward-Guided Expansive Driving Scene Reconstruction via Diffusion Priors [54.81109375939306]
RGE-GS is a novel expansive reconstruction framework that synergizes diffusion-based generation with reward-guided Gaussian integration. We propose a reward network that learns to identify and prioritize consistently generated patterns prior to reconstruction phases. During the reconstruction process, we devise a differentiated training strategy that automatically adjusts Gaussian optimization progress according to scene convergence metrics.
arXiv Detail & Related papers (2025-06-28T08:02:54Z)
- TreeRPO: Tree Relative Policy Optimization [55.97385410074841]
TreeRPO is a novel method that estimates the mathematical expectations of rewards at various reasoning steps using tree sampling. Building on the group-relative reward training mechanism of GRPO, TreeRPO innovatively computes rewards based on step-level groups generated during tree sampling.
arXiv Detail & Related papers (2025-06-05T15:56:38Z)
- CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models [68.26281707780761]
This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models. We show that CPPO achieves up to 8.32x speedup on GSM8K and 3.51x on Math while preserving or even enhancing the accuracy compared to the original GRPO.
arXiv Detail & Related papers (2025-03-28T11:30:05Z)
- Gradient Coding with Dynamic Clustering for Straggler Mitigation [57.9123881133818]
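The completion-pruning idea can be sketched in a few lines; the following is a hypothetical reading (keep only the completions whose advantages carry the most gradient signal), not CPPO's actual criterion, and `prune_completions` and `keep_ratio` are illustrative names:

```python
# Hypothetical sketch of completion pruning before the gradient step:
# completions with near-zero advantage contribute little to the policy
# update, so only the top fraction by |advantage| is kept. The selection
# rule and keep_ratio are illustrative assumptions.
from typing import List, Sequence, Tuple


def prune_completions(
    completions: Sequence[str],
    advantages: Sequence[float],
    keep_ratio: float = 0.5,
) -> Tuple[List[str], List[float]]:
    k = max(1, int(len(completions) * keep_ratio))
    # Both strongly positive and strongly negative advantages are
    # informative; near-zero ones are dropped from backpropagation.
    order = sorted(range(len(completions)), key=lambda i: -abs(advantages[i]))[:k]
    return [completions[i] for i in order], [advantages[i] for i in order]
```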
GC-DC regulates the number of straggling workers in each cluster based on the straggler behavior in the previous iteration.
We numerically show that GC-DC provides significant improvements in the average completion time (of each iteration) with no increase in the communication load compared to the original GC scheme.
arXiv Detail & Related papers (2020-11-03T18:52:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.