PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models
- URL: http://arxiv.org/abs/2509.25774v1
- Date: Tue, 30 Sep 2025 04:43:58 GMT
- Title: PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models
- Authors: Jeongjae Lee, Jong Chul Ye
- Abstract summary: We introduce Proportionate Credit Policy Optimization (PCPO), a framework that enforces proportional credit assignment through a stable objective reformulation and a principled reweighting of timesteps. PCPO substantially outperforms existing policy gradient baselines on all fronts, including the state-of-the-art DanceGRPO.
- Score: 54.18605375476406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While reinforcement learning has advanced the alignment of text-to-image (T2I) models, state-of-the-art policy gradient methods are still hampered by training instability and high variance, hindering convergence speed and compromising image quality. Our analysis identifies a key cause of this instability: disproportionate credit assignment, in which the mathematical structure of the generative sampler produces volatile and non-proportional feedback across timesteps. To address this, we introduce Proportionate Credit Policy Optimization (PCPO), a framework that enforces proportional credit assignment through a stable objective reformulation and a principled reweighting of timesteps. This correction stabilizes the training process, leading to significantly accelerated convergence and superior image quality. The improvement in quality is a direct result of mitigating model collapse, a common failure mode in recursive training. PCPO substantially outperforms existing policy gradient baselines on all fronts, including the state-of-the-art DanceGRPO.
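The abstract does not give PCPO's exact weighting scheme, but the idea of proportional credit across timesteps can be sketched as a reweighted policy-gradient loss. A minimal illustration, where `step_weights` is a hypothetical placeholder for whatever per-timestep weights the paper derives:

```python
import torch

def pcpo_style_loss(logps, advantages, step_weights):
    """Reweighted REINFORCE-style loss over diffusion timesteps.

    logps:        (B, T) log-probs of each denoising step under the policy
    advantages:   (B,)   per-sample advantage from the reward signal
    step_weights: (T,)   hypothetical per-timestep weights standing in for
                         PCPO's proportional-credit reweighting
    """
    w = step_weights / step_weights.sum()          # normalize to a distribution
    per_step = -(advantages[:, None] * logps) * w  # weighted credit per timestep
    return per_step.sum(dim=1).mean()

# Toy usage: B=4 sampled images, T=50 denoising steps, uniform weights.
B, T = 4, 50
logps = torch.randn(B, T, requires_grad=True)
loss = pcpo_style_loss(logps, torch.randn(B), torch.ones(T))
loss.backward()
```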
Related papers
- Q-Hawkeye: Reliable Visual Policy Optimization for Image Quality Assessment [25.916354359994624]
We propose Q-Hawkeye, an RL-based reliable visual policy optimization framework. Q-Hawkeye estimates predictive uncertainty using the variance of predicted scores across multiple rollouts. We introduce an Implicit Perception Loss that constrains the model to ground its quality judgments in genuine visual evidence.
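The variance-across-rollouts estimator is concrete enough to sketch; everything beyond the estimator itself (how the uncertainty gates the policy update) is left out:

```python
import torch

def rollout_uncertainty(scores):
    """Predictive uncertainty as the variance of predicted quality scores
    across stochastic rollouts (per the abstract).

    scores: (num_rollouts, batch) predicted scores per rollout.
    """
    return scores.var(dim=0, unbiased=True)

# Hypothetical example: 8 rollouts over a batch of 3 images.
uncertainty = rollout_uncertainty(torch.randn(8, 3))
```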
arXiv Detail & Related papers (2026-01-30T12:42:32Z) - A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization [58.116300485427764]
Reinforcement learning post-training can elicit reasoning behaviors in large language models. Token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. We propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO).
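One plausible reading of a prefix importance ratio is sketched below; the actual MinPRO objective may clip or combine this statistic differently:

```python
import torch

def min_prefix_ratio(logp_new, logp_old):
    """Cumulative (prefix) importance ratio at every token position, reduced
    by a minimum over the sequence. Whether MinPRO further transforms this
    statistic is not stated in the summary.

    logp_new, logp_old: (B, T) per-token log-probs under the new/old policy.
    """
    prefix_log_ratio = (logp_new - logp_old).cumsum(dim=1)  # log prod_{i<=t} r_i
    return prefix_log_ratio.exp().min(dim=1).values         # min over prefixes
```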
arXiv Detail & Related papers (2026-01-30T08:47:19Z) - M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization [9.358876832727239]
Self-supervised reinforcement learning (RL) presents a promising approach for enhancing the reasoning capabilities of Large Language Models (LLMs). We find that existing methods suffer from a critical failure mode under long-horizon training: a "policy collapse" where performance precipitously degrades. We introduce M-GRPO, a framework that leverages a slowly evolving momentum model to provide a stable training target. We also propose an adaptive filtering method based on the interquartile range (IQR) that dynamically prunes low-entropy trajectories.
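Both mechanisms named in the summary admit short sketches; the 1.5x IQR fence and the EMA form of the momentum model are conventional choices we assume, not details from the paper:

```python
import torch

def iqr_entropy_filter(entropies):
    """Keep trajectories whose entropy is above the conventional IQR lower
    fence; the 1.5x multiplier is our assumption.

    entropies: (N,) mean policy entropy per sampled trajectory.
    """
    q1, q3 = torch.quantile(entropies, 0.25), torch.quantile(entropies, 0.75)
    return entropies >= q1 - 1.5 * (q3 - q1)  # boolean keep-mask

@torch.no_grad()
def momentum_update(anchor, online, tau=0.99):
    """Slowly evolving momentum model as an EMA of the online policy
    (the EMA form is assumed; the paper may update its anchor differently)."""
    for p_a, p_o in zip(anchor.parameters(), online.parameters()):
        p_a.mul_(tau).add_(p_o, alpha=1 - tau)
```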
arXiv Detail & Related papers (2025-12-15T08:07:23Z) - Stabilizing Reinforcement Learning with LLMs: Formulation and Practices [61.361819972410046]
We show why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training.
arXiv Detail & Related papers (2025-12-01T07:45:39Z) - GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping [63.33669214116784]
GRPO-Guard is a simple yet effective enhancement to existing GRPO frameworks. It restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates. It substantially mitigates implicit over-optimization without relying on heavy KL regularization.
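A hedged sketch of what "restoring a step-consistent importance ratio before clipping" could look like; the mean-centering used here is our stand-in for GRPO-Guard's actual regulated normalization:

```python
import torch

def regulated_clip_loss(log_ratio, advantages, eps=0.2):
    """PPO-style clipped loss after re-centering per-step log-ratios so the
    clip binds symmetrically at every step (centering scheme assumed).

    log_ratio:  (B, T) per-step log importance ratios
    advantages: (B,)   group-relative advantages
    """
    centered = log_ratio - log_ratio.mean(dim=0, keepdim=True)  # step-consistent
    ratio = centered.exp()
    adv = advantages[:, None]
    unclipped = ratio * adv
    clipped = ratio.clamp(1 - eps, 1 + eps) * adv
    return -torch.minimum(unclipped, clipped).mean()
```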
arXiv Detail & Related papers (2025-10-25T14:51:17Z) - BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping [69.74252624161652]
We propose BAlanced Policy Optimization with Adaptive Clipping (BAPO). BAPO dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. On the AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B.
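Asymmetric clip bounds for positive versus negative advantages are one simple way to realize "adaptive re-balancing"; the sketch below uses fixed illustrative epsilons where BAPO adapts them dynamically:

```python
import torch

def asymmetric_clip_loss(ratio, adv, eps_pos=0.2, eps_neg=0.3):
    """Clipped policy loss with different bounds for positive- and
    negative-advantage tokens (fixed values purely illustrative).

    ratio: (N,) token-level importance ratios; adv: (N,) advantages.
    """
    eps = torch.where(adv > 0, torch.tensor(eps_pos), torch.tensor(eps_neg))
    clipped = torch.maximum(torch.minimum(ratio, 1 + eps), 1 - eps)
    return -torch.minimum(ratio * adv, clipped * adv).mean()
```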
arXiv Detail & Related papers (2025-10-21T12:55:04Z) - From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation [37.43722287763904]
A subject-driven image generation model faces a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability). We propose a novel framework featuring two key innovations: Synergy-Aware Reward Shaping and Time-Aware Dynamic Weighting. Our model achieves a superior balance, generating images that both preserve key identity features and accurately adhere to complex textual prompts.
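Time-Aware Dynamic Weighting suggests a schedule over the denoising trajectory; a minimal sketch with an assumed linear schedule (the paper's actual weighting, and its direction, are not given in the summary):

```python
def time_aware_reward(r_fidelity, r_prompt, t, num_steps):
    """Blend an identity-preservation reward and a prompt-adherence reward
    with a weight that shifts along the denoising trajectory. The linear
    schedule and its direction are assumptions, not the paper's scheme."""
    w = t / num_steps                       # 0 early in sampling, 1 late
    return (1 - w) * r_prompt + w * r_fidelity
```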
arXiv Detail & Related papers (2025-10-21T03:32:26Z) - Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning [77.92320830700797]
Reinforcement Learning has played a central role in enabling reasoning capabilities of Large Language Models. We propose a tractable computational framework that tracks and leverages curvature information during policy updates. The algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out.
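CAPO's masking step can be illustrated with a per-sample outlier test; the gradient-norm proxy below is our assumption, standing in for the curvature information the paper actually tracks:

```python
import torch

def stability_mask(per_sample_grad_norms, k=3.0):
    """Keep samples whose update magnitude is not an outlier. The
    gradient-norm statistic and the k-sigma rule are stand-ins for CAPO's
    curvature-based criterion."""
    mu, sigma = per_sample_grad_norms.mean(), per_sample_grad_norms.std()
    return per_sample_grad_norms <= mu + k * sigma  # boolean keep-mask
```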
arXiv Detail & Related papers (2025-10-01T12:29:32Z) - ACPO: Adaptive Curriculum Policy Optimization for Aligning Vision-Language Models in Complex Reasoning [17.928214942495412]
ACPO employs a dynamic curriculum that orchestrates a principled transition from a stable, near on-policy exploration phase to an efficient, off-policy exploitation phase. We conduct extensive experiments on a suite of challenging multimodal reasoning benchmarks, including MathVista, LogicVista, and MMMU-Pro. Results demonstrate that ACPO consistently outperforms strong baselines such as DAPO and PAPO, achieving state-of-the-art performance, accelerated convergence, and superior training stability.
arXiv Detail & Related papers (2025-10-01T09:11:27Z) - STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation [16.40446848402754]
Reinforcement learning has recently been explored to improve text-to-image generation. Applying existing GRPO algorithms to autoregressive (AR) image models remains challenging. In this work, we revisit GRPO for AR image generation and identify two key issues: contradictory gradients from unnecessary tokens and unstable policy entropy dynamics.
arXiv Detail & Related papers (2025-09-29T16:50:21Z) - TempFlow-GRPO: When Timing Matters for GRPO in Flow Models [22.023027865557637]
We introduce a principled GRPO framework that captures and exploits the temporal structure inherent in flow-based generation. These innovations endow the model with temporally-aware optimization that respects the underlying generative dynamics.
arXiv Detail & Related papers (2025-08-06T11:10:39Z) - Relative Entropy Pathwise Policy Optimization [66.03329137921949]
We present an on-policy algorithm that trains Q-value models purely from on-policy trajectories. We show how to combine policies for exploration with constrained updates for stable training, and evaluate important architectural components that stabilize value function learning.
arXiv Detail & Related papers (2025-07-15T06:24:07Z) - Enhancing Variational Autoencoders with Smooth Robust Latent Encoding [54.74721202894622]
Variational Autoencoders (VAEs) have played a key role in scaling up diffusion-based generative models. We introduce Smooth Robust Latent VAE (SRL-VAE), a novel adversarial training framework that boosts both generation quality and robustness. Experiments show that SRL-VAE improves both generation quality (image reconstruction and text-guided image editing) and robustness (against Nightshade attacks and image editing attacks).
arXiv Detail & Related papers (2025-04-24T03:17:57Z) - ROCM: RLHF on consistency models [8.905375742101707]
We propose a reward optimization framework for applying RLHF to consistency models. We investigate various $f$-divergences as regularization strategies, striking a balance between reward and model consistency.
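With reverse KL as one member of the $f$-divergence family the paper investigates, the regularized objective has a familiar one-line form (beta illustrative):

```python
import torch

def regularized_objective(reward, logp_model, logp_ref, beta=0.1):
    """Reward minus a reverse-KL penalty to the reference model, one member
    of the f-divergence family the paper studies; beta is illustrative."""
    kl = logp_model - logp_ref           # per-sample reverse-KL estimate
    return (reward - beta * kl).mean()   # maximize this objective
```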
arXiv Detail & Related papers (2025-03-08T11:19:48Z) - Generative Diffusion Prior for Unified Image Restoration and Enhancement [62.76390152617949]
Existing image restoration methods mostly leverage the posterior distribution of natural images.
We propose the Generative Diffusion Prior (GDP) to effectively model the posterior distributions in an unsupervised sampling manner.
GDP utilizes a pre-trained denoising diffusion probabilistic model (DDPM) to solve linear inverse, non-linear, or blind problems.
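The guidance idea behind posterior sampling with a diffusion prior can be sketched as a data-fidelity gradient step on the denoised estimate; GDP's concrete update rule, scheduling, and blind-degradation handling differ, and `degrade` here is a hypothetical differentiable degradation operator:

```python
import torch

def guided_estimate(x0_hat, y, degrade, step_size=1.0):
    """Data-fidelity gradient step nudging the denoised estimate x0_hat
    toward consistency with observation y under a known (or estimated)
    degradation operator. Shown only as a sketch of the guidance idea."""
    x0 = x0_hat.detach().requires_grad_(True)
    fidelity = (degrade(x0) - y).pow(2).sum()      # ||D(x0) - y||^2
    grad = torch.autograd.grad(fidelity, x0)[0]
    return x0_hat - step_size * grad               # guided x0 estimate
```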
arXiv Detail & Related papers (2023-04-03T16:52:43Z)