Related papers: Reparameterization Proximal Policy Optimization

Reparameterization Proximal Policy Optimization

URL: http://arxiv.org/abs/2508.06214v2
Date: Thu, 25 Sep 2025 15:17:31 GMT
Title: Reparameterization Proximal Policy Optimization
Authors: Hai Zhong, Xun Wang, Zhuoran Li, Longbo Huang,
Abstract summary: Policy gradient (RPG) is promising for improving sample efficiency by leveraging differentiable dynamics.<n>We draw inspiration from Proximal Policy Optimization (PPO), which uses a surrogate objective to enable stable sample reuse.<n>We propose Re Parameters Proximal Policy Optimization (RPO), a stable and sample-efficient RPG-based method.<n>RPO enables stable sample reuse over multiple epochs by employing a policy gradient clipping mechanism tailored for RPG.
Score: 35.59197802340267
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reparameterization policy gradient (RPG) is promising for improving sample efficiency by leveraging differentiable dynamics. However, a critical barrier is its training instability, where high-variance gradients can destabilize the learning process. To address this, we draw inspiration from Proximal Policy Optimization (PPO), which uses a surrogate objective to enable stable sample reuse in the model-free setting. We first establish a connection between this surrogate objective and RPG, which has been largely unexplored and is non-trivial. Then, we bridge this gap by demonstrating that the reparameterization gradient of a PPO-like surrogate objective can be computed efficiently using backpropagation through time. Based on this key insight, we propose Reparameterization Proximal Policy Optimization (RPO), a stable and sample-efficient RPG-based method. RPO enables stable sample reuse over multiple epochs by employing a policy gradient clipping mechanism tailored for RPG. It is further stabilized by Kullback-Leibler (KL) divergence regularization and remains fully compatible with existing variance reduction methods. We evaluate RPO on a suite of challenging locomotion and manipulation tasks, where experiments demonstrate that our method achieves superior sample efficiency and strong performance.

Related papers

Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for Large Language Models (LLMs)<n>We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint.<n>DPPO achieves superior training and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z)
A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization [58.116300485427764]
Reinforcement learning post-training can elicit reasoning behaviors in large language models.<n> token-level correction often leads to unstable training dynamics when the degree of off-policyness is large.<n>We propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO)
arXiv Detail & Related papers (2026-01-30T08:47:19Z)
Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning [77.92320830700797]
Reinforcement Learning has played a central role in enabling reasoning capabilities of Large Language Models.<n>We propose a tractable computational framework that tracks and leverages curvature information during policy updates.<n>The algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out.
arXiv Detail & Related papers (2025-10-01T12:29:32Z)
Relative Entropy Pathwise Policy Optimization [66.03329137921949]
We present an on-policy algorithm that trains Q-value models purely from on-policy trajectories.<n>We show how to combine policies for exploration with constrained updates for stable training, and evaluate important architectural components that stabilize value function learning.
arXiv Detail & Related papers (2025-07-15T06:24:07Z)
Value-Free Policy Optimization via Reward Partitioning [0.08192907805418585]
We introduce Reward Partitioning Optimization (RPO), a new method for single-trajectory reinforcement learning.<n>RPO normalizes observed rewards using a approach estimated directly from data.<n>We validate RPO on scalar-feedback language modeling tasks using Flan-T5 encoder-decoder models.
arXiv Detail & Related papers (2025-06-16T17:06:27Z)
A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning [61.403275660120606]
Reinforcement learning (RL)-based fine-tuning has emerged as a powerful approach for aligning diffusion models with black-box objectives.<n>We propose leave-one-out PPO (LOOP), a novel RL for diffusion fine-tuning method.<n>Our results demonstrate that LOOP effectively improves diffusion models on various black-box objectives, and achieves a better balance between computational efficiency and performance.
arXiv Detail & Related papers (2025-03-02T13:43:53Z)
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.<n>To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.<n>Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples. However, IS is employed in RL as a passive tool for re-weighting historical samples. We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
Sample Dropout: A Simple yet Effective Variance Reduction Technique in Deep Policy Optimization [18.627233013208834]
We show that the use of importance sampling could introduce high variance in the objective estimate. We propose a technique called sample dropout to bound the estimation variance by dropping out samples when their ratio deviation is too high.
arXiv Detail & Related papers (2023-02-05T04:44:35Z)
On the Reuse Bias in Off-Policy Reinforcement Learning [28.29153543457396]
Reuse Bias is the bias in off-policy evaluation caused by the reuse of the replay buffer for evaluation and optimization. We show that the off-policy evaluation and optimization of the current policy with the data from the replay buffer result in an overestimation of the objective. We present a novel Bias-Regularized Importance Sampling (BIRIS) framework along with practical algorithms, which can alleviate the negative impact of the Reuse Bias.
arXiv Detail & Related papers (2022-09-15T06:20:36Z)
Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient [34.16700176918835]
Off-policy Reinforcement Learning holds the promise of better data efficiency. Current off-policy policy gradient methods either suffer from high bias or high variance, delivering often unreliable estimates. We propose a nonparametric Bellman equation, which can be solved in closed form.
arXiv Detail & Related papers (2020-10-27T13:40:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.