Return Capping: Sample-Efficient CVaR Policy Gradient Optimisation
- URL: http://arxiv.org/abs/2504.20887v2
- Date: Mon, 21 Jul 2025 03:55:34 GMT
- Title: Return Capping: Sample-Efficient CVaR Policy Gradient Optimisation
- Authors: Harry Mead, Clarissa Costen, Bruno Lacerda, Nick Hawes
- Abstract summary: We propose a reformulation of the optimisation problem by capping the total return of trajectories used in training. We show that this is equivalent to the original problem if the cap is set appropriately.
- Score: 16.74312997149021
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When optimising for conditional value at risk (CVaR) using policy gradients (PG), current methods rely on discarding a large proportion of trajectories, resulting in poor sample efficiency. We propose a reformulation of the CVaR optimisation problem by capping the total return of trajectories used in training, rather than simply discarding them, and show that this is equivalent to the original problem if the cap is set appropriately. We show, with empirical results in a number of environments, that this reformulation of the problem results in consistently improved performance compared to baselines. We have made all our code available here: https://github.com/HarryMJMead/cvar-return-capping.
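To make the capping idea concrete, below is a minimal score-function sketch, assuming a batch of complete trajectories and using the empirical alpha-quantile of returns as the cap. The paper itself derives the appropriate cap value; the function and tensor names here are illustrative, not the authors' implementation.

```python
import torch

def capped_cvar_pg_loss(log_probs, returns, alpha=0.1):
    """REINFORCE-style CVaR loss with return capping (sketch).

    log_probs: (B,) summed log pi(a|s) per trajectory
    returns:   (B,) total return per trajectory
    alpha:     CVaR level (fraction of worst-case trajectories)
    """
    # Use the empirical alpha-quantile of returns (the VaR) as the cap;
    # the paper shows equivalence to CVaR optimisation only when the cap
    # is set appropriately, so this choice is a simplification.
    cap = torch.quantile(returns, alpha)
    # Vanilla CVaR-PG would discard trajectories with return above the
    # cap; capping clips them instead, so every sampled trajectory
    # still contributes gradient signal.
    capped_returns = torch.minimum(returns, cap)
    # Standard score-function estimator on the capped returns.
    return -(log_probs * capped_returns.detach()).mean()
```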
Related papers
- Boosting CVaR Policy Optimization with Quantile Gradients [10.868006419885601]
We improve Conditional Value-at-Risk (CVaR) optimization using policy gradients (a.k.a. CVaR-PG). Quantile optimization admits a dynamic programming formulation that leverages all sampled data, thus improving sample efficiency. Empirical results in domains with verifiable risk-averse behavior show that our algorithm, within the Markovian policy class, substantially improves upon CVaR-PG.
arXiv Detail & Related papers (2026-01-29T18:33:46Z) - Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning [52.97053840476386]
We show that well-designed behaviour policies can be used to collect off-policy data for provably lower variance return estimates. We extend this key insight to the online reinforcement learning setting, where both policy evaluation and improvement are interleaved.
arXiv Detail & Related papers (2025-11-13T23:06:40Z) - Reparameterization Proximal Policy Optimization [35.59197802340267]
Reparameterization policy gradient (RPG) is promising for improving sample efficiency by leveraging differentiable dynamics. We draw inspiration from Proximal Policy Optimization (PPO), which uses a surrogate objective to enable stable sample reuse. We propose Reparameterization Proximal Policy Optimization (RPO), a stable and sample-efficient RPG-based method. RPO enables stable sample reuse over multiple epochs by employing a policy gradient clipping mechanism tailored for RPG.
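For context, the surrogate RPO draws on is the standard PPO clipped objective, sketched below in its vanilla likelihood-ratio form; RPO's actual clipping mechanism for reparameterization gradients differs and is not reproduced here.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Standard PPO clipped surrogate (vanilla likelihood-ratio form).

    All arguments are (B,) tensors; `advantages` is treated as fixed.
    """
    # Probability ratio between the current and data-collecting policy.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    # Clipping removes the incentive to move the ratio outside
    # [1 - eps, 1 + eps], which is what permits several epochs of
    # reuse on the same batch.
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```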
arXiv Detail & Related papers (2025-08-08T10:50:55Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
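A hedged sketch of the kind of combined objective described: a simplified DPO-style preference term (reference-policy log-ratios omitted) plus a supervised likelihood term. The exact form and the weight `lam` are assumptions, not the paper's.

```python
import torch
import torch.nn.functional as F

def regularized_preference_loss(chosen_logp, rejected_logp,
                                beta=0.1, lam=1.0):
    """Preference loss plus an SFT term acting as a regularizer.

    chosen_logp / rejected_logp: (B,) sequence log-probs under the
    policy for preferred / dispreferred responses.
    """
    # Logistic loss on the log-prob margin (simplified DPO-style term;
    # reference-policy log-ratios are omitted here).
    pref_loss = -F.logsigmoid(beta * (chosen_logp - rejected_logp)).mean()
    # Supervised term: keep likelihood high on the preferred responses.
    sft_loss = -chosen_logp.mean()
    return pref_loss + lam * sft_loss
```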
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
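The passive estimator in question is the standard importance-sampled score-function estimator, sketched below; the paper's contribution is choosing the behavioural policy actively, which this sketch does not implement.

```python
import torch

def is_weighted_pg_loss(target_log_probs, behav_log_probs, returns):
    """Importance-sampled score-function estimator (sketch).

    target_log_probs: (B,) trajectory log-probs under the target policy
                      (differentiable w.r.t. its parameters)
    behav_log_probs:  (B,) trajectory log-probs under the behavioural
                      policy that collected the data
    returns:          (B,) total return per trajectory
    """
    # w = pi_target(tau) / pi_behav(tau), computed in log space and
    # detached so only the score term carries gradient. The variance
    # of this estimator grows with policy mismatch, which is what an
    # active choice of behavioural policy aims to control.
    weights = torch.exp(target_log_probs - behav_log_probs).detach()
    return -(weights * target_log_probs * returns).mean()
```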
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Risk-averse Learning with Non-Stationary Distributions [18.15046585146849]
In this paper, we investigate risk-averse online optimization where the distribution of the random cost changes over time.
We minimize a risk-averse objective function using the Conditional Value at Risk (CVaR) as the risk measure.
We show that our designed learning algorithm achieves sub-linear dynamic regret with high probability for both convex and strongly convex functions.
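For reference, the CVaR risk measure used throughout this list, in its standard Rockafellar-Uryasev form for a cost C at level alpha:

```latex
% CVaR at level \alpha of a cost C, Rockafellar-Uryasev form; the
% minimiser t^* is the value at risk VaR_\alpha(C):
\mathrm{CVaR}_\alpha(C) \;=\; \min_{t \in \mathbb{R}}
  \Big\{\, t + \tfrac{1}{\alpha}\,\mathbb{E}\big[(C - t)_+\big] \,\Big\}
% For continuous return distributions, equivalently the tail mean:
% \mathrm{CVaR}_\alpha(R) = \mathbb{E}[\, R \mid R \le \mathrm{VaR}_\alpha(R) \,].
```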
arXiv Detail & Related papers (2024-04-03T18:16:47Z) - A Simple Mixture Policy Parameterization for Improving Sample Efficiency of CVaR Optimization [33.752940941471756]
Reinforcement learning algorithms utilizing policy gradients (PG) to optimize Conditional Value at Risk (CVaR) face significant challenges with sample inefficiency.
We propose a simple mixture policy parameterization that integrates a risk-neutral policy with an adjustable policy to form a risk-averse policy.
Our empirical study reveals that this mixture parameterization is uniquely effective across a variety of benchmark domains.
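A minimal sketch of a mixture parameterization of this kind, assuming component policies exposing `log_prob`; the shapes and mixing scheme are illustrative, not the paper's exact construction.

```python
import torch

class MixturePolicy(torch.nn.Module):
    """Mixture of a risk-neutral and an adjustable policy (sketch).

    Both components are assumed to expose log_prob(obs, act); the
    mixing scheme below is illustrative, not the paper's exact one.
    """

    def __init__(self, neutral, adjustable, init_w=0.5):
        super().__init__()
        self.neutral = neutral        # trained for expected return
        self.adjustable = adjustable  # shaped by the CVaR objective
        self.logit_w = torch.nn.Parameter(
            torch.logit(torch.tensor(init_w)))

    def log_prob(self, obs, act):
        w = torch.sigmoid(self.logit_w)
        # Mixture density: (1 - w) * pi_neutral + w * pi_adjustable,
        # evaluated stably in log space.
        lp = torch.stack([
            torch.log1p(-w) + self.neutral.log_prob(obs, act),
            torch.log(w) + self.adjustable.log_prob(obs, act),
        ])
        return torch.logsumexp(lp, dim=0)
```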
arXiv Detail & Related papers (2024-03-17T02:24:09Z) - Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation [46.61909578101735]
Adversarial Policy Optimization (AdvPO) is a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback.
In this paper, we introduce a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model.
arXiv Detail & Related papers (2024-03-08T09:20:12Z) - SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits [13.02672341061555]
We study the problem of optimal data collection for policy evaluation in linear bandits.
We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting.
We then use this formulation to derive the optimal allocation of samples per action during data collection.
arXiv Detail & Related papers (2023-01-29T04:33:13Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Variance Reduction based Experience Replay for Policy Optimization [3.0790370651488983]
Variance Reduction Experience Replay (VRER) is a framework for the selective reuse of relevant samples to improve policy gradient estimation.
VRER forms the foundation of our sample-efficient off-policy learning algorithm known as Policy Gradient with VRER.
arXiv Detail & Related papers (2021-10-17T19:28:45Z) - Exact Optimization of Conformal Predictors via Incremental and Decremental Learning [46.9970555048259]
Conformal Predictors (CP) are wrappers around ML methods, providing error guarantees under weak assumptions on the data distribution.
They are suitable for a wide range of problems, from classification and regression to anomaly detection.
We show that it is possible to speed up a CP classifier considerably, by studying it in conjunction with the underlying ML method, and by exploiting incremental and decremental learning.
arXiv Detail & Related papers (2021-02-05T15:31:37Z) - Variance Penalized On-Policy and Off-Policy Actor-Critic [60.06593931848165]
We propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return.
Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.
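A mean-variance criterion of the kind optimized here, with lambda >= 0 trading expected return against return variance (the exact penalised form is assumed):

```latex
% Mean-variance criterion of this kind (\lambda \ge 0 trades expected
% return against return variance; the exact penalised form is assumed):
J_\lambda(\theta) \;=\; \mathbb{E}_{\pi_\theta}[G] \;-\; \lambda\,\mathrm{Var}_{\pi_\theta}[G]
```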
arXiv Detail & Related papers (2021-02-03T10:06:16Z) - Sparse Feature Selection Makes Batch Reinforcement Learning More Sample
Efficient [62.24615324523435]
This paper provides a statistical analysis of high-dimensional batch Reinforcement Learning (RL) using sparse linear function approximation.
When there is a large number of candidate features, our result sheds light on the fact that sparsity-aware methods can make batch RL more sample efficient.
arXiv Detail & Related papers (2020-11-08T16:48:02Z) - Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Top-k Training of GANs: Improving GAN Performance by Throwing Away Bad Samples [67.11669996924671]
We introduce a simple (one line of code) modification to the Generative Adversarial Network (GAN) training algorithm.
When updating the generator parameters, we zero out the gradient contributions from the elements of the batch that the critic scores as 'least realistic'.
We show that this 'top-k update' procedure is a generally applicable improvement.
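The one-line modification, sketched under the assumption of a non-saturating generator loss (the loss form is illustrative; the top-k filter is the paper's idea):

```python
import torch

def topk_generator_loss(critic_scores, k):
    """Top-k generator update (sketch).

    critic_scores: (B,) critic outputs on generated samples,
                   differentiable w.r.t. the generator's parameters.
    """
    # Keep only the k samples the critic scores as most realistic;
    # the remaining batch elements contribute zero gradient.
    top_scores, _ = torch.topk(critic_scores, k)
    # Non-saturating generator loss on the survivors (loss form is an
    # assumption; the top-k filter is the paper's one-line change).
    return -top_scores.mean()
```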
arXiv Detail & Related papers (2020-02-14T19:27:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.