Boosting CVaR Policy Optimization with Quantile Gradients
- URL: http://arxiv.org/abs/2601.22100v2
- Date: Wed, 04 Feb 2026 21:55:23 GMT
- Title: Boosting CVaR Policy Optimization with Quantile Gradients
- Authors: Yudong Luo, Erick Delage
- Abstract summary: We improve Conditional Value-at-Risk (CVaR) optimization using policy gradient (a.k.a. CVaR-PG). Quantile optimization admits a dynamic programming formulation that leverages all sampled data, thus improving sample efficiency. Empirical results in domains with verifiable risk-averse behavior show that our algorithm within the Markovian policy class substantially improves upon CVaR-PG.
- Score: 10.868006419885601
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optimizing Conditional Value-at-Risk (CVaR) using policy gradient (a.k.a. CVaR-PG) faces significant challenges of sample inefficiency. This inefficiency stems from the fact that it focuses on tail-end performance and overlooks many sampled trajectories. We address this problem by augmenting CVaR with an expected quantile term. Quantile optimization admits a dynamic programming formulation that leverages all sampled data, thus improving sample efficiency. This does not alter the CVaR objective, since CVaR corresponds to the expectation of the quantile over the tail. Empirical results in domains with verifiable risk-averse behavior show that our algorithm within the Markovian policy class substantially improves upon CVaR-PG and consistently outperforms other existing methods.
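As a hedged illustration only (not the authors' code; `alpha`, `returns`, and `log_prob_grads` are synthetic placeholders), the NumPy sketch below checks the identity that CVaR is the average of the quantile function over the tail, and shows why a common CVaR-PG estimator leaves most trajectories with zero gradient weight while a quantile (pinball) objective touches every sample:

```python
# Minimal sketch, not the paper's implementation; all data below is synthetic.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1                               # tail level
returns = rng.normal(size=10_000)         # sampled trajectory returns R_i

# Value-at-Risk: the alpha-quantile of the return distribution.
var_alpha = np.quantile(returns, alpha)

# CVaR two ways: (a) mean of the worst-alpha returns,
# (b) average of the quantile function over u in (0, alpha].
cvar_tail_mean = returns[returns <= var_alpha].mean()
us = np.linspace(alpha / 200, alpha, 200)
cvar_quantile_avg = np.quantile(returns, us).mean()
print(cvar_tail_mean, cvar_quantile_avg)  # close for large samples

# A common CVaR-PG estimator: only trajectories in the lower tail get nonzero
# weight, so roughly a (1 - alpha) fraction of the batch is discarded.
log_prob_grads = rng.normal(size=(returns.size, 5))   # placeholder grad log pi
tail = returns <= var_alpha
weights = np.where(tail, (returns - var_alpha) / alpha, 0.0)
cvar_pg_grad = (weights[:, None] * log_prob_grads).mean(axis=0)

# A quantile (pinball) loss at level alpha uses every sample, which is one way
# to see the sample-efficiency lever the abstract refers to.
pinball = np.where(returns >= var_alpha,
                   alpha * (returns - var_alpha),
                   (alpha - 1) * (returns - var_alpha)).mean()
print(pinball)
```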
Related papers
- Return Capping: Sample-Efficient CVaR Policy Gradient Optimisation [16.74312997149021]
We propose a reformulation of the optimisation problem by capping the total return of trajectories used in training. We show that this is equivalent to the original problem if the cap is set appropriately.
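A minimal sketch of the capping idea as summarized above, assuming a fixed cap; this is not the paper's implementation, and `returns`, `log_prob_grads`, and `cap` are illustrative placeholders:

```python
# Illustrative only: cap trajectory returns before forming the policy gradient,
# so trajectories far above the cap stop dominating the update.
import numpy as np

def capped_pg_grad(returns, log_prob_grads, cap):
    capped = np.minimum(returns, cap)        # cap the total return of each trajectory
    baseline = capped.mean()                 # simple baseline for variance reduction
    return ((capped - baseline)[:, None] * log_prob_grads).mean(axis=0)
```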
arXiv Detail & Related papers (2025-04-29T16:04:16Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
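For context, a hedged sketch of the generic importance-sampling-reweighted policy gradient that this line of work builds on (not the paper's variance-optimal behavioral-policy search; all names are placeholders):

```python
# Illustrative only: reuse trajectories collected under a behavioral policy by
# reweighting each one with the ratio pi_target / pi_behavior.
import numpy as np

def is_pg_grad(returns, log_prob_grads, logp_target, logp_behavior):
    w = np.exp(logp_target - logp_behavior)  # per-trajectory importance weights
    return ((w * returns)[:, None] * log_prob_grads).mean(axis=0)
```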
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Risk-averse Learning with Non-Stationary Distributions [18.15046585146849]
In this paper, we investigate risk-averse online optimization where the distribution of the random cost changes over time.
We minimize a risk-averse objective function using Conditional Value at Risk (CVaR) as the risk measure.
We show that our designed learning algorithm achieves sub-linear dynamic regret with high probability for both convex and strongly convex functions.
arXiv Detail & Related papers (2024-04-03T18:16:47Z) - A Simple Mixture Policy Parameterization for Improving Sample Efficiency of CVaR Optimization [33.752940941471756]
Reinforcement learning algorithms utilizing policy gradients (PG) to optimize Conditional Value at Risk (CVaR) face significant challenges with sample inefficiency.
We propose a simple mixture policy parameterization that integrates a risk-neutral policy with an adjustable policy to form a risk-averse policy.
Our empirical study reveals that this mixture parameterization is uniquely effective across a variety of benchmark domains.
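A hedged sketch of what such a mixture parameterization can look like (not the paper's exact construction; `risk_neutral_policy`, `adjustable_policy`, and the mixing weight `w` are placeholders):

```python
# Illustrative only: sample actions from a mixture of a risk-neutral policy and
# an adjustable policy; tuning w and the adjustable component shapes risk aversion.
import numpy as np

def mixture_action(state, risk_neutral_policy, adjustable_policy, w, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < w:
        return adjustable_policy(state)
    return risk_neutral_policy(state)
```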
arXiv Detail & Related papers (2024-03-17T02:24:09Z) - Robust Stochastic Optimization via Gradient Quantile Clipping [6.2844649973308835]
We introduce a quantile clipping strategy for Stochastic Gradient Descent (SGD).
Quantiles of the gradient norm are used as clipping thresholds, making the updates robust to outliers.
We propose an implementation of the algorithm using Huberiles.
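A hedged sketch of quantile-based gradient clipping in plain SGD (the general mechanism named above, not the paper's estimator; `grad_fn`, `params`, `lr`, and `q` are placeholders):

```python
# Illustrative only: clip each SGD step at a running quantile of recent gradient norms.
import numpy as np
from collections import deque

def quantile_clipped_sgd(params, grad_fn, lr=0.01, q=0.9, steps=1000, window=100):
    norms = deque(maxlen=window)
    for _ in range(steps):
        g = grad_fn(params)
        norm = np.linalg.norm(g)
        norms.append(norm)
        tau = np.quantile(norms, q)          # clipping threshold = running quantile
        if norm > tau:
            g = g * (tau / norm)             # shrink heavy-tailed / outlier gradients
        params = params - lr * g
    return params
```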
arXiv Detail & Related papers (2023-09-29T15:24:48Z) - Risk-Sensitive Markov Decision Processes with Long-Run CVaR Criterion [9.792654108885676]
We study the infinite-horizon discrete-time Markov decision process (MDP) with a long-run CVaR criterion.
We derive a pseudo CVaR difference formula which quantifies the difference of long-run CVaR under any two policies.
We develop a policy iteration type algorithm to efficiently optimize CVaR, which is shown to converge to local optima in the mixed policy space.
arXiv Detail & Related papers (2022-10-17T04:29:14Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Large-scale Optimization of Partial AUC in a Range of False Positive Rates [51.12047280149546]
The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning.
We develop an efficient approximated gradient descent method based on recent practical envelope smoothing technique.
Our proposed algorithm can also be used to minimize the sum of some ranked range loss, which also lacks efficient solvers.
arXiv Detail & Related papers (2022-03-03T03:46:18Z) - Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
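A hedged sketch of the reweighting idea (not VA-OPE itself): in a fitted-Q-style least-squares step, each regression target is weighted by the inverse of an estimated variance. `phi`, `targets`, and `var_estimates` are placeholders:

```python
# Illustrative only: variance-weighted least squares for one fitted-Q-style regression step.
import numpy as np

def weighted_fqi_step(phi, targets, var_estimates, reg=1e-3):
    # Minimize sum_i (phi_i @ w - target_i)^2 / var_i with ridge regularization.
    weights = 1.0 / np.maximum(var_estimates, 1e-6)
    A = (phi * weights[:, None]).T @ phi + reg * np.eye(phi.shape[1])
    b = (phi * weights[:, None]).T @ targets
    return np.linalg.solve(A, b)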
arXiv Detail & Related papers (2021-06-22T17:58:46Z) - Offline RL Without Off-Policy Evaluation [49.11859771578969]
We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well.
This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark.
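A hedged sketch of one instance of the one-step recipe (a KL-regularized improvement against a behavior-policy Q estimate, for a discrete action space; not the paper's exact algorithm, and all names are placeholders):

```python
# Illustrative only: one step of KL-regularized policy improvement against an
# on-policy Q estimate of the behavior policy, pi(a|s) proportional to
# pi_b(a|s) * exp(Q_b(s,a) / beta). Arrays have shape (n_states, n_actions).
import numpy as np

def one_step_improvement(q_behavior, behavior_probs, beta=1.0):
    logits = np.log(behavior_probs + 1e-8) + q_behavior / beta
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```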
arXiv Detail & Related papers (2021-06-16T16:04:26Z) - Stochastic Optimization of Areas Under Precision-Recall Curves with Provable Convergence [66.83161885378192]
Area under ROC (AUROC) and precision-recall curves (AUPRC) are common metrics for evaluating classification performance for imbalanced problems.
We propose a technical method to optimize AUPRC for deep learning.
arXiv Detail & Related papers (2021-04-18T06:22:21Z) - Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.