A Natural Actor-Critic Algorithm with Downside Risk Constraints
- URL: http://arxiv.org/abs/2007.04203v1
- Date: Wed, 8 Jul 2020 15:44:33 GMT
- Title: A Natural Actor-Critic Algorithm with Downside Risk Constraints
- Authors: Thomas Spooner and Rahul Savani
- Abstract summary: We introduce a new Bellman equation that upper bounds the lower partial moment, circumventing its non-linearity.
We prove that this proxy for the lower partial moment is a contraction, and provide intuition into the stability of the algorithm by variance decomposition.
We extend the method to use natural policy gradients and demonstrate the effectiveness of our approach on three benchmark problems for risk-sensitive reinforcement learning.
- Score: 5.482532589225552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing work on risk-sensitive reinforcement learning - both for symmetric
and downside risk measures - has typically used direct Monte-Carlo estimation
of policy gradients. While this approach yields unbiased gradient estimates, it
also suffers from high variance and decreased sample efficiency compared to
temporal-difference methods. In this paper, we study prediction and control
with aversion to downside risk which we gauge by the lower partial moment of
the return. We introduce a new Bellman equation that upper bounds the lower
partial moment, circumventing its non-linearity. We prove that this proxy for
the lower partial moment is a contraction, and provide intuition into the
stability of the algorithm by variance decomposition. This allows
sample-efficient, on-line estimation of partial moments. For risk-sensitive
control, we instantiate Reward Constrained Policy Optimization, a recent
actor-critic method for finding constrained policies, with our proxy for the
lower partial moment. We extend the method to use natural policy gradients and
demonstrate the effectiveness of our approach on three benchmark problems for
risk-sensitive reinforcement learning.
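For orientation, here is a brief LaTeX sketch of the standard quantities the abstract refers to: the lower partial moment (LPM) of the return about a fixed target, the constrained control problem it induces in an RCPO-style setup, and the natural-gradient direction. The target tau, order n, and risk budget c are illustrative notation; the paper's Bellman-style upper bound on the LPM is not reproduced here.

```latex
% Illustrative notation; the paper's exact symbols and constraint form may differ.
% Lower partial moment of order n of the return G about a target tau:
\[
  \mathrm{LPM}_n(\tau; G) \;=\; \mathbb{E}\big[ \max(\tau - G,\, 0)^n \big]
\]
% RCPO-style constrained control with downside-risk budget c, handled via a Lagrangian:
\[
  \max_{\theta}\ \mathbb{E}_{\pi_\theta}[G]
  \quad \text{s.t.} \quad \mathrm{LPM}_n(\tau; G) \le c
  \;\;\Longrightarrow\;\;
  \mathcal{L}(\theta, \lambda) \;=\; \mathbb{E}_{\pi_\theta}[G] - \lambda \big( \mathrm{LPM}_n(\tau; G) - c \big)
\]
% Natural policy gradient: precondition by the Fisher information matrix F(theta):
\[
  \tilde{\nabla}_\theta J \;=\; F(\theta)^{-1} \nabla_\theta J,
  \qquad
  F(\theta) = \mathbb{E}\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \big].
\]
```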
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
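As a minimal illustration of the sample reuse described above, the following NumPy sketch computes a per-sample importance-sampling-weighted REINFORCE-style gradient estimate. It is a generic, simplified form (it ignores trajectory-level weight products), not the estimator or the variance-minimizing behavioural policy proposed in that paper.

```python
import numpy as np

def is_weighted_pg_estimate(grad_log_pi, returns, logp_target, logp_behaviour):
    """Simplified importance-sampling-weighted policy-gradient estimate.

    grad_log_pi    : (N, d) score vectors grad_theta log pi_theta(a_i | s_i)
    returns        : (N,)   return (or advantage) associated with each sample
    logp_target    : (N,)   log pi_theta(a_i | s_i) under the target policy
    logp_behaviour : (N,)   log b(a_i | s_i) under the behavioural policy
    """
    w = np.exp(logp_target - logp_behaviour)               # per-sample importance weights
    return (w[:, None] * grad_log_pi * returns[:, None]).mean(axis=0)
```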
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Risk-averse Learning with Non-Stationary Distributions [18.15046585146849]
In this paper, we investigate risk-averse online optimization where the distribution of the random cost changes over time.
We minimize a risk-averse objective function using the Conditional Value at Risk (CVaR) as the risk measure.
We show that our designed learning algorithm achieves sub-linear dynamic regret with high probability for both convex and strongly convex functions.
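For concreteness, a minimal sketch of the empirical CVaR of a batch of costs (the mean of the worst tail of the sample); conventions for the confidence level vary, and the cited paper's online estimator is not reproduced here.

```python
import numpy as np

def empirical_cvar(costs, alpha=0.95):
    """Empirical Conditional Value at Risk of a batch of costs: the average of
    the worst (1 - alpha) fraction of observed costs (Rockafellar-Uryasev style,
    applied to the empirical distribution)."""
    costs = np.asarray(costs, dtype=float)
    var = np.quantile(costs, alpha)        # Value at Risk at level alpha
    tail = costs[costs >= var]             # costs at or beyond the VaR threshold
    return tail.mean()
```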
arXiv Detail & Related papers (2024-04-03T18:16:47Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
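As rough background only, uncertainty Bellman equations generally propagate a local one-step uncertainty signal through a Bellman-style recursion; a generic template of this kind (not the specific UBE derived in that paper, whose fixed point matches the posterior variance over values) is:

```latex
% u(s,a): local (one-step) uncertainty term; U: propagated uncertainty over values.
\[
  U(s, a) \;=\; u(s, a) \;+\; \gamma^2\, \mathbb{E}_{s' \sim P,\ a' \sim \pi}\big[ U(s', a') \big]
\]
```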
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- An Alternative to Variance: Gini Deviation for Risk-averse Policy Gradient [35.01235012813407]
Restricting the variance of a policy's return is a popular choice in risk-averse Reinforcement Learning.
Recent methods restrict the per-step reward variance as a proxy.
We propose to use an alternative risk measure, Gini deviation, as a substitute.
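A minimal sketch of an empirical Gini deviation, taking the commonly used definition of half the expected absolute difference between two i.i.d. copies of the return; the cited paper's gradient formulation is not reproduced here.

```python
import numpy as np

def gini_deviation(samples):
    """Empirical Gini deviation: half the mean absolute difference over all
    ordered pairs of distinct samples (a U-statistic estimate)."""
    x = np.asarray(samples, dtype=float)
    n = len(x)
    pairwise = np.abs(x[:, None] - x[None, :])   # |x_i - x_j| for all i, j
    return 0.5 * pairwise.sum() / (n * (n - 1))  # diagonal terms are zero
```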
arXiv Detail & Related papers (2023-07-17T22:08:27Z)
- Vector-Valued Least-Squares Regression under Output Regularity Assumptions [73.99064151691597]
We propose and analyse a reduced-rank method for solving least-squares regression problems with infinite-dimensional output.
We derive learning bounds for our method and study in which settings its statistical performance improves over the full-rank method.
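As an illustration only, a common finite-dimensional heuristic for reduced-rank least-squares is to fit a ridge solution and truncate its SVD; the cited paper's estimator for infinite-dimensional outputs and its analysis are considerably more refined.

```python
import numpy as np

def reduced_rank_ridge(X, Y, rank, reg=1e-6):
    """Fit a ridge least-squares coefficient matrix and truncate it to the
    requested rank via SVD (a simple heuristic, not the paper's estimator)."""
    d = X.shape[1]
    B = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ Y)   # (d, p) ridge solution
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]                # best rank-r approximation of B
```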
arXiv Detail & Related papers (2022-11-16T15:07:00Z)
- Risk-aware linear bandits with convex loss [0.0]
We propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits.
This approach requires solving a convex problem at each round of the algorithm, which we relax by allowing an approximate solution obtained by online gradient descent.
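The relaxation mentioned above amounts to replacing an exact per-round convex solve with a cheap gradient step; a generic projected online-gradient-descent update is sketched below (illustrative only; the paper's precise update and feasible set may differ).

```python
import numpy as np

def projected_ogd_step(x, grad, step_size, radius):
    """One projected online-gradient-descent step on an l2 ball of the given
    radius: a generic cheap substitute for an exact per-round convex solve."""
    x_new = x - step_size * grad
    norm = np.linalg.norm(x_new)
    if norm > radius:                      # project back onto the feasible ball
        x_new *= radius / norm
    return x_new
```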
arXiv Detail & Related papers (2022-09-15T09:09:53Z)
- A Temporal-Difference Approach to Policy Gradient Estimation [27.749993205038148]
We propose a new approach to reconstructing the policy gradient from the start state without requiring a particular sampling strategy.
By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that sidesteps the distribution shift issue in a model-free way.
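The recursion that makes a TD-learnable "gradient critic" possible is the Bellman-like identity satisfied by the gradient of the value function; in standard notation (the paper's off-policy treatment adds further machinery):

```latex
\[
  \nabla_\theta V^{\pi_\theta}(s)
  \;=\;
  \mathbb{E}_{a \sim \pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \big]
  \;+\;
  \gamma\, \mathbb{E}_{a \sim \pi_\theta,\ s' \sim P}\big[ \nabla_\theta V^{\pi_\theta}(s') \big]
\]
```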
arXiv Detail & Related papers (2022-02-04T21:23:33Z)
- Variance Penalized On-Policy and Off-Policy Actor-Critic [60.06593931848165]
We propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return.
Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.
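A minimal sketch of the kind of mean-variance criterion involved (generic form; the cited paper's per-step formulation and actor-critic updates are not reproduced):

```python
import numpy as np

def variance_penalized_score(returns, penalty):
    """Sample estimate of J = E[G] - penalty * Var(G) from a batch of returns."""
    g = np.asarray(returns, dtype=float)
    return g.mean() - penalty * g.var()
```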
arXiv Detail & Related papers (2021-02-03T10:06:16Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Cautious Reinforcement Learning via Distributional Risk in the Dual Domain [45.17200683056563]
We study the estimation of risk-sensitive policies in reinforcement learning problems defined by a Markov Decision Process (MDP) whose state and action spaces are countably finite.
We propose a new definition of risk, which we call caution, as a penalty function added to the dual objective of the linear programming (LP) formulation of reinforcement learning.
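For context, the dual (occupancy-measure) LP of a discounted MDP with a caution-style penalty can be sketched as follows; the notation and the specific penalty rho are illustrative, not taken verbatim from the paper.

```latex
% mu(s,a): normalized discounted occupancy measure; xi: initial-state distribution.
\[
  \max_{\mu \ge 0}\ \ \langle \mu, r \rangle \;-\; \epsilon\, \rho(\mu)
  \quad \text{s.t.} \quad
  \sum_{a} \mu(s, a) \;=\; (1 - \gamma)\, \xi(s) \;+\; \gamma \sum_{s', a'} P(s \mid s', a')\, \mu(s', a')
  \quad \forall s.
\]
```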
arXiv Detail & Related papers (2020-02-27T23:18:04Z)
- Statistically Efficient Off-Policy Policy Gradients [80.42316902296832]
We consider the statistically efficient estimation of policy gradients from off-policy data.
We propose a meta-algorithm that achieves the lower bound without any parametric assumptions.
We establish guarantees on the rate at which we approach a stationary point when we take steps in the direction of our new estimated policy gradient.
arXiv Detail & Related papers (2020-02-10T18:41:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.