A Risk-Sensitive Approach to Policy Optimization
- URL: http://arxiv.org/abs/2208.09106v2
- Date: Thu, 16 Nov 2023 03:51:30 GMT
- Title: A Risk-Sensitive Approach to Policy Optimization
- Authors: Jared Markowitz, Ryan W. Gardner, Ashley Llorens, Raman Arora, I-Jeng Wang
- Abstract summary: Standard deep reinforcement learning (DRL) aims to maximize expected reward, considering collected experiences equally in formulating a policy.
We propose a more direct approach whereby risk-sensitive objectives, specified in terms of the cumulative distribution function (CDF) of the distribution of full-episode rewards, are optimized.
We demonstrate that the use of moderately "pessimistic" risk profiles, which emphasize scenarios where the agent performs poorly, leads to enhanced exploration and a continual focus on addressing deficiencies.
- Score: 21.684251937825234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Standard deep reinforcement learning (DRL) aims to maximize expected reward,
considering collected experiences equally in formulating a policy. This differs
from human decision-making, where gains and losses are valued differently and
outlying outcomes are given increased consideration. It also fails to
capitalize on opportunities to improve safety and/or performance through the
incorporation of distributional context. Several approaches to distributional
DRL have been investigated, with one popular strategy being to evaluate the
projected distribution of returns for possible actions. We propose a more
direct approach whereby risk-sensitive objectives, specified in terms of the
cumulative distribution function (CDF) of the distribution of full-episode
rewards, are optimized. This approach allows for outcomes to be weighed based
on relative quality, can be used for both continuous and discrete action
spaces, and may naturally be applied in both constrained and unconstrained
settings. We show how to compute an asymptotically consistent estimate of the
policy gradient for a broad class of risk-sensitive objectives via sampling,
subsequently incorporating variance reduction and regularization measures to
facilitate effective on-policy learning. We then demonstrate that the use of
moderately "pessimistic" risk profiles, which emphasize scenarios where the
agent performs poorly, leads to enhanced exploration and a continual focus on
addressing deficiencies. We test the approach using different risk profiles in
six OpenAI Safety Gym environments, comparing to state-of-the-art on-policy
methods. Without cost constraints, we find that pessimistic risk profiles can
be used to reduce cost while improving total reward accumulation. With cost
constraints, they are seen to provide higher positive rewards than risk-neutral
approaches at the prescribed allowable cost.
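To make the CDF-based objective concrete, the sketch below illustrates one way such a sample-based gradient estimate can be formed: episodes are ranked by total return, each rank receives a weight derived from a pessimistic distortion of the empirical CDF (the power-law distortion g(u) = u**eta with eta < 1 is an illustrative assumption, not necessarily the paper's choice), and the per-episode score functions are combined with a weighted baseline for variance reduction.

```python
import numpy as np

def distortion_weights(n, eta=0.7):
    """Per-episode weights from a pessimistic distortion g(u) = u**eta applied
    to the empirical CDF of sorted episode returns; eta < 1 places more weight
    on the worst outcomes. The power-law form is an illustrative assumption."""
    u = np.arange(1, n + 1) / n
    g = u ** eta                                  # distorted CDF at ranks i/n
    return np.diff(np.concatenate(([0.0], g)))    # weights g(i/n) - g((i-1)/n)

def risk_sensitive_policy_gradient(returns, score_functions, eta=0.7):
    """REINFORCE-style estimate of the gradient of a CDF-weighted objective.

    returns:         (n,) total return of each episode
    score_functions: (n, d) rows are sum_t grad_theta log pi(a_t | s_t) per episode
    """
    returns = np.asarray(returns, dtype=float)
    score_functions = np.asarray(score_functions, dtype=float)
    order = np.argsort(returns)                   # rank episodes worst to best
    w = distortion_weights(len(returns), eta)     # weight each rank by CDF position
    baseline = np.sum(w * returns[order])         # weighted baseline (variance reduction)
    advantages = returns[order] - baseline
    return (w * advantages) @ score_functions[order]
```

With eta = 1 the weights are uniform and the estimator reduces to ordinary REINFORCE with a mean baseline; smaller eta shifts weight toward the worst-performing episodes, matching the "pessimistic" risk profiles discussed in the abstract.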
Related papers
- Data-driven decision-making under uncertainty with entropic risk measure [5.407319151576265]
The entropic risk measure is widely used in high-stakes decision making to account for tail risks associated with an uncertain loss.
To debias the empirical entropic risk estimator, we propose a strongly consistent bootstrapping procedure.
We show that cross validation methods can result in significantly higher out-of-sample risk for the insurer if the bias in validation performance is not corrected for.
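For reference, the entropic risk of a loss X at level theta is (1/theta) * log E[exp(theta * X)], and its plug-in estimate is biased in small samples. The sketch below shows the plug-in estimator together with a generic bootstrap bias correction; it is a minimal illustration, not the strongly consistent procedure proposed in the paper, and the function names are ours.

```python
import numpy as np

def entropic_risk(losses, theta=1.0):
    """Plug-in entropic risk (1/theta) * log(mean(exp(theta * X))),
    computed with a log-sum-exp shift for numerical stability."""
    x = theta * np.asarray(losses, dtype=float)
    m = x.max()
    return (m + np.log(np.mean(np.exp(x - m)))) / theta

def debiased_entropic_risk(losses, theta=1.0, n_boot=2000, seed=None):
    """Generic bootstrap bias correction: estimate the bias as the average
    excess of resampled estimates over the plug-in estimate and subtract it.
    Illustrative only; the cited paper gives a specific strongly consistent
    bootstrapping procedure."""
    rng = np.random.default_rng(seed)
    losses = np.asarray(losses, dtype=float)
    plug_in = entropic_risk(losses, theta)
    boot = [entropic_risk(rng.choice(losses, size=losses.size, replace=True), theta)
            for _ in range(n_boot)]
    return plug_in - (np.mean(boot) - plug_in)
```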
- Policy Gradient Methods for Risk-Sensitive Distributional Reinforcement Learning with Provable Convergence [15.720824593964027]
Risk-sensitive reinforcement learning (RL) is crucial for maintaining reliable performance in high-stakes applications.
This paper introduces a policy gradient method for risk-sensitive distributional RL with general coherent risk measures.
We also design a categorical distributional policy gradient algorithm (CDPG) based on categorical distributional policy evaluation and trajectory gradient estimation.
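As a concrete example of evaluating a coherent risk measure on the categorical return distributions used in distributional policy evaluation, the sketch below computes CVaR at level alpha from a set of atoms and probabilities; the interface is illustrative and not taken from CDPG.

```python
import numpy as np

def cvar_from_categorical(atoms, probs, alpha=0.1):
    """CVaR_alpha of a categorical return distribution: the expected return
    over the worst alpha-fraction of probability mass."""
    atoms, probs = np.asarray(atoms, dtype=float), np.asarray(probs, dtype=float)
    order = np.argsort(atoms)                     # sort support ascending
    z, p = atoms[order], probs[order]
    cum_before = np.cumsum(p) - p                 # mass strictly below each atom
    kept = np.minimum(p, np.maximum(alpha - cum_before, 0.0))  # tail mass kept per atom
    return float(kept @ z) / alpha
```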
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
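In the tabular case an uncertainty Bellman equation has a simple shape: local uncertainty plus a discounted expectation of next-state uncertainty under the policy. The fixed-point iteration below is a generic sketch of that recursion, not the specific UBE or the QU-SAC algorithm from the paper.

```python
import numpy as np

def solve_ube(local_uncertainty, P, pi, gamma=0.99, iters=1000):
    """Fixed-point iteration for a tabular uncertainty Bellman equation:
        U(s, a) = u(s, a) + gamma**2 * E_{s'~P(.|s,a), a'~pi(.|s')}[U(s', a')]
    Generic sketch; the cited paper derives a particular UBE whose solution
    converges to the posterior variance of values.

    local_uncertainty: (S, A) local uncertainties u(s, a)
    P:                 (S, A, S) transition probabilities
    pi:                (S, A) policy probabilities
    """
    u = np.asarray(local_uncertainty, dtype=float)
    U = np.zeros_like(u)
    for _ in range(iters):
        expected_next = np.einsum('sax,xb,xb->sa', P, pi, U)  # E[U(s', a')]
        U = u + gamma**2 * expected_next
    return U
```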
- Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation [54.72195809248172]
We present a new estimator based on a novel concept: retrospective reshuffling of participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
- Efficient Risk-Averse Reinforcement Learning [79.61412643761034]
In risk-averse reinforcement learning (RL), the goal is to optimize some risk measure of the returns.
We prove that under certain conditions this inevitably leads to a local-optimum barrier, and propose a soft risk mechanism to bypass it.
We demonstrate improved risk aversion in maze navigation, autonomous driving, and resource allocation benchmarks.
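One reading of the soft risk idea is to optimize the risk-neutral objective early in training and anneal the risk level toward its target, so that tail-focused updates are not starved of successful trajectories. The schedule and episode-selection helper below sketch that reading; the linear schedule and function names are assumptions, not the paper's exact mechanism.

```python
import numpy as np

def soft_risk_alpha(step, total_steps, target_alpha=0.05, warmup_frac=0.5):
    """Anneal the CVaR level from 1.0 (risk-neutral) down to target_alpha over
    the first warmup_frac of training. Linear schedule chosen for illustration."""
    progress = min(step / (warmup_frac * total_steps), 1.0)
    return 1.0 + progress * (target_alpha - 1.0)

def cvar_episode_mask(returns, alpha):
    """Boolean mask selecting the worst ceil(alpha * n) episodes in a batch,
    i.e. the episodes a CVaR_alpha policy update would keep (ties may admit
    a few extra episodes)."""
    returns = np.asarray(returns, dtype=float)
    k = max(1, int(np.ceil(alpha * returns.size)))
    cutoff = np.sort(returns)[k - 1]
    return returns <= cutoff
```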
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
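The trade-off PG-BROIL targets can be summarized as a convex blend of expected performance and tail risk taken over posterior reward hypotheses. The sketch below evaluates such a blended objective for a batch of hypothesis-conditional returns; the posterior inference and gradient estimation of PG-BROIL are omitted, and the argument names are illustrative.

```python
import numpy as np

def broil_style_objective(returns_by_hypothesis, lam=0.5, alpha=0.2):
    """Convex blend of expected performance and tail risk over posterior reward
    hypotheses: lam * mean + (1 - lam) * CVaR_alpha, both taken across hypotheses.

    returns_by_hypothesis: (H,) expected policy return under each reward
                           hypothesis sampled from the posterior.
    """
    r = np.sort(np.asarray(returns_by_hypothesis, dtype=float))
    k = max(1, int(np.ceil(alpha * r.size)))
    cvar = r[:k].mean()                    # average over the worst alpha-fraction
    return lam * r.mean() + (1.0 - lam) * cvar
```

Setting lam = 1 recovers risk-neutral behavior and lam = 0 a purely risk-averse objective, which is the interpolation the summary above describes.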
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
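The Fenchel dual variable mentioned here typically enters through the identity (E[R])**2 = max_y (2*y*E[R] - y**2), which makes the variance term amenable to stochastic updates. The sketch below shows that decomposition with simple dual-ascent steps under an assumed step size; it is not the cited actor-critic's update rules.

```python
import numpy as np

def mean_variance_lagrangian_step(returns, lam, y, var_limit, lam_lr=0.01):
    """One batch of bookkeeping for a variance-constrained objective using the
    Fenchel dual of the squared mean, (E[R])**2 = max_y (2*y*E[R] - y**2)."""
    r = np.asarray(returns, dtype=float)
    mean_r, mean_r2 = r.mean(), (r ** 2).mean()
    # Surrogate variance E[R^2] - (2*y*E[R] - y^2); upper-bounds Var(R), tight at y = E[R].
    var_surrogate = mean_r2 - (2.0 * y * mean_r - y ** 2)
    lagrangian = mean_r - lam * (var_surrogate - var_limit)   # objective the actor would ascend
    y_new = mean_r                                            # closed-form inner maximization
    lam_new = max(0.0, lam + lam_lr * (var_surrogate - var_limit))  # dual ascent on the multiplier
    return lagrangian, y_new, lam_new
```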
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or more logged datasets.
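For orientation, ordinary per-trajectory importance sampling is the usual starting point for off-policy evaluation from logged data; the cited framework constructs robust and optimistic estimates beyond it. The sketch below assumes a simple trajectory format and is purely illustrative.

```python
import numpy as np

def per_trajectory_is_estimate(trajectories, target_policy_prob):
    """Ordinary per-trajectory importance-sampling estimate of a target
    policy's expected return from logged data: each logged return is weighted
    by the product of per-step likelihood ratios.

    trajectories: iterable of (states, actions, behavior_probs, total_reward)
    target_policy_prob(s, a): probability the target policy assigns to action a in state s
    """
    estimates = []
    for states, actions, behavior_probs, total_reward in trajectories:
        ratio = 1.0
        for s, a, b in zip(states, actions, behavior_probs):
            ratio *= target_policy_prob(s, a) / b     # per-step importance ratio
        estimates.append(ratio * total_reward)
    return float(np.mean(estimates))
```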
- Bayesian Robust Optimization for Imitation Learning [34.40385583372232]
Inverse reinforcement learning can enable generalization to new states by learning a parameterized reward function.
Existing safe imitation learning approaches based on IRL deal with this uncertainty using a maxmin framework.
BROIL provides a natural way to interpolate between return-maximizing and risk-minimizing behaviors.
This list is automatically generated from the titles and abstracts of the papers on this site.