Risk-Sensitive Markov Decision Processes with Long-Run CVaR Criterion
- URL: http://arxiv.org/abs/2210.08740v1
- Date: Mon, 17 Oct 2022 04:29:14 GMT
- Title: Risk-Sensitive Markov Decision Processes with Long-Run CVaR Criterion
- Authors: Li Xia, Peter W. Glynn
- Abstract summary: We study the infinite-horizon discrete-time Markov decision process (MDP) with a long-run CVaR criterion.
We derive a pseudo CVaR difference formula which quantifies the difference of long-run CVaR under any two policies.
We develop a policy iteration type algorithm to efficiently optimize CVaR, which is shown to converge to local optima in the mixed policy space.
- Score: 9.792654108885676
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: CVaR (Conditional Value at Risk) is a risk metric widely used in finance.
However, dynamically optimizing CVaR is difficult since the resulting problem is not a standard
Markov decision process (MDP) and the principle of dynamic programming fails.
In this paper, we study the infinite-horizon discrete-time MDP with a long-run
CVaR criterion, from the view of sensitivity-based optimization. By introducing
a pseudo CVaR metric, we derive a CVaR difference formula which quantifies the
difference of long-run CVaR under any two policies. The optimality of
deterministic policies is derived. We obtain a so-called Bellman local
optimality equation for CVaR, which is a necessary and sufficient condition for
locally optimal policies and only a necessary condition for globally optimal policies. A CVaR
derivative formula is also derived for providing more sensitivity information.
Then we develop a policy iteration type algorithm to efficiently optimize CVaR,
which is shown to converge to local optima in the mixed policy space. We
further discuss some extensions including the mean-CVaR optimization and the
maximization of CVaR. Finally, we conduct numerical experiments relating to
portfolio management to demonstrate the main results. Our work may shed light
on dynamically optimizing CVaR from a sensitivity viewpoint.
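For context, the CVaR appearing in the criterion admits the standard Rockafellar-Uryasev representation shown below; the second display is only one plausible way to write a long-run version of the criterion (a reading of the abstract in terms of time-averaged excess costs), since the paper itself works through a pseudo CVaR metric that is not reproduced here.

```latex
% Rockafellar--Uryasev representation of CVaR at level \alpha for a cost X:
\mathrm{CVaR}_{\alpha}(X) \;=\; \min_{y \in \mathbb{R}}
  \Big\{ \, y + \tfrac{1}{\alpha}\, \mathbb{E}\big[(X - y)^{+}\big] \Big\}

% One plausible long-run form for a policy \pi with per-stage cost c(X_t, A_t)
% (an assumption for illustration, not the paper's exact definition):
J_{\alpha}(\pi) \;=\; \min_{y \in \mathbb{R}}
  \Big\{ \, y + \tfrac{1}{\alpha}\,
  \lim_{T \to \infty} \tfrac{1}{T} \sum_{t=0}^{T-1}
  \mathbb{E}_{\pi}\big[ (c(X_t, A_t) - y)^{+} \big] \Big\}
```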
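The sketch below shows how such a criterion could be attacked with textbook average-cost policy iteration applied to the pseudo cost $(c - y)^{+}$, searching over the Rockafellar-Uryasev variable $y$ on a grid. It is a minimal schematic under stated assumptions (a finite unichain MDP with per-stage cost $c(s,a)$, and the long-run reading in the formulas above); the function names, the grid over $y$, and the overall structure are illustrative and do not reproduce the pseudo-CVaR-based policy iteration developed in the paper.

```python
import numpy as np

def avg_cost_policy_iteration(P, g, max_iter=1000):
    """Policy iteration for the long-run average cost of a finite unichain MDP.

    P: (A, S, S) array, P[a, s, s'] = transition probability.
    g: (S, A) array, per-stage cost.
    Returns (policy, gain): a deterministic policy and its long-run average cost.
    """
    A, S, _ = P.shape
    policy = np.zeros(S, dtype=int)
    rho = np.inf
    for _ in range(max_iter):
        # Policy evaluation: solve the Poisson equation
        #   h(s) + rho = g(s, pi(s)) + sum_{s'} P[pi(s), s, s'] h(s'),
        # normalized by h(0) = 0, for the unknowns (rho, h(1), ..., h(S-1)).
        P_pi = P[policy, np.arange(S), :]        # (S, S) transitions under pi
        g_pi = g[np.arange(S), policy]           # (S,) costs under pi
        M = np.eye(S) - P_pi
        M[:, 0] = 1.0                            # first unknown is rho, since h(0) = 0
        sol = np.linalg.solve(M, g_pi)
        rho, h = sol[0], sol.copy()
        h[0] = 0.0
        # Policy improvement: greedy step on Q(s, a) = g(s, a) + sum_{s'} P[a, s, s'] h(s').
        Q = g + np.einsum('ast,t->sa', P, h)
        new_policy = Q.argmin(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, rho

def long_run_cvar_sketch(P, c, alpha, y_grid):
    """Schematic minimization of a long-run CVaR at level alpha over deterministic policies.

    For each grid value y, optimizes the long-run average of the pseudo cost
    (c - y)^+ and evaluates y + (1/alpha) * average; the smallest such value
    approximates the optimal long-run CVaR under the reading sketched above.
    """
    best_value, best_y, best_policy = np.inf, None, None
    for y in y_grid:
        g = np.maximum(c - y, 0.0)               # pseudo cost (c(s, a) - y)^+
        policy, rho = avg_cost_policy_iteration(P, g)
        value = y + rho / alpha
        if value < best_value:
            best_value, best_y, best_policy = value, y, policy
    return best_value, best_y, best_policy

# Toy usage on a random 2-action, 3-state MDP (illustrative only).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(2, 3))       # (A, S, S) row-stochastic transitions
c = rng.uniform(0.0, 1.0, size=(3, 2))           # (S, A) per-stage costs
print(long_run_cvar_sketch(P, c, alpha=0.1, y_grid=np.linspace(0.0, 1.0, 101)))
```

By contrast, the paper's algorithm uses the pseudo CVaR difference formula to compare and improve policies directly on the long-run CVaR objective, which is what yields convergence to local optima in the mixed policy space.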
Related papers
- Stationary Policies are Optimal in Risk-averse Total-reward MDPs with EVaR [12.719528972742394]
We show that the risk-averse total reward criterion can be optimized by a stationary policy.
Our results indicate that the total reward criterion may be preferable to the discounted criterion in a broad range of risk-averse reinforcement learning domains.
arXiv Detail & Related papers (2024-08-30T13:33:18Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Risk-Sensitive RL with Optimized Certainty Equivalents via Reduction to
Standard RL [48.1726560631463]
We study Risk-Sensitive Reinforcement Learning with the Optimized Certainty Equivalent (OCE) risk.
We propose two general meta-algorithms via reductions to standard RL.
We show that the resulting approach learns the optimal risk-sensitive policy in settings where prior algorithms provably fail.
arXiv Detail & Related papers (2024-03-10T21:45:12Z) - Provably Efficient CVaR RL in Low-rank MDPs [58.58570425202862]
We study risk-sensitive Reinforcement Learning (RL) with the CVaR objective in low-rank MDPs.
We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to balance the interplay between exploration, exploitation, and representation learning in CVaR RL.
We prove that our algorithm achieves a polynomial sample complexity for learning an $\epsilon$-optimal CVaR, where $H$ is the length of each episode, $A$ is the capacity of the action space, and $d$ is the dimension of the representations.
arXiv Detail & Related papers (2023-11-20T17:44:40Z) - Forward-PECVaR Algorithm: Exact Evaluation for CVaR SSPs [1.347733333991357]
Conditional Value at Risk (CVaR) is a criterion that allows modeling an arbitrary level of risk.
We propose a new algorithm, Forward-PECVaR, that exactly evaluates stationary policies of CVaR-SSPs with non-uniform costs.
arXiv Detail & Related papers (2023-03-01T17:10:22Z) - Lexicographic Optimisation of Conditional Value at Risk and Expected
Value for Risk-Averse Planning in MDPs [4.87191262649216]
Planning in Markov decision processes (MDPs) typically optimises the expected cost.
An alternative approach is to find a policy which optimises a risk-averse objective such as conditional value at risk (CVaR).
We formulate the lexicographic optimisation problem of minimising the expected cost subject to the constraint that the CVaR of the total cost is optimal.
arXiv Detail & Related papers (2021-10-25T09:16:50Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - Stochastic Optimization of Areas Under Precision-Recall Curves with
Provable Convergence [66.83161885378192]
Area under the ROC curve (AUROC) and area under the precision-recall curve (AUPRC) are common metrics for evaluating classification performance on imbalanced problems.
We propose a technical method to optimize AUPRC for deep learning.
arXiv Detail & Related papers (2021-04-18T06:22:21Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average-reward setting with the variance risk criterion.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - Robust Reinforcement Learning using Least Squares Policy Iteration with
Provable Performance Guarantees [3.8073142980733]
This paper addresses the problem of model-free reinforcement learning for Robust Markov Decision Processes (RMDPs) with large state spaces.
We first propose the Robust Least Squares Policy Evaluation algorithm, which is a multi-step online model-free learning algorithm for policy evaluation.
We then propose Robust Least Squares Policy Iteration (RLSPI) algorithm for learning the optimal robust policy.
arXiv Detail & Related papers (2020-06-20T16:26:50Z)