Risk-Sensitive Markov Decision Processes with Combined Metrics of Mean and Variance
- URL: http://arxiv.org/abs/2008.03707v1
- Date: Sun, 9 Aug 2020 10:35:35 GMT
- Title: Risk-Sensitive Markov Decision Processes with Combined Metrics of Mean and Variance
- Authors: Li Xia
- Abstract summary: This paper investigates the optimization problem of an infinite-stage discrete-time Markov decision process (MDP) with a long-run average metric.
A performance difference formula is derived that quantifies the difference of the mean-variance combined metric between any two policies.
A necessary condition for the optimal policy and the optimality of deterministic policies are derived.
- Score: 3.062772835338966
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates the optimization problem of an infinite-stage discrete-time Markov decision process (MDP) with a long-run average metric that considers the mean and variance of rewards together. Such a performance metric is important since the mean indicates the average return and the variance indicates risk or fairness. However, because the variance metric couples the rewards across all stages, traditional dynamic programming is inapplicable: the principle of time consistency fails. We study this problem from a new perspective, the sensitivity-based optimization theory. A performance difference formula is derived that quantifies the difference of the mean-variance combined metric between any two policies. The difference formula can be utilized to generate new policies with strictly improved mean-variance performance. A necessary condition for the optimal policy and the optimality of deterministic policies are derived. We further develop an iterative algorithm in the form of policy iteration, which is proved to converge to local optima in both the mixed and randomized policy spaces. In particular, when the mean reward is constant across policies, the algorithm is guaranteed to converge to the global optimum. Finally, we apply our approach to study the fluctuation reduction of wind power in an energy storage system, which demonstrates the potential applicability of our optimization method.
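To make the iterate-and-improve structure concrete, here is a minimal Python sketch. It assumes a weighted combined metric J(pi) = mean - beta * variance of the steady-state reward and a greedy one-state-at-a-time switching rule; the metric form, the toy MDP, and the switching rule are illustrative assumptions, not the paper's exact algorithm.

```python
# Minimal sketch (not the paper's algorithm): local policy search for a
# steady-state mean-variance combined metric J(pi) = mean - beta * variance
# on a tiny random MDP. All numbers below are illustrative assumptions.
import numpy as np

n_states, n_actions, beta = 3, 2, 0.5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, s']
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # r(s, a)

def stationary_dist(P_pi):
    """Stationary distribution of an (assumed ergodic) transition matrix."""
    A = np.vstack([P_pi.T - np.eye(n_states), np.ones(n_states)])
    b = np.zeros(n_states + 1)
    b[-1] = 1.0  # normalization constraint: probabilities sum to one
    return np.linalg.lstsq(A, b, rcond=None)[0]

def combined_metric(pi):
    """J(pi) = steady-state mean reward - beta * steady-state reward variance."""
    P_pi = P[pi, np.arange(n_states)]       # row s follows action pi[s]
    r_pi = R[np.arange(n_states), pi]
    rho = stationary_dist(P_pi)
    mean = rho @ r_pi
    var = rho @ (r_pi - mean) ** 2
    return mean - beta * var

# Greedy one-state-at-a-time improvement: accept any action switch that
# strictly increases J; stop at a local optimum over deterministic policies.
pi = np.zeros(n_states, dtype=int)
improved = True
while improved:
    improved = False
    for s in range(n_states):
        for a in range(n_actions):
            cand = pi.copy()
            cand[s] = a
            if combined_metric(cand) > combined_metric(pi) + 1e-12:
                pi, improved = cand, True

print("locally optimal policy:", pi, "J =", combined_metric(pi))
```

Each accepted switch strictly improves J, so the loop terminates at a policy that no single-state deviation can improve, mirroring the local-optimum convergence described above.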
Related papers
- Strongly-polynomial time and validation analysis of policy gradient methods [3.722665817361884]
This paper proposes a novel termination criterion, termed the advantage gap function, for finite state and action Markov decision processes (MDPs) and reinforcement learning (RL).
By incorporating this advantage gap function into the design of step size rules, we derive a new linear rate of convergence that is independent of the stationary state distribution of the optimal policy.
This is the first time that such strong convergence properties have been established for policy gradient methods.
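The abstract does not spell the criterion out; one natural reading, offered as an illustration rather than the paper's exact definition, terminates once no state-action pair retains a significant one-step advantage:

```latex
% Illustrative advantage-gap termination test (our reading, not
% necessarily the paper's exact definition):
\[
  A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s), \qquad
  \operatorname{gap}(\pi) = \max_{s \in \mathcal{S}} \max_{a \in \mathcal{A}} A^{\pi}(s,a),
\]
\[
  \text{terminate when } \operatorname{gap}(\pi) \le \epsilon .
\]
```

Since the per-state maximum advantage is nonnegative and vanishes at every state exactly when the policy is optimal, a small gap certifies near-optimality without referencing the optimal policy's stationary distribution, consistent with the distribution-independence claimed above.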
arXiv Detail & Related papers (2024-09-28T18:56:48Z)
- Global Algorithms for Mean-Variance Optimization in Markov Decision Processes [8.601670707452083]
Dynamic optimization of mean and variance in Markov decision processes (MDPs) is a long-standing challenge caused by the failure of dynamic programming.
We propose a new approach to find the globally optimal policy for combined metrics of steady-state mean and variance in an infinite-horizon undiscounted MDP.
arXiv Detail & Related papers (2023-02-27T12:17:43Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Faster Last-iterate Convergence of Policy Optimization in Zero-Sum Markov Games [63.60117916422867]
This paper focuses on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games.
We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method.
Our convergence results improve upon the best known complexities, and lead to a better understanding of policy optimization in competitive Markov games.
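As a simplified illustration of the update family, the sketch below runs optimistic multiplicative weights on a zero-sum matrix game, i.e., a single-state Markov game, and omits the entropy regularization and value estimation the paper adds; the payoff matrix and step size are arbitrary assumptions.

```python
# Optimistic multiplicative weights (OMWU) on a zero-sum matrix game:
# a one-state sketch of the update family; the paper's method further
# handles Markov games, entropy regularization, and value estimation.
import numpy as np

A = np.array([[ 0.0,  1.0, -1.0],    # illustrative payoff matrix
              [-1.0,  0.0,  1.0],    # (row player maximizes x^T A y,
              [ 1.0, -1.0,  0.0]])   #  column player minimizes it)
eta = 0.1
x = np.ones(3) / 3
y = np.ones(3) / 3
gx_prev, gy_prev = np.zeros(3), np.zeros(3)

for _ in range(2000):
    gx, gy = A @ y, A.T @ x          # each player's payoff gradient
    # Optimistic (extrapolated) step: use 2*g_t - g_{t-1} instead of g_t,
    # then renormalize onto the simplex; both players update symmetrically.
    x = x * np.exp(eta * (2 * gx - gx_prev))
    x /= x.sum()
    y = y * np.exp(-eta * (2 * gy - gy_prev))
    y /= y.sum()
    gx_prev, gy_prev = gx, gy

print("row strategy:", x.round(3), "column strategy:", y.round(3))
```

On this rock-paper-scissors-style game the iterates themselves approach the uniform equilibrium, illustrating the last-iterate (rather than averaged-iterate) convergence this entry emphasizes.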
arXiv Detail & Related papers (2022-10-03T16:05:43Z)
- Stochastic first-order methods for average-reward Markov decision processes [10.023632561462712]
We study average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy optimization and policy evaluation.
By combining the policy evaluation and policy optimization parts, we establish sample complexity results for solving AMDPs under both generative and Markovian noise models.
arXiv Detail & Related papers (2022-05-11T23:02:46Z)
- Off-Policy Evaluation with Policy-Dependent Optimization Response [90.28758112893054]
We develop a new framework for off-policy evaluation with a policy-dependent linear optimization response.
We construct unbiased estimators for the policy-dependent estimand by a perturbation method.
We provide a general algorithm for optimizing causal interventions.
arXiv Detail & Related papers (2022-02-25T20:25:37Z)
- A unified algorithm framework for mean-variance optimization in discounted Markov decision processes [7.510742715895749]
This paper studies the risk-averse mean-variance optimization in infinite-horizon discounted Markov decision processes (MDPs).
We introduce a pseudo mean to transform the intractable MDP into a standard one with a redefined reward function.
We propose a unified algorithm framework with a bilevel optimization structure for the discounted mean-variance optimization.
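A plausible workhorse behind such a transformation, stated as our reading rather than the paper's exact construction, is the classical variational form of the variance:

```latex
% Classical identity underlying a "pseudo mean" (illustrative):
\[
  \operatorname{Var}_{\pi}(r)
  \;=\; \min_{y \in \mathbb{R}} \, \mathbb{E}_{\pi}\!\left[(r - y)^{2}\right],
  \qquad \text{with minimizer } y^{*} = \mathbb{E}_{\pi}[r].
\]
% Freezing y turns the inner problem into a standard MDP with a redefined
% reward such as r'(s,a) = r(s,a) - \lambda (r(s,a) - y)^2; an outer loop
% then updates y, which is one way a bilevel structure arises.
```

With y frozen, time consistency is restored because the redefined reward no longer couples different stages through the unknown mean.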
arXiv Detail & Related papers (2022-01-15T02:19:56Z)
- Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
First, we show that which optimization method is preferable depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z)
- Variance Penalized On-Policy and Off-Policy Actor-Critic [60.06593931848165]
We propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return.
Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.
arXiv Detail & Related papers (2021-02-03T10:06:16Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average-reward setting with a variance risk criterion.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
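The dual variable's role can be sketched with the standard Fenchel trick for the squared mean (an illustration; the paper's exact saddle-point formulation may differ): the square of an expectation is not itself an expectation, but duality linearizes it.

```latex
% Standard Fenchel-duality trick for the squared mean (illustrative):
\[
  \bigl(\mathbb{E}_{\pi}[r]\bigr)^{2}
  \;=\; \max_{y \in \mathbb{R}} \bigl( 2\, y\, \mathbb{E}_{\pi}[r] - y^{2} \bigr),
\]
% so the variance E[r^2] - (E[r])^2 becomes part of a saddle-point
% objective in the policy, the Lagrange multiplier, and y, each term of
% which admits unbiased stochastic gradients from sampled rewards.
```

This is what makes iterative, sample-based updates of all three variables feasible in an actor-critic loop.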
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization [44.24881971917951]
Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms.
We develop convergence guarantees for entropy-regularized NPG methods under softmax parameterization.
Our results accommodate a wide range of learning rates, and shed light upon the role of entropy regularization in enabling fast convergence.
arXiv Detail & Related papers (2020-07-13T17:58:41Z)
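Under softmax parameterization, entropy-regularized NPG is known to take a multiplicative-weights form; the version below follows the common statement in this literature, with the exact constants treated as illustrative (eta is the step size, tau the entropy weight, gamma the discount factor):

```latex
% Entropy-regularized NPG update under softmax parameterization
% (sketched; exact exponents depend on conventions):
\[
  \pi^{(t+1)}(a \mid s) \;\propto\;
  \pi^{(t)}(a \mid s)^{\,1 - \frac{\eta \tau}{1 - \gamma}}
  \exp\!\left( \frac{\eta \, Q_{\tau}^{(t)}(s, a)}{1 - \gamma} \right),
\]
% where Q_tau is the soft (entropy-regularized) Q-function. The first
% factor flattens the old policy toward uniform, which is the mechanism
% behind the fast linear convergence noted above.
```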