Global Algorithms for Mean-Variance Optimization in Markov Decision
Processes
- URL: http://arxiv.org/abs/2302.13710v1
- Date: Mon, 27 Feb 2023 12:17:43 GMT
- Title: Global Algorithms for Mean-Variance Optimization in Markov Decision
Processes
- Authors: Li Xia, Shuai Ma
- Abstract summary: Dynamic optimization of mean and variance in Markov decision processes (MDPs) is a long-standing challenge caused by the failure of dynamic programming.
We propose a new approach to find the globally optimal policy for combined metrics of steady-state mean and variance in an infinite-horizon undiscounted MDP.
- Score: 8.601670707452083
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dynamic optimization of mean and variance in Markov decision processes (MDPs)
is a long-standing challenge caused by the failure of dynamic programming. In
this paper, we propose a new approach to find the globally optimal policy for
combined metrics of steady-state mean and variance in an infinite-horizon
undiscounted MDP. By introducing the concepts of pseudo mean and pseudo
variance, we convert the original problem to a bilevel MDP problem, where the
inner one is a standard MDP optimizing pseudo mean-variance and the outer one
is a single parameter selection problem optimizing pseudo mean. We use the
sensitivity analysis of MDPs to derive the properties of this bilevel problem.
By solving inner standard MDPs for pseudo mean-variance optimization, we can
identify worse policy spaces dominated by optimal policies of the pseudo
problems. We propose an optimization algorithm which can find the globally
optimal policy by repeatedly removing worse policy spaces. The convergence and
complexity of the algorithm are studied. A further policy dominance property is
proposed to improve the algorithm's efficiency. Numerical
experiments demonstrate the performance and efficiency of our algorithms. To
the best of our knowledge, our algorithm is the first that efficiently finds
the globally optimal policy for mean-variance optimization in MDPs. These
results also hold for solely minimizing the variance metric in MDPs.
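
The bilevel structure described in the abstract can be illustrated with a small, self-contained sketch. Everything in the code below is illustrative rather than the paper's actual algorithm: the tiny random MDP, the helper names (stationary_dist, mean_variance, solve_inner), and the plain grid search over the pseudo mean are assumptions. For a fixed pseudo mean lam, the inner objective E[r] - beta * E[(r - lam)^2] is a standard MDP criterion (variance measured around lam instead of the true mean), and the outer problem selects lam; the paper's algorithm instead removes dominated policy spaces after each inner solve, which is not reproduced here.

```python
import itertools
import numpy as np

# Hypothetical tiny MDP: P[a, s, s'] are transition probabilities, r[a, s] rewards.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r = rng.uniform(0.0, 1.0, size=(n_actions, n_states))
beta = 1.0  # weight of the variance term in the combined metric: mean - beta * variance


def stationary_dist(P_pi):
    """Stationary distribution of an ergodic chain (left eigenvector for eigenvalue 1)."""
    w, v = np.linalg.eig(P_pi.T)
    mu = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return mu / mu.sum()


def mean_variance(policy):
    """Steady-state mean and variance of the per-step reward under a deterministic policy."""
    P_pi = np.array([P[policy[s], s] for s in range(n_states)])
    r_pi = np.array([r[policy[s], s] for s in range(n_states)])
    mu = stationary_dist(P_pi)
    mean = float(mu @ r_pi)
    var = float(mu @ (r_pi - mean) ** 2)
    return mean, var


def solve_inner(lam, policies):
    """Inner problem for a fixed pseudo mean lam: maximize the pseudo mean-variance
    metric E[r] - beta * E[(r - lam)^2], a standard (single-objective) MDP criterion."""
    def pseudo_metric(policy):
        mean, var = mean_variance(policy)
        return mean - beta * (var + (mean - lam) ** 2)
    return max(policies, key=pseudo_metric)


# Outer problem: a one-dimensional search over the pseudo mean lam.
# (The paper prunes dominated policy spaces instead of grid-searching.)
policies = list(itertools.product(range(n_actions), repeat=n_states))
best = None
for lam in np.linspace(0.0, 1.0, 21):
    pi = solve_inner(lam, policies)
    mean, var = mean_variance(pi)
    score = mean - beta * var  # true combined metric of the inner solution
    if best is None or score > best[0]:
        best = (score, pi, mean, var)

print(f"best combined metric {best[0]:.4f} under policy {best[1]} "
      f"(mean {best[2]:.4f}, variance {best[3]:.4f})")
```

Enumerating deterministic policies keeps the inner solve exact for this toy example; a real implementation would replace it with an average-reward MDP solver such as policy iteration.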
Related papers
- A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning.
These problems are often formalized as Bi-Level optimizations (BLO).
We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth distribution and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z) - MDP Geometry, Normalization and Reward Balancing Solvers [15.627546283580166]
The Markov Decision Process (MDP) is a widely used mathematical model for sequential decision-making problems.
We present a new geometric interpretation of MDPs with a natural normalization procedure that allows us to adjust the value function at each state without altering the advantage of any action with respect to any policy.
arXiv Detail & Related papers (2024-07-09T09:39:45Z) - Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO, derived from the optimal solution of the problem, leads to a compromised mean-seeking approximation of the optimal solution in practice.
We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z) - Policy Gradient Algorithms for Robust MDPs with Non-Rectangular
Uncertainty Sets [10.26382228865201]
We propose policy gradient algorithms for robust infinite-horizon Markov decision processes (MDPs) with non-rectangular uncertainty sets.
The corresponding robust MDPs cannot be solved with dynamic programming techniques and are in fact provably intractable.
We thus present the first complete solution scheme for robust MDPs with non-rectangular uncertainty sets offering global optimality guarantees.
arXiv Detail & Related papers (2023-05-30T13:02:25Z) - Local Optimization Achieves Global Optimality in Multi-Agent
Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z) - ACPO: A Policy Optimization Algorithm for Average MDPs with Constraints [36.16736392624796]
We introduce a new policy optimization algorithm with function approximation for constrained MDPs with the average criterion.
We develop basic sensitivity theory for average CMDPs, and then use the corresponding bounds in the design of the algorithm.
We show its superior empirical performance when compared to other state-of-the-art algorithms adapted for ACMDPs.
arXiv Detail & Related papers (2023-02-02T00:23:36Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs based on a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - A novel multiobjective evolutionary algorithm based on decomposition and
multi-reference points strategy [14.102326122777475]
The multiobjective evolutionary algorithm based on decomposition (MOEA/D) has been regarded as a highly promising approach for solving multiobjective optimization problems (MOPs).
We propose an improved MOEA/D algorithm by virtue of the well-known Pascoletti-Serafini scalarization method and a new strategy of multi-reference points.
arXiv Detail & Related papers (2021-10-27T02:07:08Z) - Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Risk-Sensitive Markov Decision Processes with Combined Metrics of Mean
and Variance [3.062772835338966]
This paper investigates the optimization problem of an infinite-stage, discrete-time Markov decision process (MDP) with a long-run average metric.
A performance difference formula is derived, which quantifies the difference in the mean-variance combined metric between any two policies.
A necessary condition for the optimal policy and the optimality of deterministic policies are derived (a sketch of the classical average-reward analogue of the difference formula follows this entry).
arXiv Detail & Related papers (2020-08-09T10:35:35Z)
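
As background for the performance difference formula mentioned in the last entry above, the classical sensitivity-based result for the long-run average reward reads as follows; this is the standard average-reward formula, not the paper's mean-variance extension, and the notation (performance potential g^pi, stationary distribution mu^pi) is assumed here.

```latex
% Classical average-reward performance difference formula (standard result):
% \eta^\pi is the long-run average reward of policy \pi, \mu^{\pi'} the stationary
% distribution under \pi', and g^\pi the performance potential (bias) under \pi.
\eta^{\pi'} - \eta^{\pi}
  = \sum_{s} \mu^{\pi'}(s) \Big[ \big( r^{\pi'}(s) - r^{\pi}(s) \big)
    + \sum_{s'} \big( P^{\pi'}(s' \mid s) - P^{\pi}(s' \mid s) \big)\, g^{\pi}(s') \Big].
```

The related paper derives an analogous formula for the mean-variance combined metric; that version is not reproduced here.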