Global Algorithms for Mean-Variance Optimization in Markov Decision
Processes
- URL: http://arxiv.org/abs/2302.13710v1
- Date: Mon, 27 Feb 2023 12:17:43 GMT
- Title: Global Algorithms for Mean-Variance Optimization in Markov Decision
Processes
- Authors: Li Xia, Shuai Ma
- Abstract summary: Dynamic optimization of mean and variance in Markov decision processes (MDPs) is a long-standing challenge caused by the failure of dynamic programming.
We propose a new approach to find the globally optimal policy for combined metrics of steady-state mean and variance in an infinite-horizon undiscounted MDP.
- Score: 8.601670707452083
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dynamic optimization of mean and variance in Markov decision processes (MDPs)
is a long-standing challenge caused by the failure of dynamic programming. In
this paper, we propose a new approach to find the globally optimal policy for
combined metrics of steady-state mean and variance in an infinite-horizon
undiscounted MDP. By introducing the concepts of pseudo mean and pseudo
variance, we convert the original problem to a bilevel MDP problem, where the
inner one is a standard MDP optimizing pseudo mean-variance and the outer one
is a single parameter selection problem optimizing pseudo mean. We use the
sensitivity analysis of MDPs to derive the properties of this bilevel problem.
By solving inner standard MDPs for pseudo mean-variance optimization, we can
identify worse policy spaces dominated by optimal policies of the pseudo
problems. We propose an optimization algorithm which can find the globally
optimal policy by repeatedly removing worse policy spaces. The convergence and
complexity of the algorithm are studied. A further policy dominance property is
proposed to improve the algorithm's efficiency. Numerical
experiments demonstrate the performance and efficiency of our algorithms. To
the best of our knowledge, our algorithm is the first that efficiently finds
the globally optimal policy for mean-variance optimization in MDPs. These
results also hold for solely minimizing the variance metric in MDPs.
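
The bilevel structure described in the abstract can be illustrated with a small, self-contained sketch. Everything in the code below is illustrative rather than the paper's actual algorithm: the tiny random MDP, the helper names (stationary_dist, mean_variance, solve_inner), and the plain grid search over the pseudo mean are assumptions. For a fixed pseudo mean lam, the inner objective E[r] - beta * E[(r - lam)^2] is a standard MDP criterion (variance measured around lam instead of the true mean), and the outer problem selects lam; the paper's algorithm instead removes dominated policy spaces after each inner solve, which is not reproduced here.

```python
import itertools
import numpy as np

# Hypothetical tiny MDP: P[a, s, s'] are transition probabilities, r[a, s] rewards.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r = rng.uniform(0.0, 1.0, size=(n_actions, n_states))
beta = 1.0  # weight of the variance term in the combined metric: mean - beta * variance


def stationary_dist(P_pi):
    """Stationary distribution of an ergodic chain (left eigenvector for eigenvalue 1)."""
    w, v = np.linalg.eig(P_pi.T)
    mu = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return mu / mu.sum()


def mean_variance(policy):
    """Steady-state mean and variance of the per-step reward under a deterministic policy."""
    P_pi = np.array([P[policy[s], s] for s in range(n_states)])
    r_pi = np.array([r[policy[s], s] for s in range(n_states)])
    mu = stationary_dist(P_pi)
    mean = float(mu @ r_pi)
    var = float(mu @ (r_pi - mean) ** 2)
    return mean, var


def solve_inner(lam, policies):
    """Inner problem for a fixed pseudo mean lam: maximize the pseudo mean-variance
    metric E[r] - beta * E[(r - lam)^2], a standard (single-objective) MDP criterion."""
    def pseudo_metric(policy):
        mean, var = mean_variance(policy)
        return mean - beta * (var + (mean - lam) ** 2)
    return max(policies, key=pseudo_metric)


# Outer problem: a one-dimensional search over the pseudo mean lam.
# (The paper prunes dominated policy spaces instead of grid-searching.)
policies = list(itertools.product(range(n_actions), repeat=n_states))
best = None
for lam in np.linspace(0.0, 1.0, 21):
    pi = solve_inner(lam, policies)
    mean, var = mean_variance(pi)
    score = mean - beta * var  # true combined metric of the inner solution
    if best is None or score > best[0]:
        best = (score, pi, mean, var)

print(f"best combined metric {best[0]:.4f} under policy {best[1]} "
      f"(mean {best[2]:.4f}, variance {best[3]:.4f})")
```

Enumerating deterministic policies keeps the inner solve exact for this toy example; a real implementation would replace it with an average-reward MDP solver such as policy iteration.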
Related papers
- A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning.
These problems are often formalized as Bi-Level optimizations (BLO).
We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth distribution and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z) - MDP Geometry, Normalization and Reward Balancing Solvers [15.627546283580166]
The Markov Decision Process (MDP) is a widely used mathematical model for sequential decision-making problems.
We present a new geometric interpretation of MDPs with a natural normalization procedure that allows us to adjust the value function at each state without altering the advantage of any action with respect to any policy.
arXiv Detail & Related papers (2024-07-09T09:39:45Z) - Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO, derived from the optimal solution of the problem, leads to a compromised mean-seeking approximation of the optimal solution in practice.
We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z) - Policy Gradient Algorithms for Robust MDPs with Non-Rectangular
Uncertainty Sets [10.26382228865201]
We propose policy gradient algorithms for robust infinite-horizon Markov decision processes (MDPs) with non-rectangular uncertainty sets.
The corresponding robust MDPs cannot be solved with dynamic programming techniques and are in fact provably intractable.
We thus present the first complete solution scheme for robust MDPs with non-rectangular uncertainty sets offering global optimality guarantees.
arXiv Detail & Related papers (2023-05-30T13:02:25Z) - Local Optimization Achieves Global Optimality in Multi-Agent
Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z) - ACPO: A Policy Optimization Algorithm for Average MDPs with Constraints [36.16736392624796]
We introduce a new policy optimization algorithm with function approximation for constrained MDPs with the average criterion.
We develop basic sensitivity theory for average CMDPs, and then use the corresponding bounds in the design of the algorithm.
We show its superior empirical performance when compared to other state-of-the-art algorithms adapted for ACMDPs.
arXiv Detail & Related papers (2023-02-02T00:23:36Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs based on a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - A novel multiobjective evolutionary algorithm based on decomposition and
multi-reference points strategy [14.102326122777475]
The multiobjective evolutionary algorithm based on decomposition (MOEA/D) has been regarded as a highly promising approach for solving multiobjective optimization problems (MOPs).
We propose an improved MOEA/D algorithm by virtue of the well-known Pascoletti-Serafini scalarization method and a new strategy of multi-reference points.
arXiv Detail & Related papers (2021-10-27T02:07:08Z) - Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Risk-Sensitive Markov Decision Processes with Combined Metrics of Mean
and Variance [3.062772835338966]
This paper investigates the optimization problem of an infinite-stage, discrete-time Markov decision process (MDP) with a long-run average metric.
A performance difference formula is derived, which quantifies the difference in the mean-variance combined metric between any two policies.
A necessary condition for the optimal policy and the optimality of deterministic policies are derived (a sketch of the classical average-reward analogue of the difference formula follows this entry).
arXiv Detail & Related papers (2020-08-09T10:35:35Z)
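
As background for the performance difference formula mentioned in the last entry above, the classical sensitivity-based result for the long-run average reward reads as follows; this is the standard average-reward formula, not the paper's mean-variance extension, and the notation (performance potential g^pi, stationary distribution mu^pi) is assumed here.

```latex
% Classical average-reward performance difference formula (standard result):
% \eta^\pi is the long-run average reward of policy \pi, \mu^{\pi'} the stationary
% distribution under \pi', and g^\pi the performance potential (bias) under \pi.
\eta^{\pi'} - \eta^{\pi}
  = \sum_{s} \mu^{\pi'}(s) \Big[ \big( r^{\pi'}(s) - r^{\pi}(s) \big)
    + \sum_{s'} \big( P^{\pi'}(s' \mid s) - P^{\pi}(s' \mid s) \big)\, g^{\pi}(s') \Big].
```

The related paper derives an analogous formula for the mean-variance combined metric; that version is not reproduced here.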