Relative Policy-Transition Optimization for Fast Policy Transfer
- URL: http://arxiv.org/abs/2206.06009v3
- Date: Wed, 24 Jan 2024 15:23:09 GMT
- Title: Relative Policy-Transition Optimization for Fast Policy Transfer
- Authors: Jiawei Xu, Cheng Zhou, Yizheng Zhang, Baoxiang Wang, Lei Han
- Abstract summary: We consider the problem of policy transfer between two Markov Decision Processes (MDPs).
We propose two new algorithms referred to as Relative Policy Optimization (RPO) and Relative Transition Optimization (RTO).
RPO transfers the policy evaluated in one environment to maximize the return in another, while RTO updates the parameterized dynamics model to reduce the gap between the dynamics of the two environments.
- Score: 18.966619060222634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of policy transfer between two Markov Decision
Processes (MDPs). We introduce a lemma based on existing theoretical results in
reinforcement learning to measure the relativity gap between two arbitrary
MDPs, that is, the difference between any two cumulative expected returns
defined under different policies and environment dynamics. Based on this lemma, we
propose two new algorithms referred to as Relative Policy Optimization (RPO)
and Relative Transition Optimization (RTO), which offer fast policy transfer
and dynamics modelling, respectively. RPO transfers the policy evaluated in one
environment to maximize the return in another, while RTO updates the
parameterized dynamics model to reduce the gap between the dynamics of the two
environments. Integrating the two algorithms results in the complete Relative
Policy-Transition Optimization (RPTO) algorithm, in which the policy interacts
with the two environments simultaneously, so that data collection from the two
environments, policy updates, and transition updates are completed in one closed
loop, forming a principled learning framework for policy transfer. We demonstrate
the effectiveness of RPTO on a set of MuJoCo continuous control tasks by creating
policy transfer problems through variations in the dynamics.
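To make the closed loop concrete, below is a minimal, self-contained Python sketch of an RPTO-style training loop on a toy 1-D task. The environment, the linear policy a = -k * s, and the helper functions rpo_policy_update and rto_dynamics_update are illustrative assumptions standing in for the RPO and RTO updates, which the abstract only describes at a high level; this is not the authors' implementation.

```python
# Hedged sketch of an RPTO-style closed loop on a toy task (not the paper's code).
import numpy as np

class ToyEnv:
    """1-D control task; `drift` plays the role of the variant dynamics."""
    def __init__(self, drift):
        self.drift = drift
        self.state = 0.0

    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):
        self.state += action + self.drift + 0.01 * np.random.randn()
        return self.state, -abs(self.state)  # reward: stay near the origin

def collect_rollout(env, k, horizon=32):
    """Roll the linear policy a = -k * s and record (s, a, r, s') tuples."""
    s, traj = env.reset(), []
    for _ in range(horizon):
        a = -k * s
        s_next, r = env.step(a)
        traj.append((s, a, r, s_next))
        s = s_next
    return traj

def episode_return(env, k):
    return sum(r for _, _, r, _ in collect_rollout(env, k))

def rpo_policy_update(k, target_env, eps=0.05):
    """RPO-style step (assumed form): improve the policy gain against the
    return measured in the target environment."""
    return max((k - eps, k, k + eps), key=lambda c: episode_return(target_env, c))

def rto_dynamics_update(model_env, source_traj, target_traj, lr=0.1):
    """RTO-style step (assumed form): shift the modelled dynamics toward the
    target dynamics, shrinking the gap estimated from both data streams."""
    drift = lambda tr: float(np.mean([sn - s - a for s, a, _, sn in tr]))
    model_env.drift += lr * (drift(target_traj) - drift(source_traj))

source_env, target_env = ToyEnv(drift=0.0), ToyEnv(drift=0.2)
k = 0.5
for _ in range(100):                           # one RPTO closed loop per iteration
    src = collect_rollout(source_env, k)       # data collection in the source env
    tgt = collect_rollout(target_env, k)       # data collection in the target env
    k = rpo_policy_update(k, target_env)       # RPO: policy transfer step
    rto_dynamics_update(source_env, src, tgt)  # RTO: dynamics alignment step
print(f"final gain {k:.2f}, modelled drift {source_env.drift:.2f}")
```

In this simplified picture, the RPO step improves the policy against the target environment's return while the RTO step pulls the source (model) dynamics toward the target dynamics, so both updates happen inside one data-collection loop.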
Related papers
- Dual Alignment Maximin Optimization for Offline Model-based RL [10.048622079413313]
Offline reinforcement learning agents face significant deployment challenges due to the synthetic-to-real distribution mismatch.
In this paper, we first shift the focus from model reliability to policy discrepancies while optimizing for expected returns, and then self-consistently incorporate synthetic data.
It provides a unified framework that ensures both model-environment policy consistency and compatibility between synthetic and offline data.
arXiv Detail & Related papers (2025-02-02T16:47:35Z) - Policy Gradient for Robust Markov Decision Processes [16.281897051782863]
This paper introduces a novel policy gradient method, Double-Loop Robust Policy Mirror Descent (MD), for solving robust Markov Decision Processes (MDPs).
MD employs a general mirror descent update rule for policy optimization with adaptive tolerance per iteration, guaranteeing convergence to a globally optimal policy.
We provide a comprehensive analysis of MD, including new convergence results under both direct and softmax parameterizations, and offer novel insights into the inner-problem solution through Transition Mirror Ascent (TMA).
arXiv Detail & Related papers (2024-10-29T15:16:02Z) - Adaptive Opponent Policy Detection in Multi-Agent MDPs: Real-Time Strategy Switch Identification Using Running Error Estimation [1.079960007119637]
OPS-DeMo is an online algorithm that employs dynamic error decay to detect changes in opponents' policies.
Our approach outperforms PPO-trained models in dynamic scenarios like the Predator-Prey setting.
arXiv Detail & Related papers (2024-06-10T17:34:44Z) - Fast Policy Learning for Linear Quadratic Control with Entropy
Regularization [10.771650397337366]
This paper proposes and analyzes two new policy learning methods: regularized policy gradient (RPG) and iterative policy optimization (IPO), for a class of discounted linear-quadratic control (LQC) problems.
Assuming access to exact policy evaluation, both proposed approaches are proven to converge linearly to optimal policies of the regularized LQC problem.
arXiv Detail & Related papers (2023-11-23T19:08:39Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - Acceleration in Policy Optimization [50.323182853069184]
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates.
We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change.
We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
arXiv Detail & Related papers (2023-06-18T15:50:57Z) - A State-Augmented Approach for Learning Optimal Resource Management
Decisions in Wireless Networks [58.720142291102135]
We consider a radio resource management (RRM) problem in a multi-user wireless network.
The goal is to optimize a network-wide utility function subject to constraints on the ergodic average performance of users.
We propose a state-augmented parameterization for the RRM policy, where alongside the instantaneous network states, the RRM policy takes as input the set of dual variables corresponding to the constraints.
arXiv Detail & Related papers (2022-10-28T21:24:13Z) - Faster Last-iterate Convergence of Policy Optimization in Zero-Sum
Markov Games [63.60117916422867]
This paper focuses on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games.
We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method.
Our convergence results improve upon the best known complexities, and lead to a better understanding of policy optimization in competitive Markov games.
arXiv Detail & Related papers (2022-10-03T16:05:43Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for MDPs with topological constraints (TMDPs), obtained as a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - Fast Model-based Policy Search for Universal Policy Networks [45.44896435487879]
Adapting an agent's behaviour to new environments has been one of the primary focus areas of physics-based reinforcement learning.
We propose a Gaussian Process-based prior learned in simulation, that captures the likely performance of a policy when transferred to a previously unseen environment.
We integrate this prior with a Bayesian optimisation-based policy search process to improve the efficiency of identifying the most appropriate policy from the universal policy network.
arXiv Detail & Related papers (2022-02-11T18:08:02Z) - Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks (a generic sketch of this style of divergence-regularized update follows this list).
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
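The last entry above stabilizes policy improvement with a proximity term between consecutive policies. The sketch below illustrates that general idea on a single-state, three-action toy problem; the softmax tabular policy, the per-update KL penalty, and the numerical gradient are simplifying assumptions and do not reproduce the paper's visitation-distribution-based regularizer.

```python
# Hedged sketch of divergence-regularized policy improvement (toy, not the paper's method).
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def kl(p, q):
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def regularized_policy_step(logits, advantages, old_probs, lr=0.5, beta=1.0):
    """One ascent step on the proximally regularized surrogate
        E_old[ratio * A] - beta * KL(pi_new || pi_old),
    using a numerical gradient for clarity on this toy example."""
    def objective(lg):
        p = softmax(lg)
        ratio = p / (old_probs + 1e-12)
        return np.sum(old_probs * ratio * advantages) - beta * kl(p, old_probs)

    grad = np.zeros_like(logits)
    for i in range(len(logits)):
        e = np.zeros_like(logits)
        e[i] = 1e-5
        grad[i] = (objective(logits + e) - objective(logits - e)) / 2e-5
    return logits + lr * grad

logits = np.zeros(3)                      # single state, three discrete actions
old_probs = softmax(logits)               # the "previous" policy (uniform here)
advantages = np.array([1.0, 0.0, -1.0])   # pretend advantage estimates
for _ in range(20):
    logits = regularized_policy_step(logits, advantages, old_probs)
print(softmax(logits))  # mass shifts toward the high-advantage action, but the
                        # KL proximity term keeps the new policy close to the old one
```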