Continuous MDP Homomorphisms and Homomorphic Policy Gradient
- URL: http://arxiv.org/abs/2209.07364v1
- Date: Thu, 15 Sep 2022 15:26:49 GMT
- Title: Continuous MDP Homomorphisms and Homomorphic Policy Gradient
- Authors: Sahand Rezaei-Shoshtari, Rosie Zhao, Prakash Panangaden, David Meger,
Doina Precup
- Abstract summary: We extend the definition of MDP homomorphisms to encompass continuous actions in continuous state spaces.
We propose an actor-critic algorithm that is able to learn the policy and the MDP homomorphism map simultaneously.
- Score: 51.25171126424949
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Abstraction has been widely studied as a way to improve the efficiency and
generalization of reinforcement learning algorithms. In this paper, we study
abstraction in the continuous-control setting. We extend the definition of MDP
homomorphisms to encompass continuous actions in continuous state spaces. We
derive a policy gradient theorem on the abstract MDP, which allows us to
leverage approximate symmetries of the environment for policy optimization.
Based on this theorem, we propose an actor-critic algorithm that is able to
learn the policy and the MDP homomorphism map simultaneously, using the lax
bisimulation metric. We demonstrate the effectiveness of our method on
benchmark tasks in the DeepMind Control Suite. Our method's ability to utilize
MDP homomorphisms for representation learning leads to improved performance
when learning from pixel observations.
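For reference, the homomorphism conditions being generalized can be sketched as follows (this is the form commonly used in the literature; the paper's exact measure-theoretic formulation may differ in detail). A map pair $h = (f, \{g_s\}_{s \in \mathcal{S}})$, with $f : \mathcal{S} \to \bar{\mathcal{S}}$ and $g_s : \mathcal{A} \to \bar{\mathcal{A}}$, is an MDP homomorphism if rewards and transitions are preserved:
$$\bar{r}\big(f(s), g_s(a)\big) = r(s, a), \qquad \bar{P}\big(B \mid f(s), g_s(a)\big) = P\big(f^{-1}(B) \mid s, a\big)$$
for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, and measurable $B \subseteq \bar{\mathcal{S}}$. Under such a map, a policy optimized on the abstract MDP can be lifted back to the original MDP, which is what the homomorphic policy gradient exploits.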
Related papers
- On the Global Convergence of Policy Gradient in Average Reward Markov
Decision Processes [50.68789924454235]
We present the first finite time global convergence analysis of policy gradient in the context of average reward Markov decision processes (MDPs).
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations.
arXiv Detail & Related papers (2024-03-11T15:25:03Z)
- Low-Rank MDPs with Continuous Action Spaces [42.695778474071254]
We study the problem of extending such methods to settings with continuous actions.
We show that, without any modifications to the algorithm, we obtain a similar PAC bound when actions are allowed to be continuous.
arXiv Detail & Related papers (2023-11-06T22:05:08Z)
- Policy Gradient Methods in the Presence of Symmetries and State Abstractions [46.66541516203923]
Reinforcement learning (RL) on high-dimensional and complex problems relies on abstraction for improved efficiency and generalization.
We study abstraction in the continuous-control setting, and extend the definition of Markov decision process (MDP) homomorphisms to the setting of continuous state and action spaces.
We propose a family of actor-critic algorithms that are able to learn the policy and the MDP homomorphism map simultaneously.
arXiv Detail & Related papers (2023-05-09T17:59:10Z)
- Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained as a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z)
- Making Linear MDPs Practical via Contrastive Representation Learning [101.75885788118131]
It is common to address the curse of dimensionality in Markov decision processes (MDPs) by exploiting low-rank representations.
We consider an alternative definition of linear MDPs that automatically ensures normalization while allowing efficient representation learning.
We demonstrate superior performance over existing state-of-the-art model-based and model-free algorithms on several benchmarks.
arXiv Detail & Related papers (2022-07-14T18:18:02Z)
- BATS: Best Action Trajectory Stitching [22.75880303352508]
We introduce an algorithm which forms a tabular Markov Decision Process (MDP) over the logged data by adding new transitions to the dataset.
We prove that this property allows one to make upper and lower bounds on the value function up to appropriate distance metrics.
We show an example in which simply behavior cloning the optimal policy of the MDP created by our algorithm avoids this problem.
arXiv Detail & Related papers (2022-04-26T01:48:32Z)
- Plannable Approximations to MDP Homomorphisms: Equivariance under Actions [72.30921397899684]
We introduce a contrastive loss function that enforces action equivariance on the learned representations (a sketch of this style of loss appears after this list).
We prove that when our loss is zero, we have a homomorphism of a deterministic Markov Decision Process.
We show experimentally that for deterministic MDPs, the optimal policy in the abstract MDP can be successfully lifted to the original MDP.
arXiv Detail & Related papers (2020-02-27T08:29:10Z)
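The last related paper above learns representations with a contrastive loss that enforces action equivariance. A minimal sketch of that style of loss is given below, assuming a learned state encoder and a latent transition model; the hinge form, margin, and squared-distance choice are illustrative assumptions rather than the authors' exact objective.

import torch.nn.functional as F


def contrastive_equivariance_loss(encoder, transition, s, a, s_next, s_neg, margin=1.0):
    """Sketch of a contrastive action-equivariance loss (assumed form).

    encoder:      maps observed states to latent abstract states
    transition:   predicts the next latent state from (latent state, action)
    s, a, s_next: a batch of observed transitions
    s_neg:        randomly drawn negative states for the contrastive term
    """
    z, z_next, z_neg = encoder(s), encoder(s_next), encoder(s_neg)

    # Positive term: acting in latent space should match encoding the true next state.
    z_pred = transition(z, a)
    positive = (z_pred - z_next).pow(2).sum(dim=-1).mean()

    # Negative term: keep predictions away from unrelated states (hinge with margin).
    negative = F.relu(margin - (z_pred - z_neg).pow(2).sum(dim=-1)).mean()

    return positive + negative

Driving such a loss to zero on a deterministic MDP makes the latent dynamics commute with the encoder, which corresponds to the paper's claim that a zero loss yields a homomorphism of a deterministic MDP.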
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.