OMPO: A Unified Framework for RL under Policy and Dynamics Shifts
- URL: http://arxiv.org/abs/2405.19080v1
- Date: Wed, 29 May 2024 13:36:36 GMT
- Title: OMPO: A Unified Framework for RL under Policy and Dynamics Shifts
- Authors: Yu Luo, Tianying Ji, Fuchun Sun, Jianwei Zhang, Huazhe Xu, Xianyuan Zhan,
- Abstract summary: Training reinforcement learning policies using environment interaction data collected from varying policies or dynamics presents a fundamental challenge.
Existing works often overlook the distribution discrepancies induced by policy or dynamics shifts, or rely on specialized algorithms with task priors.
In this paper, we identify a unified strategy for online RL policy learning under diverse settings of policy and dynamics shifts: transition occupancy matching.
- Score: 42.57662196581823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training reinforcement learning policies using environment interaction data collected from varying policies or dynamics presents a fundamental challenge. Existing works often overlook the distribution discrepancies induced by policy or dynamics shifts, or rely on specialized algorithms with task priors, thus often resulting in suboptimal policy performances and high learning variances. In this paper, we identify a unified strategy for online RL policy learning under diverse settings of policy and dynamics shifts: transition occupancy matching. In light of this, we introduce a surrogate policy learning objective by considering the transition occupancy discrepancies and then cast it into a tractable min-max optimization problem through dual reformulation. Our method, dubbed Occupancy-Matching Policy Optimization (OMPO), features a specialized actor-critic structure equipped with a distribution discriminator and a small-size local buffer. We conduct extensive experiments based on the OpenAI Gym, Meta-World, and Panda Robots environments, encompassing policy shifts under stationary and nonstationary dynamics, as well as domain adaption. The results demonstrate that OMPO outperforms the specialized baselines from different categories in all settings. We also find that OMPO exhibits particularly strong performance when combined with domain randomization, highlighting its potential in RL-based robotics applications
Related papers
- Survival of the Fittest: Evolutionary Adaptation of Policies for Environmental Shifts [0.15889427269227555]
We develop an adaptive re-training algorithm inspired by evolutionary game theory (EGT)
ERPO shows faster policy adaptation, higher average rewards, and reduced computational costs in policy adaptation.
arXiv Detail & Related papers (2024-10-22T09:29:53Z) - Diffusion Policy Policy Optimization [37.04382170999901]
Diffusion Policy Optimization, DPPO, is an algorithmic framework for fine-tuning diffusion-based policies.
DPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks.
We show that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization.
arXiv Detail & Related papers (2024-09-01T02:47:50Z) - Invariant Causal Imitation Learning for Generalizable Policies [87.51882102248395]
We propose Invariant Causal Learning (ICIL) to learn an imitation policy.
ICIL learns a representation of causal features that is disentangled from the specific representations of noise variables.
We show that ICIL is effective in learning imitation policies capable of generalizing to unseen environments.
arXiv Detail & Related papers (2023-11-02T16:52:36Z) - State Regularized Policy Optimization on Data with Dynamics Shift [25.412472472457324]
In many real-world scenarios, Reinforcement Learning (RL) algorithms are trained on data with dynamics shift, i.e., with different underlying environment dynamics.
In this paper, we find that in many environments with similar structures and different dynamics, optimal policies have similar stationary state distributions.
Such distribution is used to regularize the policy trained in a new environment, leading to the SRPO (textbfS textbfRegularized textbfPolicy textbfOptimization) algorithm.
arXiv Detail & Related papers (2023-06-06T10:06:09Z) - Cross-Domain Policy Adaptation via Value-Guided Data Filtering [57.62692881606099]
Generalizing policies across different domains with dynamics mismatch poses a significant challenge in reinforcement learning.
We present the Value-Guided Data Filtering (VGDF) algorithm, which selectively shares transitions from the source domain based on the proximity of paired value targets.
arXiv Detail & Related papers (2023-05-28T04:08:40Z) - Diverse Policy Optimization for Structured Action Space [59.361076277997704]
We propose Diverse Policy Optimization (DPO) to model the policies in structured action space as the energy-based models (EBM)
A novel and powerful generative model, GFlowNet, is introduced as the efficient, diverse EBM-based policy sampler.
Experiments on ATSC and Battle benchmarks demonstrate that DPO can efficiently discover surprisingly diverse policies.
arXiv Detail & Related papers (2023-02-23T10:48:09Z) - Robust Policy Optimization in Deep Reinforcement Learning [16.999444076456268]
In continuous action domains, parameterized distribution of action distribution allows easy control of exploration.
In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed distribution.
We evaluated our methods on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym.
arXiv Detail & Related papers (2022-12-14T22:43:56Z) - Fast Adaptation via Policy-Dynamics Value Functions [41.738462615120326]
We introduce Policy-Dynamics Value Functions (PD-VF), a novel approach for rapidly adapting to dynamics different from those previously seen in training.
PD-VF explicitly estimates the cumulative reward in a space of policies and environments.
We show that our method can rapidly adapt to new dynamics on a set of MuJoCo domains.
arXiv Detail & Related papers (2020-07-06T16:47:56Z) - Deep Reinforcement Learning amidst Lifelong Non-Stationarity [67.24635298387624]
We show that an off-policy RL algorithm can reason about and tackle lifelong non-stationarity.
Our method leverages latent variable models to learn a representation of the environment from current and past experiences.
We also introduce several simulation environments that exhibit lifelong non-stationarity, and empirically find that our approach substantially outperforms approaches that do not reason about environment shift.
arXiv Detail & Related papers (2020-06-18T17:34:50Z) - Variational Policy Propagation for Multi-agent Reinforcement Learning [68.26579560607597]
We propose a emphcollaborative multi-agent reinforcement learning algorithm named variational policy propagation (VPP) to learn a emphjoint policy through the interactions over agents.
We prove that the joint policy is a Markov Random Field under some mild conditions, which in turn reduces the policy space effectively.
We integrate the variational inference as special differentiable layers in policy such as the actions can be efficiently sampled from the Markov Random Field and the overall policy is differentiable.
arXiv Detail & Related papers (2020-04-19T15:42:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.