Model-Based Offline Meta-Reinforcement Learning with Regularization
- URL: http://arxiv.org/abs/2202.02929v1
- Date: Mon, 7 Feb 2022 04:15:20 GMT
- Title: Model-Based Offline Meta-Reinforcement Learning with Regularization
- Authors: Sen Lin, Jialin Wan, Tengyu Xu, Yingbin Liang, Junshan Zhang
- Abstract summary: Offline Meta-RL is emerging as a promising approach to address the key challenges of offline RL, such as distributional shift.
MerPO learns a meta-model for efficient task structure inference and an informative meta-policy.
We show that MerPO offers guaranteed improvement over both the behavior policy and the meta-policy.
- Score: 63.35040401948943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing offline reinforcement learning (RL) methods face a few major
challenges, particularly the distributional shift between the learned policy
and the behavior policy. Offline Meta-RL is emerging as a promising approach to
address these challenges, aiming to learn an informative meta-policy from a
collection of tasks. Nevertheless, as shown in our empirical studies, offline
Meta-RL can be outperformed by offline single-task RL methods on tasks with
high-quality datasets, indicating that a delicate balance must be struck
between "exploring" out-of-distribution state-actions by
following the meta-policy and "exploiting" the offline dataset by staying close
to the behavior policy. Motivated by such empirical analysis, we explore
model-based offline Meta-RL with regularized Policy Optimization (MerPO), which
learns a meta-model for efficient task structure inference and an informative
meta-policy for safe exploration of out-of-distribution state-actions. In
particular, we devise a new meta-Regularized model-based Actor-Critic (RAC)
method for within-task policy optimization, as a key building block of MerPO,
using conservative policy evaluation and regularized policy improvement;
the intrinsic tradeoff therein is achieved by striking the right balance
between two regularizers, one based on the behavior policy and the other on the
meta-policy. We theoretically show that the learned policy offers guaranteed
improvement over both the behavior policy and the meta-policy, thus ensuring
performance improvement on new tasks via offline Meta-RL. Experiments
corroborate the superior performance of MerPO over existing offline Meta-RL
methods.
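The regularized policy improvement step described above admits a compact illustration. Below is a minimal sketch, assuming diagonal-Gaussian policies, of an actor loss that maximizes a (conservatively estimated) Q-value while being pulled toward a weighted combination of the behavior policy and the meta-policy; the weight omega, the coefficient kappa, and all names are illustrative placeholders, not the paper's implementation.

```python
# Hypothetical sketch of a RAC-style regularized policy-improvement loss:
# the actor maximizes a conservative Q-estimate while being regularized toward
# a convex combination of two anchors, the behavior policy and the meta-policy.
import torch
from torch.distributions import Normal, kl_divergence

def regularized_policy_loss(q_value, policy_dist, behavior_dist, meta_dist,
                            omega=0.5, kappa=1.0):
    """omega in [0, 1] trades off the two regularizers:
    omega -> 0 stays close to the behavior policy ("exploit" the dataset),
    omega -> 1 follows the meta-policy ("explore" out-of-distribution actions).
    """
    kl_behavior = kl_divergence(policy_dist, behavior_dist).sum(-1)
    kl_meta = kl_divergence(policy_dist, meta_dist).sum(-1)
    regularizer = (1.0 - omega) * kl_behavior + omega * kl_meta
    # Maximize Q minus the penalty, i.e. minimize its negation.
    return (-q_value + kappa * regularizer).mean()

# Toy usage with batched diagonal-Gaussian policies over a 1-D action space.
batch = 8
policy = Normal(torch.zeros(batch, 1), torch.ones(batch, 1))
behavior = Normal(0.2 * torch.ones(batch, 1), torch.ones(batch, 1))
meta = Normal(-0.3 * torch.ones(batch, 1), torch.ones(batch, 1))
q = torch.randn(batch)  # stand-in for a conservative critic estimate
loss = regularized_policy_loss(q, policy, behavior, meta, omega=0.7)
```

Setting omega near 0 recovers a purely behavior-regularized update that "exploits" the dataset, while omega near 1 follows the meta-policy to "explore" out-of-distribution state-actions, mirroring the tradeoff described in the abstract.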
Related papers
- Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL [42.57662196581823]
Off-policy reinforcement learning (RL) has achieved notable success in tackling many complex real-world tasks.
Most existing off-policy RL algorithms fail to maximally exploit the information in the replay buffer.
We present Offline-Boosted Actor-Critic (OBAC), a model-free online RL framework that elegantly identifies the outperforming offline policy.
arXiv Detail & Related papers (2024-05-28T18:38:46Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks using D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for offline reinforcement learning.
In this paper, we propose closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- On the Convergence Theory of Meta Reinforcement Learning with Personalized Policies [26.225293232912716]
This paper proposes a novel personalized Meta-RL (pMeta-RL) algorithm.
It aggregates task-specific personalized policies to update a meta-policy used for all tasks, while maintaining personalized policies to maximize the average return of each task.
Experimental results show that the proposed algorithms outperform previous Meta-RL algorithms on the Gym and MuJoCo suites.
arXiv Detail & Related papers (2022-09-21T02:27:56Z)
- Supported Policy Optimization for Offline Reinforcement Learning [74.1011309005488]
Policy constraint methods for offline reinforcement learning (RL) typically utilize parameterization or regularization.
Regularization methods reduce the divergence between the learned policy and the behavior policy.
This paper presents Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint.
arXiv Detail & Related papers (2022-02-13T07:38:36Z)
- FOCAL: Efficient Fully-Offline Meta-Reinforcement Learning via Distance Metric Learning and Behavior Regularization [10.243908145832394]
We study the offline meta-reinforcement learning (OMRL) problem, a paradigm which enables reinforcement learning (RL) algorithms to quickly adapt to unseen tasks.
This problem is still not fully understood, and two major challenges need to be addressed.
We provide analysis and insight showing that some simple design choices can yield substantial improvements over recent approaches.
arXiv Detail & Related papers (2020-10-02T17:13:39Z)
- Offline Meta-Reinforcement Learning with Advantage Weighting [125.21298190780259]
This paper introduces the offline meta-reinforcement learning (offline meta-RL) problem setting and proposes an algorithm that performs well in this setting.
Offline meta-RL is analogous to the widely successful supervised learning strategy of pre-training a model on a large batch of fixed, pre-collected data.
We propose Meta-Actor Critic with Advantage Weighting (MACAW), an optimization-based meta-learning algorithm that uses simple, supervised regression objectives for both the inner and outer loop of meta-training.
arXiv Detail & Related papers (2020-08-13T17:57:14Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by penalizing rewards with the uncertainty of the learned dynamics (see the brief sketch after this list).
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
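As a point of contrast with MerPO's policy-side regularization, MOPO's penalty acts on the reward itself: the reward used for policy optimization is the model's predicted reward minus a scaled uncertainty estimate of the learned dynamics. A minimal sketch is given below, where ensemble disagreement stands in for the uncertainty estimator and the coefficient lam is an illustrative placeholder, not MOPO's exact design.

```python
# Illustrative sketch of an uncertainty-penalized reward in the spirit of MOPO.
import numpy as np

def penalized_reward(model_reward, ensemble_next_states, lam=1.0):
    """model_reward: scalar reward predicted by the dynamics model.
    ensemble_next_states: array of shape (n_members, state_dim) holding each
    ensemble member's next-state prediction for the same (s, a) pair."""
    # Disagreement across ensemble members, taken as the largest per-dimension
    # standard deviation, serves as a crude uncertainty proxy u(s, a).
    uncertainty = np.std(ensemble_next_states, axis=0).max()
    return model_reward - lam * uncertainty

# Toy usage: 5 ensemble members predicting a 3-dimensional next state.
preds = np.random.default_rng(0).normal(size=(5, 3))
r_tilde = penalized_reward(1.0, preds, lam=0.5)
```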