A Policy-Guided Imitation Approach for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2210.08323v3
- Date: Wed, 5 Apr 2023 04:58:45 GMT
- Title: A Policy-Guided Imitation Approach for Offline Reinforcement Learning
- Authors: Haoran Xu, Li Jiang, Jianxiong Li, Xianyuan Zhan
- Abstract summary: We introduce Policy-guided Offline RL (\texttt{POR}).
\texttt{POR} demonstrates state-of-the-art performance on D4RL, a standard benchmark for offline RL.
- Score: 9.195775740684248
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Offline reinforcement learning (RL) methods can generally be categorized into
two types: RL-based and Imitation-based. RL-based methods could in principle
enjoy out-of-distribution generalization but suffer from erroneous off-policy
evaluation. Imitation-based methods avoid off-policy evaluation but are too
conservative to surpass the dataset. In this study, we propose an alternative
approach, inheriting the training stability of imitation-style methods while
still allowing logical out-of-distribution generalization. We decompose the
conventional reward-maximizing policy in offline RL into a guide-policy and an
execute-policy. During training, the guide-policy and execute-policy are learned
using only data from the dataset, in a supervised and decoupled manner. During
evaluation, the guide-policy guides the execute-policy by telling where it
should go so that the reward can be maximized, serving as the \textit{Prophet}.
By doing so, our algorithm allows \textit{state-compositionality} from the
dataset, rather than \textit{action-compositionality} conducted in prior
imitation-style methods. We dub this new approach Policy-guided Offline RL
(\texttt{POR}). \texttt{POR} demonstrates state-of-the-art performance on
D4RL, a standard benchmark for offline RL. We also highlight the benefits of
\texttt{POR} in terms of improving with supplementary suboptimal data and
easily adapting to new tasks by only changing the guide-policy.
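The abstract describes the decomposition only at a high level; the following is a minimal, hypothetical sketch (not the authors' released implementation) of the evaluation-time composition, assuming the guide-policy maps a state to a target next state and the execute-policy maps (state, target state) to an action. All class and function names are illustrative.

```python
import torch
import torch.nn as nn


class GuidePolicy(nn.Module):
    """Guide-policy g(s): predicts a desirable target next state."""

    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state):
        return self.net(state)


class ExecutePolicy(nn.Module):
    """Execute-policy pi(a | s, s'): predicts the action that moves s toward s'."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state, target_state):
        return self.net(torch.cat([state, target_state], dim=-1))


@torch.no_grad()
def act(guide, execute, state):
    """Evaluation-time composition: the guide says where to go,
    the execute-policy produces the action to get there."""
    target = guide(state)            # "where should I go?"
    action = execute(state, target)  # "how do I get there?"
    return action
```

In this reading, each part would be fit with a supervised objective on dataset transitions only, matching the decoupled, imitation-style training described above.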
Related papers
- Diffusion Policies for Out-of-Distribution Generalization in Offline Reinforcement Learning [1.9336815376402723]
Offline RL methods leverage previous experiences to learn better policies than the behavior policy used for data collection.
However, offline RL algorithms face challenges in handling distribution shifts and effectively representing policies due to the lack of online interaction during training.
We introduce a novel method named State Reconstruction for Diffusion Policies (SRDP), incorporating state reconstruction feature learning in the recent class of diffusion policies.
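The summary gives only the high-level idea, so here is a generic, hypothetical sketch of adding a state-reconstruction auxiliary objective to a policy's state encoder (not SRDP's architecture or loss); all names and sizes are illustrative.

```python
import torch.nn as nn


class StateAutoencoder(nn.Module):
    """Auxiliary state-reconstruction head: encourages the state encoder to
    learn features from which the input state can be reconstructed."""

    def __init__(self, state_dim, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, state_dim))

    def reconstruction_loss(self, state):
        # Added to the policy's training loss as an auxiliary term.
        return nn.functional.mse_loss(self.decoder(self.encoder(state)), state)
```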
arXiv Detail & Related papers (2023-07-10T17:34:23Z)
- Iteratively Refined Behavior Regularization for Offline Reinforcement Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
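As a rough, hypothetical sketch of this idea (not the paper's exact algorithm), the reference policy in the behavior-regularization term can periodically be replaced by a copy of the current policy; the loss form and refresh schedule below are assumptions.

```python
def regularized_actor_loss(q_value, log_prob_pi, log_prob_ref, alpha=1.0):
    """Behavior-regularized objective: maximize Q while penalizing divergence
    from the reference policy via a sample-based log-ratio (KL-style) term.
    Inputs are batched torch tensors."""
    return (-q_value + alpha * (log_prob_pi - log_prob_ref)).mean()


def refine_reference(policy, reference, step, refresh_every=10_000):
    """Periodically replace the reference with the current policy (both
    torch nn.Modules) so the regularization target itself improves."""
    if step % refresh_every == 0:
        reference.load_state_dict(policy.state_dict())
```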
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
- Offline Reinforcement Learning with Adaptive Behavior Regularization [1.491109220586182]
Offline reinforcement learning (RL) defines a sample-efficient learning paradigm, where a policy is learned from static and previously collected datasets.
We propose a novel approach, which we refer to as adaptive behavior regularization (ABR).
ABR enables the policy to adaptively adjust its optimization objective between cloning and improving over the policy used to generate the dataset.
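The summary does not specify the adaptive rule, so the following is only a generic, hypothetical sketch of interpolating between a cloning loss and a Q-improvement term; the `weight` schedule, which is the part ABR actually contributes, is left abstract here.

```python
def adaptive_bc_rl_loss(q_value, bc_loss, weight):
    """Interpolate between cloning the dataset policy and improving on it.

    weight in [0, 1]: 0 -> pure behavior cloning of the data-generating policy,
                      1 -> pure Q-maximizing improvement.
    Inputs are batched torch tensors; how `weight` adapts is not modeled here.
    """
    return (1.0 - weight) * bc_loss.mean() - weight * q_value.mean()
```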
arXiv Detail & Related papers (2022-11-15T15:59:11Z)
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
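The summary notes that ReD needs fewer than ten lines of code change; as a rough, hypothetical illustration of return-based rebalancing (not the authors' code), trajectories can be resampled with probability increasing in their return, which never leaves the dataset's support.

```python
import numpy as np


def rebalance_indices(returns, num_samples, temperature=1.0, rng=None):
    """Resample trajectory indices with probability increasing in return.

    Because sampling stays within the dataset, the support of the data
    distribution is unchanged; only the sampling weights shift toward
    high-return behavior.
    """
    rng = rng or np.random.default_rng(0)
    returns = np.asarray(returns, dtype=np.float64)
    # Softmax over (scaled) returns gives the resampling distribution.
    z = (returns - returns.max()) / max(temperature, 1e-8)
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(returns), size=num_samples, replace=True, p=probs)
```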
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
- Supported Policy Optimization for Offline Reinforcement Learning [74.1011309005488]
Policy constraint methods for offline reinforcement learning (RL) typically utilize parameterization or regularization.
Regularization methods reduce the divergence between the learned policy and the behavior policy.
This paper presents Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint.
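As a hedged sketch of a density-based support constraint in general (not SPOT's exact objective), the actor can be penalized whenever its action has low estimated behavior density; the pretrained density model and the `lam` coefficient below are assumed, illustrative inputs.

```python
def support_constrained_actor_loss(q_value, log_behavior_density, lam=0.1):
    """Sketch: maximize Q(s, pi(s)) while keeping pi(s) where the estimated
    behavior density is high, i.e. inside the support of the dataset.

    q_value              -- Q(s, pi(s)) for the policy's action (torch tensor)
    log_behavior_density -- log p_beta(pi(s) | s) from a pretrained density
                            model (e.g. a VAE lower bound); assumed given
    """
    return (-q_value - lam * log_behavior_density).mean()
```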
arXiv Detail & Related papers (2022-02-13T07:38:36Z)
- Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment.
We propose \textit{Curriculum Offline Imitation Learning} (COIL), which utilizes an experience picking strategy for imitating adaptive neighboring policies with a higher return.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids just learning a mediocre behavior on mixed datasets but is also even competitive with state-of-the-art offline RL methods.
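As a loose, hypothetical sketch of an experience-picking curriculum (not COIL's actual criterion), one could keep only trajectories that outperform the current policy and imitate those most similar to it; the data layout and scoring callable here are assumptions.

```python
def pick_neighboring_experiences(trajectories, policy_log_likelihood,
                                 current_return, top_k=10):
    """Curriculum-style experience picking (sketch): keep trajectories whose
    return exceeds the current policy's and, among those, imitate the ones
    most similar to it (highest likelihood under the current policy).

    trajectories          -- list of dicts with a 'return' key
    policy_log_likelihood -- callable scoring a trajectory under the policy
    """
    better = [t for t in trajectories if t["return"] > current_return]
    better.sort(key=policy_log_likelihood, reverse=True)
    return better[:top_k]
```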
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
- OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics.
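A minimal sketch of the uncertainty-penalized reward idea, r_tilde(s, a) = r(s, a) - lambda * u(s, a), assuming (as an illustration, not MOPO's implementation) that u is taken from the disagreement of an ensemble of learned dynamics models.

```python
import numpy as np


def penalized_reward(reward, next_state_preds, penalty_coef=1.0):
    """Return r(s, a) minus a penalty proportional to model uncertainty,
    estimated here as the spread of next-state predictions across an
    ensemble of dynamics models.

    next_state_preds -- array of shape (ensemble_size, state_dim)
    """
    uncertainty = np.linalg.norm(np.std(next_state_preds, axis=0))
    return reward - penalty_coef * uncertainty
```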
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.