Supported Policy Optimization for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2202.06239v1
- Date: Sun, 13 Feb 2022 07:38:36 GMT
- Title: Supported Policy Optimization for Offline Reinforcement Learning
- Authors: Jialong Wu, Haixu Wu, Zihan Qiu, Jianmin Wang, Mingsheng Long
- Abstract summary: Policy constraint methods to offline reinforcement learning (RL) typically utilize parameterization or regularization.
Regularization methods reduce the divergence between the learned policy and the behavior policy.
This paper presents Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint.
- Score: 74.1011309005488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy constraint methods for offline reinforcement learning (RL)
typically utilize parameterization or regularization that constrains the policy
to perform actions within the support set of the behavior policy. The elaborate
designs of parameterization methods usually intrude into the policy networks,
which may bring extra inference cost and cannot take full advantage of
well-established online methods. Regularization methods reduce the divergence
between the learned policy and the behavior policy, which may mismatch the
inherent density-based definition of the support set and thereby fail to avoid
out-of-distribution actions effectively. This paper presents Supported Policy
OpTimization (SPOT), which is directly derived from the theoretical
formalization of the density-based support constraint. SPOT adopts a VAE-based
density estimator to explicitly model the support set of the behavior policy
and presents a simple but effective density-based regularization term, which
can be plugged non-intrusively into off-the-shelf off-policy RL algorithms. On
standard benchmarks for offline RL, SPOT substantially outperforms
state-of-the-art offline RL methods. Benefiting from the pluggable design,
models pretrained offline with SPOT can also be fine-tuned online seamlessly.
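The core idea can be illustrated with a small sketch: a pretrained conditional VAE supplies an ELBO as a tractable lower bound on the behavior log-density, and its negative is added as a penalty to a TD3-style actor objective. This is a minimal sketch under stated assumptions, not the paper's released code; names such as `vae_elbo`, `actor_loss`, and `lambda_reg` are illustrative, and the interfaces of `vae`, `actor`, and `critic` are assumed.

```python
# Hedged sketch: density-regularized actor update in the spirit of SPOT.
# Assumes a pretrained conditional VAE over behavior actions; its ELBO is a
# tractable lower bound on log pi_beta(a | s).
import torch
import torch.nn.functional as F

def vae_elbo(vae, state, action):
    """Evidence lower bound on log p(action | state) under the behavior VAE."""
    recon, mean, log_std = vae(state, action)            # assumed VAE interface
    recon_loss = F.mse_loss(recon, action, reduction="none").sum(-1)
    kl = -0.5 * (1 + 2 * log_std - mean.pow(2) - (2 * log_std).exp()).sum(-1)
    return -(recon_loss + kl)                             # higher = more in-support

def actor_loss(actor, critic, vae, state, lambda_reg=0.1):
    """TD3-style policy objective plus a density-based support penalty."""
    action = actor(state)
    q_value = critic(state, action)
    support_penalty = -vae_elbo(vae, state, action)       # low density -> large penalty
    return (-q_value + lambda_reg * support_penalty).mean()
```

Because the penalty is only an extra term in the actor loss, the critic updates and target networks of the underlying off-policy algorithm can be left untouched, which is what makes the design pluggable.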
Related papers
- CDSA: Conservative Denoising Score-based Algorithm for Offline Reinforcement Learning [25.071018803326254]
Distribution shift is a major obstacle in offline reinforcement learning.
Previous conservative offline RL algorithms struggle to generalize to unseen actions.
We propose to use the gradient fields of the dataset density generated from a pre-trained offline RL algorithm to adjust the original actions.
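A loose sketch of that adjustment step, assuming access to a differentiable estimate of the dataset's conditional action log-density (`log_density_model`, `step_size`, and `n_steps` are illustrative placeholders, not from the paper):

```python
# Hedged sketch: nudge a proposed action along the gradient of the estimated
# dataset log-density so it moves toward in-distribution actions.
import torch

def adjust_action(log_density_model, state, action, step_size=0.05, n_steps=5):
    action = action.clone().requires_grad_(True)
    for _ in range(n_steps):
        log_p = log_density_model(state, action).sum()
        grad = torch.autograd.grad(log_p, action)[0]      # gradient field of the density
        action = (action + step_size * grad).detach().requires_grad_(True)
    return action.detach()
```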
arXiv Detail & Related papers (2024-06-11T17:59:29Z)
- Preferred-Action-Optimized Diffusion Policies for Offline Reinforcement Learning [19.533619091287676]
We propose a novel preferred-action-optimized diffusion policy for offline reinforcement learning.
In particular, an expressive conditional diffusion model is utilized to represent the diverse distribution of a behavior policy.
Experiments demonstrate that the proposed method provides competitive or superior performance compared to previous state-of-the-art offline RL methods.
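As a rough illustration of the behavior-modeling part only, a standard conditional denoising objective for a diffusion policy over actions is sketched below; the preference-based optimization the paper adds on top is not shown, and `noise_model` and `alphas_cumprod` are assumed names:

```python
# Hedged sketch: DDPM-style training loss for a conditional diffusion model of
# the behavior distribution pi_beta(a | s).
import torch

def diffusion_bc_loss(noise_model, state, action, alphas_cumprod):
    batch = action.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (batch,), device=action.device)
    noise = torch.randn_like(action)
    a_bar = alphas_cumprod[t].unsqueeze(-1)               # per-sample noise level
    noisy_action = a_bar.sqrt() * action + (1.0 - a_bar).sqrt() * noise
    pred_noise = noise_model(noisy_action, state, t)      # predict the injected noise
    return ((pred_noise - noise) ** 2).mean()
```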
arXiv Detail & Related papers (2024-05-29T03:19:59Z)
- Iteratively Refined Behavior Regularization for Offline Reinforcement Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
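A minimal sketch of the iterative-refinement idea, assuming a KL-style behavior regularizer whose reference policy is periodically replaced by a copy of the current policy (all interfaces here are illustrative assumptions, not the paper's code):

```python
# Hedged sketch: behavior-regularized actor loss with a refreshable reference.
import copy

def behavior_regularized_loss(actor, critic, reference, state, alpha=1.0):
    """Maximize Q while staying close to the reference policy."""
    action, log_prob = actor.sample(state)
    kl_sample = log_prob - reference.log_prob(state, action)   # one-sample KL estimate
    return (-critic(state, action) + alpha * kl_sample).mean()

def refresh_reference(actor):
    """Periodically replace the reference policy with a copy of the current one,
    so the constraint tightens around progressively better policies."""
    return copy.deepcopy(actor)
```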
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithm.
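The double-sampling issue can be made concrete with a standard identity (generic notation, not the paper's exact formulation): the squared-expectation term of the variance is rewritten as an inner optimization whose objective contains only a single expectation.

```latex
\mathrm{Var}(X)
  \;=\; \mathbb{E}[X^{2}] - \big(\mathbb{E}[X]\big)^{2}
  \;=\; \min_{\nu \in \mathbb{R}} \, \mathbb{E}\big[(X-\nu)^{2}\big],
\qquad \nu^{*} = \mathbb{E}[X].
```

Because the dual form contains no square of an expectation, an unbiased gradient estimate of the regularizer needs only one batch of samples.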
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Regularizing a Model-based Policy Stationary Distribution to Stabilize Offline Reinforcement Learning [62.19209005400561]
Offline reinforcement learning (RL) extends the paradigm of classical RL algorithms to purely learning from static datasets.
A key challenge of offline RL is the instability of policy training, caused by the mismatch between the distribution of the offline data and the undiscounted stationary state-action distribution of the learned policy.
We regularize the undiscounted stationary distribution of the current policy towards the offline data during the policy optimization process.
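In equation form, the regularized objective described by this summary looks roughly as follows (generic notation assumed here, not the paper's exact formulation):

```latex
\max_{\pi}\;
\mathbb{E}_{(s,a)\sim d^{\pi}}\big[r(s,a)\big]
\;-\;
\lambda\, D\big(d^{\pi}\,\big\|\,d^{\mathcal{D}}\big),
```

where d^pi is the undiscounted stationary state-action distribution of the current policy, d^D is the empirical distribution of the offline data, D is a divergence, and lambda > 0 controls how strongly the policy is pulled toward the data.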
arXiv Detail & Related papers (2022-06-14T20:56:16Z)
- OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
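The distribution-correction idea can be summarized with a standard DICE-style identity (generic notation, assumed rather than taken from the paper):

```latex
w^{*}(s,a) \;=\; \frac{d^{\pi^{*}}(s,a)}{d^{\mathcal{D}}(s,a)},
\qquad
\mathbb{E}_{(s,a)\sim d^{\pi^{*}}}\big[r(s,a)\big]
\;=\; \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\big[w^{*}(s,a)\,r(s,a)\big],
```

so estimating the correction ratios w* directly from the offline data suffices to evaluate, and then extract, the optimal policy without querying out-of-distribution actions.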
arXiv Detail & Related papers (2021-06-21T00:43:30Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by applying them to rewards that are artificially penalized by the uncertainty of the dynamics.
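That reward modification amounts to a single formula (the symbols lambda and u are generic here, not the paper's notation):

```latex
\tilde{r}(s,a) \;=\; \hat{r}(s,a) \;-\; \lambda\, u(s,a),
```

where \hat{r} is the learned reward model, u(s,a) estimates the uncertainty of the learned dynamics at (s,a), and lambda > 0 sets how conservatively the policy treats model error.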
arXiv Detail & Related papers (2020-05-27T08:46:41Z)