Dual Generator Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2211.01471v1
- Date: Wed, 2 Nov 2022 20:25:18 GMT
- Title: Dual Generator Offline Reinforcement Learning
- Authors: Quan Vuong, Aviral Kumar, Sergey Levine, Yevgen Chebotar
- Abstract summary: In offline RL, constraining the learned policy to remain close to the data is essential.
In practice, GAN-based offline RL methods have not performed as well as alternative approaches.
We show that having two generators not only enables an effective GAN-based offline RL method but also approximates a support constraint.
- Score: 90.05278061564198
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In offline RL, constraining the learned policy to remain close to the data is
essential to prevent the policy from outputting out-of-distribution (OOD)
actions with erroneously overestimated values. In principle, generative
adversarial networks (GAN) can provide an elegant solution to do so, with the
discriminator directly providing a probability that quantifies distributional
shift. However, in practice, GAN-based offline RL methods have not performed as
well as alternative approaches, perhaps because the generator is trained to
both fool the discriminator and maximize return -- two objectives that can be
at odds with each other. In this paper, we show that the issue of conflicting
objectives can be resolved by training two generators: one that maximizes
return, with the other capturing the "remainder" of the data distribution in
the offline dataset, such that the mixture of the two is close to the behavior
policy. We show that having two generators not only enables an effective
GAN-based offline RL method but also approximates a support constraint,
where the policy does not need to match the entire data distribution, only
the slice of the data that leads to high long-term performance. We name our method
DASCO, for Dual-Generator Adversarial Support Constrained Offline RL. On
benchmark tasks that require learning from sub-optimal data, DASCO
significantly outperforms prior methods that enforce distribution constraints.
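Below is a minimal sketch, in PyTorch-style Python, of the dual-generator training loop the abstract describes: a discriminator compares dataset actions against a mixture of the two generators, the return-maximizing generator additionally maximizes a Q-value, and the auxiliary generator only helps the mixture match the data. The module interfaces (`policy`, `aux_gen`, `disc`, `q_net`), the 50/50 mixture, and the weight `lambda_adv` are illustrative assumptions, not the paper's actual implementation; critic training is omitted.

```python
# Hedged sketch of a dual-generator adversarial update (not DASCO's reference
# implementation). All module interfaces and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def dual_generator_step(policy, aux_gen, disc, q_net, opt_disc, opt_gen,
                        states, dataset_actions, mix_weight=0.5, lambda_adv=1.0):
    batch = states.size(0)
    mask = (torch.rand(batch, 1, device=states.device) < mix_weight).float()

    # --- Discriminator: dataset actions are "real", the generator mixture is "fake".
    with torch.no_grad():
        mixture = mask * policy(states) + (1 - mask) * aux_gen(states)
    real_logits = disc(states, dataset_actions)
    fake_logits = disc(states, mixture)
    disc_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
                 + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    opt_disc.zero_grad(); disc_loss.backward(); opt_disc.step()

    # --- Generators: the *mixture* of the two must fool the discriminator, so the
    # return-maximizing policy is not forced to cover the whole data distribution
    # by itself; only the policy generator additionally maximizes Q.
    pi_actions = policy(states)
    mixture = mask * pi_actions + (1 - mask) * aux_gen(states)
    mix_logits = disc(states, mixture)
    adv_loss = F.binary_cross_entropy_with_logits(mix_logits, torch.ones_like(mix_logits))
    gen_loss = -q_net(states, pi_actions).mean() + lambda_adv * adv_loss
    opt_gen.zero_grad(); gen_loss.backward(); opt_gen.step()

    return disc_loss.item(), gen_loss.item()
```

Using a single optimizer over both generators is a simplification of this sketch; separate optimizers would serve the same purpose.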
Related papers
- DiffPoGAN: Diffusion Policies with Generative Adversarial Networks for Offline Reinforcement Learning [22.323173093804897]
Offline reinforcement learning can learn optimal policies from pre-collected offline datasets without interacting with the environment.
Recent works address the problem of out-of-distribution actions by employing generative adversarial networks (GANs).
Inspired by diffusion models, we propose a new offline RL method named Diffusion Policies with Generative Adversarial Networks (DiffPoGAN).
arXiv Detail & Related papers (2024-06-13T13:15:40Z)
- Bridging Distributionally Robust Learning and Offline RL: An Approach to Mitigate Distribution Shift and Partial Data Coverage [32.578787778183546]
Offline reinforcement learning (RL) algorithms learn optimal policies using historical (offline) data.
One of the main challenges in offline RL is the distribution shift.
We propose two offline RL algorithms using the distributionally robust learning (DRL) framework.
arXiv Detail & Related papers (2023-10-27T19:19:30Z)
- Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning [66.43003402281659]
A central question boils down to how to efficiently utilize online data collection to strengthen and complement the offline dataset.
We design a three-stage hybrid RL algorithm that beats the best of both worlds -- pure offline RL and pure online RL.
The proposed algorithm does not require any reward information during data collection.
arXiv Detail & Related papers (2023-05-17T15:17:23Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithm.
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
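To make the double-sampling remark above concrete, here is the standard variational identity that such a Fenchel-duality argument typically rests on, written for a generic random variable X; the exact objective used in OVAR may differ.

```latex
% Standard variational form of the variance; illustrative, not necessarily
% the exact formulation used in OVAR.
(\mathbb{E}[X])^2 \;=\; \max_{\nu \in \mathbb{R}} \bigl( 2\nu\,\mathbb{E}[X] - \nu^2 \bigr)
\quad\Longrightarrow\quad
\mathrm{Var}(X) \;=\; \mathbb{E}[X^2] - (\mathbb{E}[X])^2
\;=\; \min_{\nu \in \mathbb{R}} \mathbb{E}\bigl[(X - \nu)^2\bigr].
```

Because the squared expectation is replaced by an inner optimization over a scalar dual variable, an unbiased gradient of E[(X - nu)^2] at a fixed nu needs only one sample of X, which is how the double-sampling issue is avoided.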
- Offline Reinforcement Learning with Adaptive Behavior Regularization [1.491109220586182]
Offline reinforcement learning (RL) defines a sample-efficient learning paradigm, where a policy is learned from static and previously collected datasets.
We propose a novel approach, which we refer to as adaptive behavior regularization (ABR).
ABR enables the policy to adaptively adjust its optimization objective between cloning and improving over the policy used to generate the dataset.
arXiv Detail & Related papers (2022-11-15T15:59:11Z)
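The ABR summary above describes adaptively trading off behavior cloning against policy improvement; the sketch below illustrates only that general pattern. The sigmoid-of-advantage weight is purely an assumption for illustration and is not claimed to be ABR's actual rule.

```python
# Generic sketch of adaptively weighting behavior cloning against policy
# improvement. The advantage-based gate is an assumption for illustration,
# not ABR's published mechanism.
import torch

def adaptive_bc_policy_loss(policy, q_net, value_net, states, dataset_actions):
    pi_actions = policy(states)

    # Per-sample improvement term: prefer actions with high Q-value.
    improve_term = -q_net(states, pi_actions).squeeze(-1)

    # Per-sample cloning term: stay close to the dataset action.
    clone_term = ((pi_actions - dataset_actions) ** 2).sum(dim=-1)

    # Adaptive weight (assumed form): clone more where the dataset action
    # already looks good relative to the state value, improve elsewhere.
    with torch.no_grad():
        advantage = (q_net(states, dataset_actions) - value_net(states)).squeeze(-1)
        w = torch.sigmoid(advantage)  # in (0, 1), per sample

    return (w * clone_term + (1.0 - w) * improve_term).mean()
```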
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
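As the ReD summary above notes, the method amounts to a small resampling change; the sketch below shows one way return-based trajectory resampling can look. The softmax weighting and temperature are assumptions for illustration, not necessarily ReD's exact scheme.

```python
# Hedged sketch of return-based data rebalancing: resample trajectories with
# probability increasing in their return, while every original trajectory keeps
# non-zero probability, so the distribution support is unchanged.
import numpy as np

def rebalance_by_return(trajectories, temperature=1.0, rng=None):
    """trajectories: list of dicts, each holding a scalar 'return' and its transitions."""
    if rng is None:
        rng = np.random.default_rng(0)
    returns = np.array([t["return"] for t in trajectories], dtype=np.float64)

    # Normalize returns, then turn them into sampling weights via a softmax.
    z = (returns - returns.mean()) / (returns.std() + 1e-8)
    probs = np.exp(z / temperature)
    probs /= probs.sum()

    # Resample trajectory indices with replacement according to the weights.
    idx = rng.choice(len(trajectories), size=len(trajectories), replace=True, p=probs)
    return [trajectories[i] for i in idx]
```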
- OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by training them on rewards that are artificially penalized by the uncertainty of the dynamics.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
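The MOPO summary above describes penalizing the model reward by an estimate of dynamics uncertainty; a minimal sketch of that idea follows. Using ensemble disagreement as the uncertainty proxy and the value of the penalty coefficient are illustrative assumptions.

```python
# Hedged sketch of an uncertainty-penalized reward for model-based offline RL,
# in the spirit of the MOPO summary above: r_tilde = r - lambda * u(s, a).
# Ensemble disagreement as the uncertainty proxy u(s, a) is an assumption here.
import numpy as np

def penalized_reward(ensemble_next_state_preds, model_reward, penalty_coef=1.0):
    """ensemble_next_state_preds: array of shape (n_models, state_dim) holding
    each dynamics model's predicted next state for one (s, a) pair."""
    preds = np.asarray(ensemble_next_state_preds, dtype=np.float64)
    # Disagreement across the ensemble as a crude uncertainty estimate u(s, a).
    uncertainty = float(np.linalg.norm(preds.std(axis=0)))
    return model_reward - penalty_coef * uncertainty
```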
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.