TD3 with Reverse KL Regularizer for Offline Reinforcement Learning from
Mixed Datasets
- URL: http://arxiv.org/abs/2212.02125v1
- Date: Mon, 5 Dec 2022 09:36:23 GMT
- Title: TD3 with Reverse KL Regularizer for Offline Reinforcement Learning from
Mixed Datasets
- Authors: Yuanying Cai, Chuheng Zhang, Li Zhao, Wei Shen, Xuyun Zhang, Lei Song,
Jiang Bian, Tao Qin, Tieyan Liu
- Abstract summary: We consider an offline reinforcement learning (RL) setting where the agent needs to learn from a dataset collected by rolling out multiple behavior policies.
There are two challenges for this setting: 1) The optimal trade-off between optimizing the RL signal and the behavior cloning (BC) signal varies across states, because different behavior policies induce different action coverage. 2) For a given state, the action distribution generated by different behavior policies may be multi-modal, and mean-seeking BC regularizers can push the policy toward out-of-distribution actions between the modes.
In this paper, we address both challenges by using an adaptively weighted reverse Kullback-Leibler (KL) divergence as the BC regularizer based on the TD3 algorithm.
- Score: 118.22975463000928
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider an offline reinforcement learning (RL) setting where the agent
needs to learn from a dataset collected by rolling out multiple behavior
policies. There are two challenges for this setting: 1) The optimal trade-off
between optimizing the RL signal and the behavior cloning (BC) signal varies
across states, due to the variation in action coverage induced by different
behavior policies. Previous methods fail to handle this because they control
only a global trade-off. 2) For a given state, the action distribution
generated by different behavior policies may have multiple modes. The BC
regularizers in many previous methods are mean-seeking, resulting in policies
that select out-of-distribution (OOD) actions lying between the modes. In
this paper, we address both challenges by using an adaptively weighted reverse
Kullback-Leibler (KL) divergence as the BC regularizer based on the TD3
algorithm. Our method not only trades off the RL and BC signals with per-state
weights (i.e., strong BC regularization on the states with narrow action
coverage, and vice versa) but also avoids selecting OOD actions thanks to the
mode-seeking property of reverse KL. Empirically, our algorithm can outperform
existing offline RL algorithms in the MuJoCo locomotion tasks with the standard
D4RL datasets as well as the mixed datasets that combine the standard datasets.
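As a concretization of the approach described in the abstract, the snippet below sketches what a TD3-style actor objective with a per-state-weighted reverse-KL behavior regularizer could look like. It is a rough illustration under stated assumptions, not the paper's implementation: `behavior_log_prob` stands in for a separately fitted density model of the mixture behavior policy, `state_weight` for the adaptive per-state coefficient described above, and the Q-value scaling follows the TD3+BC convention.

```python
# A minimal, hypothetical sketch (PyTorch) of a TD3-style actor objective with a
# per-state-weighted reverse-KL behavior regularizer, in the spirit of the
# abstract above.  `behavior_log_prob`, `state_weight`, and the Q-scaling rule
# are illustrative assumptions, not the paper's exact design.

import torch


def actor_loss(actor, critic, behavior_log_prob, state_weight, states):
    """Actor objective for one batch of dataset states.

    actor(states)              -> actions, shape (B, act_dim)
    critic(states, actions)    -> Q estimates, shape (B, 1)
    behavior_log_prob(s, a)    -> log mu_hat(a | s) under a separately fitted
                                  density model of the mixture behavior policy
    state_weight(s)            -> per-state weight alpha(s) >= 0, larger where
                                  the dataset's action coverage is narrow
    """
    actions = actor(states)
    q = critic(states, actions)

    # A common surrogate for the mode-seeking reverse KL, D_KL(pi || mu), with a
    # (nearly) deterministic actor: the negative log-density of the selected
    # action under the estimated behavior policy.  Being mode-seeking, it pulls
    # pi(s) onto one mode of mu(. | s) rather than toward the possibly
    # out-of-distribution mean between modes.
    reverse_kl = -behavior_log_prob(states, actions)

    alpha = state_weight(states).detach()   # adaptive per-state trade-off
    lam = 1.0 / q.abs().mean().detach()     # TD3+BC-style Q normalization

    # Maximize Q while keeping the policy close to the data, with the strength
    # of the BC term chosen per state rather than by one global coefficient.
    return (-lam * q + alpha * reverse_kl).mean()
```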
Related papers
- CDSA: Conservative Denoising Score-based Algorithm for Offline Reinforcement Learning [25.071018803326254]
Distribution shift is a major obstacle in offline reinforcement learning.
Previous conservative offline RL algorithms struggle to generalize to unseen actions.
We propose to use the gradient fields of the dataset density generated from a pre-trained offline RL algorithm to adjust the original actions.
arXiv Detail & Related papers (2024-06-11T17:59:29Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints [82.43359506154117]
We show that typical offline reinforcement learning methods fail to learn from data with non-uniform variability.
Our method is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
arXiv Detail & Related papers (2022-11-02T11:36:06Z)
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between the learning policy and the dataset.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time (a rough sketch of this resampling idea appears after this list).
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
- Mutual Information Regularized Offline Reinforcement Learning [76.05299071490913]
We propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset.
We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset.
We introduce 3 different variants of MISA, and empirically demonstrate that tighter mutual information lower bound gives better offline RL performance.
arXiv Detail & Related papers (2022-10-14T03:22:43Z)
- BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions.
We improve the behavior regularized offline reinforcement learning and propose BRAC+.
arXiv Detail & Related papers (2021-10-02T23:55:49Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods so that they are applied to rewards artificially penalized by the uncertainty of the dynamics.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
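As a side note on the data-rebalancing entry above (ReD), the following is a rough, hypothetical sketch of return-based resampling: transitions are drawn with probability proportional to the (shifted) return of the episode they come from, so high-return data is seen more often while the support of the dataset is unchanged. The shifting rule and the function names are assumptions for illustration, not the paper's exact recipe.

```python
# Hypothetical sketch of return-based data rebalancing (in the spirit of ReD):
# resample transitions with weights derived from episode returns, which
# reweights the dataset toward high-return behavior without changing its
# support.
import numpy as np


def rebalanced_indices(episode_returns, episode_of_transition, rng=None):
    """Sample transition indices proportional to (shifted) episode return.

    episode_returns       : shape (num_episodes,), return of each episode
    episode_of_transition : shape (num_transitions,), episode index of each
                            transition in the flat dataset
    """
    rng = rng or np.random.default_rng()
    # Shift returns to be strictly positive so they can act as weights
    # (this particular shift is an illustrative assumption).
    weights = episode_returns - episode_returns.min() + 1e-3
    probs = weights[episode_of_transition]
    probs = probs / probs.sum()
    n = len(episode_of_transition)
    return rng.choice(n, size=n, replace=True, p=probs)
```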
This list is automatically generated from the titles and abstracts of the papers on this site.