Latent-Variable Advantage-Weighted Policy Optimization for Offline RL
- URL: http://arxiv.org/abs/2203.08949v1
- Date: Wed, 16 Mar 2022 21:17:03 GMT
- Title: Latent-Variable Advantage-Weighted Policy Optimization for Offline RL
- Authors: Xi Chen, Ali Ghadirzadeh, Tianhe Yu, Yuan Gao, Jianhao Wang, Wenzhe
Li, Bin Liang, Chelsea Finn and Chongjie Zhang
- Abstract summary: Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions.
In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios.
We propose to leverage latent-variable policies that can represent a broader class of policy distributions.
Our method improves the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets.
- Score: 70.01851346635637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline reinforcement learning methods hold the promise of learning policies
from pre-collected datasets without the need to query the environment for new
transitions. This setting is particularly well-suited for continuous control
robotic applications for which online data collection based on trial-and-error
is costly and potentially unsafe. In practice, offline datasets are often
heterogeneous, i.e., collected in a variety of scenarios, such as data from
several human demonstrators or from policies that act with different purposes.
Unfortunately, such datasets can exacerbate the distribution shift between the
behavior policy underlying the data and the optimal policy to be learned,
leading to poor performance. To address this challenge, we propose to leverage
latent-variable policies that can represent a broader class of policy
distributions, leading to better adherence to the training data distribution
while maximizing reward via a policy over the latent variable. As we
empirically show on a range of simulated locomotion, navigation, and
manipulation tasks, our method, referred to as latent-variable
advantage-weighted policy optimization (LAPO), improves the average performance
of the next best-performing offline reinforcement learning methods by 49% on
heterogeneous datasets, and by 8% on datasets with narrow and biased
distributions.
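As a reading aid, here is a minimal sketch of the two ingredients the abstract names: a conditional VAE that keeps decoded actions close to the offline data distribution, and a policy over the latent variable trained with advantage-based weights to favor high-reward behavior. The network sizes, the exponentiated-advantage weighting, and the place where the weighting enters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, LATENT_DIM = 17, 6, 8      # illustrative sizes, not tied to any task

class CVAE(nn.Module):
    """Generative model of dataset actions, p(a | s, z): keeps decoded actions on-distribution."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 256), nn.ReLU(),
                                     nn.Linear(256, 2 * LATENT_DIM))
        self.decoder = nn.Sequential(nn.Linear(STATE_DIM + LATENT_DIM, 256), nn.ReLU(),
                                     nn.Linear(256, ACTION_DIM))

    def elbo_loss(self, s, a, beta=0.5):
        mu, log_std = self.encoder(torch.cat([s, a], -1)).chunk(2, -1)
        z = mu + log_std.exp() * torch.randn_like(mu)                # reparameterization trick
        recon = self.decoder(torch.cat([s, z], -1))
        kl = -0.5 * (1 + 2 * log_std - mu ** 2 - (2 * log_std).exp()).sum(-1)
        return (((recon - a) ** 2).mean(-1) + beta * kl).mean()

# Policy over the latent variable, pi(z | s); environment actions are obtained by decoding z.
latent_policy = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(),
                              nn.Linear(256, LATENT_DIM), nn.Tanh())

def advantage_weighted_latent_loss(cvae, s, a, advantage, temperature=1.0):
    """Pull pi(z|s) toward the latents of high-advantage dataset actions (an AWR-style guess)."""
    with torch.no_grad():
        mu, _ = cvae.encoder(torch.cat([s, a], -1)).chunk(2, -1)     # latent code of the data action
        w = torch.clamp(torch.exp(advantage / temperature), max=20.0)
    return (w * ((latent_policy(s) - mu) ** 2).mean(-1)).mean()

def act(cvae, s):
    return cvae.decoder(torch.cat([s, latent_policy(s)], -1))
```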
Related papers
- DiffPoGAN: Diffusion Policies with Generative Adversarial Networks for Offline Reinforcement Learning [22.323173093804897]
Offline reinforcement learning can learn optimal policies from pre-collected offline datasets without interacting with the environment.
Recent works address the resulting distributional-shift issue by employing generative adversarial networks (GANs).
Inspired by diffusion models, we propose a new offline RL method named Diffusion Policies with Generative Adversarial Networks (DiffPoGAN).
arXiv Detail & Related papers (2024-06-13T13:15:40Z)
- A2PO: Towards Effective Offline Reinforcement Learning from an Advantage-aware Perspective [29.977702744504466]
We introduce a novel Advantage-Aware Policy Optimization (A2PO) method to explicitly construct advantage-aware policy constraints for offline learning.
A2PO employs a conditional variational auto-encoder to disentangle the action distributions of intertwined behavior policies (a sketch follows this entry).
Experiments conducted on both the single-quality and mixed-quality datasets of the D4RL benchmark demonstrate that A2PO yields results superior to those of its counterparts.
arXiv Detail & Related papers (2024-03-12T02:43:41Z)
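The distinctive step the summary names is the conditional variational auto-encoder used to disentangle the action distributions of intertwined behavior policies. Below is a compressed sketch of that conditioning idea only; the scalar advantage input, the layer sizes, and the decode-time usage are assumptions for illustration rather than A2PO's actual architecture.

```python
import torch
import torch.nn as nn

class AdvantageConditionedVAE(nn.Module):
    """CVAE whose encoder and decoder also see a (normalized) advantage value, so that
    action modes coming from different behavior policies are separated by that condition."""
    def __init__(self, state_dim=17, action_dim=6, latent_dim=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(state_dim + action_dim + 1, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * latent_dim))
        self.dec = nn.Sequential(nn.Linear(state_dim + 1 + latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, action_dim))

    def decode(self, s, advantage, z):
        # s: (B, state_dim), advantage: (B, 1), z: (B, latent_dim).
        # At improvement time one can condition on a high advantage value so the
        # sampled actions come from the better-performing behavior modes.
        return self.dec(torch.cat([s, advantage, z], -1))
```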
- Dataset Clustering for Improved Offline Policy Learning [7.873623003095065]
Offline policy learning aims to discover decision-making policies from previously-collected datasets without additional online interactions with the environment.
This paper studies a dataset characteristic that we refer to as multi-behavior, indicating that the dataset is collected using multiple policies that exhibit distinct behaviors.
We propose a behavior-aware deep clustering approach that partitions multi-behavior datasets into several uni-behavior subsets (a simplified sketch follows this entry).
arXiv Detail & Related papers (2024-02-14T20:01:41Z)
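The partitioning idea can be pictured with a deliberately simple stand-in: embed each trajectory, cluster the embeddings, and split the dataset by cluster. The paper proposes a behavior-aware deep clustering approach; the mean state-action embedding and the k-means step below are placeholders for it, and the number of clusters is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def partition_by_behavior(trajectories, n_behaviors=4):
    """trajectories: list of dicts with 'states' (T, S) and 'actions' (T, A) arrays."""
    # Crude trajectory embedding: mean state concatenated with mean action.
    embeddings = np.stack([
        np.concatenate([traj["states"].mean(0), traj["actions"].mean(0)])
        for traj in trajectories
    ])
    labels = KMeans(n_clusters=n_behaviors, n_init=10).fit_predict(embeddings)
    subsets = {k: [] for k in range(n_behaviors)}
    for traj, k in zip(trajectories, labels):
        subsets[k].append(traj)
    return subsets  # each roughly uni-behavior subset can be handed to an offline RL / BC learner
```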
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks from the D4RL benchmark.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Offline Imitation Learning with Suboptimal Demonstrations via Relaxed Distribution Matching [109.5084863685397]
Offline imitation learning (IL) promises the ability to learn performant policies from pre-collected demonstrations without interactions with the environment.
We present RelaxDICE, which employs an asymmetrically-relaxed f-divergence for explicit support regularization.
Our method significantly outperforms the best prior offline method in six standard continuous control environments.
arXiv Detail & Related papers (2023-03-05T03:35:11Z)
- Model-based trajectory stitching for improved behavioural cloning and its applications [7.462336024223669]
Trajectory Stitching (TS) generates new trajectories by 'stitching' pairs of states that were disconnected in the original data (a simplified sketch follows this entry).
We demonstrate that the iterative process of replacing old trajectories with new ones incrementally improves the underlying behavioural policy.
arXiv Detail & Related papers (2022-12-08T14:18:04Z)
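The core data operation, joining the prefix of one trajectory to the suffix of another at a pair of nearby states, can be sketched as follows. The paper's method is model-based and validates candidate stitches with learned dynamics and value models; the nearest-neighbour criterion and the distance threshold here are simplifying assumptions.

```python
import numpy as np

def stitch(traj_a, traj_b, max_dist=0.1):
    """traj_*: dicts with 'states' (T, S) and 'actions' (T, A). Returns a stitched
    trajectory, or None if no pair of states is close enough to join."""
    # Pairwise distances between every state in traj_a and every state in traj_b.
    dists = np.linalg.norm(traj_a["states"][:, None] - traj_b["states"][None, :], axis=-1)
    i, j = np.unravel_index(dists.argmin(), dists.shape)
    if dists[i, j] > max_dist:
        return None
    # Join traj_a up to state i with traj_b from state j+1 onward.
    return {
        "states":  np.concatenate([traj_a["states"][: i + 1], traj_b["states"][j + 1 :]]),
        "actions": np.concatenate([traj_a["actions"][: i + 1], traj_b["actions"][j + 1 :]]),
    }
```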
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- Offline Reinforcement Learning with Adaptive Behavior Regularization [1.491109220586182]
Offline reinforcement learning (RL) defines a sample-efficient learning paradigm, where a policy is learned from static and previously collected datasets.
We propose a novel approach, which we refer to as adaptive behavior regularization (ABR).
ABR enables the policy to adaptively adjust its optimization objective between cloning and improving over the policy used to generate the dataset (a sketch follows this entry).
arXiv Detail & Related papers (2022-11-15T15:59:11Z)
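The adaptive trade-off between cloning the behavior policy and improving on it can be pictured as a per-sample interpolation between a behavior-cloning term and a critic-maximization term. The sigmoid weighting and the assumed policy and critic interfaces below are illustrative guesses, not ABR's actual regularizer.

```python
import torch

def abr_style_policy_loss(policy, q_net, states, data_actions):
    """policy(states) -> actions; q_net(states, actions) -> (batch,) Q-values (assumed interfaces)."""
    policy_actions = policy(states)
    with torch.no_grad():
        # weight -> 1 where the data action beats the policy action (favor cloning),
        # weight -> 0 where the policy's own action already looks better (favor improving).
        w = torch.sigmoid(q_net(states, data_actions) - q_net(states, policy_actions))
    clone_term = ((policy_actions - data_actions) ** 2).mean(-1)
    improve_term = -q_net(states, policy_actions)
    return (w * clone_term + (1.0 - w) * improve_term).mean()
```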
- Regularizing a Model-based Policy Stationary Distribution to Stabilize Offline Reinforcement Learning [62.19209005400561]
Offline reinforcement learning (RL) extends the paradigm of classical RL algorithms to purely learning from static datasets.
A key challenge of offline RL is the instability of policy training, caused by the mismatch between the distribution of the offline data and the undiscounted stationary state-action distribution of the learned policy.
We regularize the undiscounted stationary distribution of the current policy towards the offline data during the policy optimization process.
arXiv Detail & Related papers (2022-06-14T20:56:16Z)