Let Offline RL Flow: Training Conservative Agents in the Latent Space of
Normalizing Flows
- URL: http://arxiv.org/abs/2211.11096v1
- Date: Sun, 20 Nov 2022 21:57:10 GMT
- Title: Let Offline RL Flow: Training Conservative Agents in the Latent Space of
Normalizing Flows
- Authors: Dmitriy Akimov, Vladislav Kurenkov, Alexander Nikulin, Denis Tarasov,
Sergey Kolesnikov
- Abstract summary: Offline reinforcement learning aims to train a policy on a pre-recorded and fixed dataset without any additional environment interactions.
We build upon recent works on learning policies in latent action spaces and use a special form of Normalizing Flows for constructing a generative model.
We evaluate our method on various locomotion and navigation tasks, demonstrating that our approach outperforms recently proposed algorithms.
- Score: 58.762959061522736
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline reinforcement learning aims to train a policy on a pre-recorded and
fixed dataset without any additional environment interactions. There are two
major challenges in this setting: (1) extrapolation error caused by
approximating the value of state-action pairs not well-covered by the training
data and (2) distributional shift between behavior and inference policies. One
way to tackle these problems is to induce conservatism - i.e., keeping the
learned policies closer to the behavioral ones. To achieve this, we build upon
recent works on learning policies in latent action spaces and use a special
form of Normalizing Flows for constructing a generative model, which we use as
a conservative action encoder. This Normalizing Flows action encoder is
pre-trained in a supervised manner on the offline dataset, and then an
additional policy model - controller in the latent space - is trained via
reinforcement learning. This approach avoids querying actions outside of the
training dataset and therefore does not require additional regularization for
out-of-dataset actions. We evaluate our method on various locomotion and
navigation tasks, demonstrating that our approach outperforms recently proposed
algorithms with generative action models on a large portion of datasets.
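The abstract describes a two-stage pipeline: first pre-train a state-conditioned Normalizing Flow on dataset actions by maximum likelihood, then train a latent-space controller with RL that only acts through the flow's decoder. The sketch below illustrates that pipeline under stated assumptions: a RealNVP-style conditional affine coupling flow stands in for the paper's "special form of Normalizing Flows"; the dimensions, networks, latent clamp, and names such as FlowActionEncoder and Coupling are placeholders rather than the authors' implementation; the critic's TD update and action squashing are omitted.

```python
# Minimal PyTorch sketch of the two-stage recipe in the abstract (assumptions noted above).
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 17, 6            # placeholder sizes
LATENT_DIM = ACTION_DIM                  # flows are dimension-preserving

def mlp(n_in, n_out):
    return nn.Sequential(nn.Linear(n_in, 256), nn.ReLU(), nn.Linear(256, n_out))

class Coupling(nn.Module):
    """Conditional affine coupling: transforms half of the vector given the
    other half and the state; invertible with a triangular Jacobian."""
    def __init__(self, dim, cond_dim, flip):
        super().__init__()
        self.flip, self.k = flip, dim // 2
        self.net = mlp(self.k + cond_dim, 2 * (dim - self.k))

    def _split(self, v):
        v1, v2 = v[:, :self.k], v[:, self.k:]
        return (v2, v1) if self.flip else (v1, v2)

    def _merge(self, v1, v2):
        return torch.cat([v2, v1], -1) if self.flip else torch.cat([v1, v2], -1)

    def forward(self, z, s):                      # latent -> action (decode)
        z1, z2 = self._split(z)
        log_s, t = self.net(torch.cat([z1, s], -1)).chunk(2, -1)
        log_s = torch.tanh(log_s)                 # keep scales bounded
        return self._merge(z1, z2 * log_s.exp() + t), log_s.sum(-1)

    def inverse(self, x, s):                      # action -> latent (encode)
        x1, x2 = self._split(x)
        log_s, t = self.net(torch.cat([x1, s], -1)).chunk(2, -1)
        log_s = torch.tanh(log_s)
        return self._merge(x1, (x2 - t) * (-log_s).exp()), -log_s.sum(-1)

class FlowActionEncoder(nn.Module):
    """Invertible state-conditioned map between latents and actions."""
    def __init__(self, dim, cond_dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [Coupling(dim, cond_dim, flip=bool(i % 2)) for i in range(n_layers)])

    def decode(self, z, s):
        for layer in self.layers:
            z, _ = layer(z, s)
        return z                                  # action squashing omitted

    def log_prob(self, a, s):                     # exact likelihood of dataset actions
        log_det = torch.zeros(a.shape[0])
        for layer in reversed(self.layers):
            a, ld = layer.inverse(a, s)
            log_det = log_det + ld
        base = torch.distributions.Normal(0.0, 1.0)
        return base.log_prob(a).sum(-1) + log_det

# Stage 1: supervised pre-training of the flow by maximum likelihood.
flow = FlowActionEncoder(ACTION_DIM, STATE_DIM)
flow_opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
states = torch.randn(256, STATE_DIM)              # stand-in for an offline batch
actions = torch.randn(256, ACTION_DIM)
nll = -flow.log_prob(actions, states).mean()
flow_opt.zero_grad(); nll.backward(); flow_opt.step()

# Stage 2: a latent-space "controller" trained with RL; every emitted action is
# decoded through the pre-trained flow, so it stays close to the dataset support.
latent_policy = mlp(STATE_DIM, LATENT_DIM)
critic = mlp(STATE_DIM + ACTION_DIM, 1)
pi_opt = torch.optim.Adam(latent_policy.parameters(), lr=3e-4)
z = torch.clamp(latent_policy(states), -2.0, 2.0)  # bounded latents (assumption)
a = flow.decode(z, states)
actor_loss = -critic(torch.cat([states, a], -1)).mean()
pi_opt.zero_grad(); actor_loss.backward(); pi_opt.step()
# (the critic's TD update on dataset transitions is omitted for brevity)
```

Because the controller can only reach actions through an invertible map fit to the dataset, conservatism in this sketch comes from the generative model itself rather than from an explicit penalty on out-of-dataset actions, matching the claim in the abstract.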
Related papers
- Aligning Diffusion Behaviors with Q-functions for Efficient Continuous Control [25.219524290912048]
We formulate offline Reinforcement Learning as a two-stage optimization problem.
First, we pretrain expressive generative policies on reward-free behavior datasets, then fine-tune these policies to align with task-specific annotations like Q-values.
This strategy allows us to leverage abundant and diverse behavior data to enhance generalization and enable rapid adaptation to downstream tasks using minimal annotations.
arXiv Detail & Related papers (2024-07-12T06:32:36Z)
- Offline Reinforcement Learning from Datasets with Structured Non-Stationarity [50.35634234137108]
Current Reinforcement Learning (RL) is often limited by the large amount of data needed to learn a successful policy.
We address a novel Offline RL problem setting in which, while collecting the dataset, the transition and reward functions gradually change between episodes but stay constant within each episode.
We propose a method based on Contrastive Predictive Coding that identifies this non-stationarity in the offline dataset, accounts for it when training a policy, and predicts it during evaluation.
arXiv Detail & Related papers (2024-05-23T02:41:36Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks from the D4RL benchmark.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- Latent-Variable Advantage-Weighted Policy Optimization for Offline RL [70.01851346635637]
Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions.
In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios.
We propose to leverage latent-variable policies that can represent a broader class of policy distributions.
Our method improves the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets.
arXiv Detail & Related papers (2022-03-16T21:17:03Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
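(A minimal sketch of this in-dataset value update appears after the related-papers list below.)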
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- PLAS: Latent Action Space for Offline Reinforcement Learning [18.63424441772675]
The goal of offline reinforcement learning is to learn a policy from a fixed dataset, without further interactions with the environment.
Existing off-policy algorithms have limited performance on static datasets due to extrapolation errors from out-of-distribution actions.
We demonstrate that our method provides competitive performance consistently across various continuous control tasks and different types of datasets.
arXiv Detail & Related papers (2020-11-14T03:38:38Z)
- Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning [25.099754758455415]
Off-policy reinforcement learning algorithms promise to be applicable in settings where only a fixed dataset of environment interactions is available.
Standard off-policy algorithms fail in the batch setting for continuous control.
arXiv Detail & Related papers (2020-02-19T19:21:08Z)
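The Implicit Q-Learning entry above claims policy improvement without ever evaluating out-of-dataset actions; the core trick is expectile regression of a state-value function against Q-values taken only at dataset actions. Below is a minimal sketch of that value update, assuming PyTorch; the network sizes, tau=0.7, and the random stand-in batch are assumptions, and the Q-function's TD update and the advantage-weighted policy extraction step are omitted.

```python
# Expectile-regression value update in the spirit of Implicit Q-Learning:
# V(s) is fit only against Q(s, a) at actions stored in the dataset, so no
# out-of-dataset action is ever evaluated.
import torch
import torch.nn as nn

def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss; tau > 0.5 pushes V toward an upper expectile of Q."""
    weight = torch.where(diff > 0, torch.full_like(diff, tau), torch.full_like(diff, 1 - tau))
    return (weight * diff.pow(2)).mean()

state_dim, action_dim = 17, 6                      # placeholder sizes
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, 1))
v_net = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, 1))
v_opt = torch.optim.Adam(v_net.parameters(), lr=3e-4)

s = torch.randn(256, state_dim)                    # stand-in offline batch
a = torch.randn(256, action_dim)                   # actions stored in the dataset
with torch.no_grad():
    q_sa = q_net(torch.cat([s, a], -1))            # Q queried only at dataset actions
v_loss = expectile_loss(q_sa - v_net(s))
v_opt.zero_grad(); v_loss.backward(); v_opt.step()
```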