Near Real-World Benchmarks for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2102.00714v1
- Date: Mon, 1 Feb 2021 09:19:10 GMT
- Title: Near Real-World Benchmarks for Offline Reinforcement Learning
- Authors: Rongjun Qin, Songyi Gao, Xingyuan Zhang, Zhen Xu, Shengkai Huang,
Zewen Li, Weinan Zhang, Yang Yu
- Abstract summary: We present a suite of near real-world benchmarks, NewRL.
NewRL contains datasets from various domains with controlled sizes and extra test datasets for the purpose of policy validation.
We argue that the performance of a policy should also be compared with the deterministic version of the behavior policy, instead of the dataset reward.
- Score: 26.642722521820467
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline reinforcement learning (RL) aims at learning an optimal policy from a
batch of collected data, without extra interactions with the environment during
training. Offline RL alleviates hazardous executions in the environment and can
thus greatly broaden the scope of RL applications.
However, current offline RL benchmarks commonly have a large reality gap. They
involve large datasets collected by highly exploratory policies, and a trained
policy is directly evaluated in the environment. Meanwhile, in real-world
situations, running a highly exploratory policy is prohibited to ensure system
safety, the data is commonly very limited, and a trained policy should be well
validated before deployment. In this paper, we present a suite of near
real-world benchmarks, NewRL. NewRL contains datasets from various domains with
controlled sizes and extra test datasets for the purpose of policy validation.
We then evaluate existing offline RL algorithms on NewRL. In the experiments,
we argue that the performance of a policy should also be compared with the
deterministic version of the behavior policy, instead of the dataset reward.
The deterministic behavior policy is the baseline that would be deployed in real
scenarios, whereas the dataset is often collected with action perturbations that
can degrade the recorded performance. The empirical results demonstrate that, on
many datasets, the tested offline RL algorithms are merely comparable to this
deterministic policy, and that offline policy evaluation hardly helps. The NewRL
suite can be found at http://polixir.ai/research/newrl. We hope this work will
shed some light on future research and draw more attention to the deployment of
RL in real-world systems.
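To make this evaluation protocol concrete, the sketch below rolls out both an offline-trained policy and the deterministic version of the behavior policy, and compares their average returns rather than the average reward stored in the dataset. It is a minimal illustration only, not the paper's code: `env`, `offline_policy`, and `behavior_policy` are assumed to exist, the environment is assumed to follow a Gymnasium-style `reset`/`step` interface, and the `deterministic=True` flag is a hypothetical convention for switching off exploration noise.

```python
import numpy as np

def average_return(env, policy, n_episodes=100, seed=0):
    """Mean episodic return of `policy` rolled out in `env`.

    Assumes a Gymnasium-style interface: env.reset() -> (obs, info),
    env.step(a) -> (obs, reward, terminated, truncated, info).
    """
    returns = []
    for ep in range(n_episodes):
        obs, _ = env.reset(seed=seed + ep)
        done, total = False, 0.0
        while not done:
            # Hypothetical flag: take the mean/greedy action instead of
            # sampling with exploration noise.
            action = policy(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return float(np.mean(returns))

# The baseline is the deterministic behavior policy, not the dataset reward:
# logged trajectories were collected with action perturbations, so the average
# return in the dataset understates what the behavior policy achieves when
# deployed without noise.
score_offline = average_return(env, offline_policy)
score_behavior = average_return(env, behavior_policy)
print(f"offline policy: {score_offline:.1f}, "
      f"deterministic behavior policy: {score_behavior:.1f}, "
      f"improvement: {score_offline - score_behavior:.1f}")
```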
Related papers
- Is Value Learning Really the Main Bottleneck in Offline RL? [70.54708989409409]
We show that the choice of a policy extraction algorithm significantly affects the performance and scalability of offline RL.
We propose two simple test-time policy improvement methods and show that these methods lead to better performance.
arXiv Detail & Related papers (2024-06-13T17:07:49Z)
- Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate Exploration Bias [96.14064037614942]
Offline retraining, a policy extraction step at the end of online fine-tuning, is proposed.
An optimistic (exploration) policy is used to interact with the environment, and a separate pessimistic (exploitation) policy is trained on all the observed data for evaluation.
arXiv Detail & Related papers (2023-10-12T17:50:09Z)
- Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets [53.8218145723718]
Offline policy learning aims at learning decision-making policies from existing datasets of trajectories without collecting additional data.
We argue that when a dataset is dominated by suboptimal trajectories, state-of-the-art offline RL algorithms do not substantially improve over the average return of trajectories in the dataset.
We present a realization of the sampling strategy and an algorithm that can be used as a plug-and-play module in standard offline RL algorithms.
arXiv Detail & Related papers (2023-10-06T17:58:14Z)
- Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints [82.43359506154117]
We show that typical offline reinforcement learning methods fail to learn from data with non-uniform variability.
Our method is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
arXiv Detail & Related papers (2022-11-02T11:36:06Z)
- POPO: Pessimistic Offline Policy Optimization [6.122342691982727]
We study, from the value-function view, why off-policy RL methods fail to learn in the offline setting.
We propose Pessimistic Offline Policy Optimization (POPO), which learns a pessimistic value function to get a strong policy.
We find that POPO performs surprisingly well and scales to tasks with high-dimensional state and action space.
arXiv Detail & Related papers (2020-12-26T06:24:34Z)
- Critic Regularized Regression [70.8487887738354]
We propose a novel offline RL algorithm that learns policies from data using a form of critic-regularized regression (CRR).
We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces.
arXiv Detail & Related papers (2020-06-26T17:50:26Z)
- RL Unplugged: A Suite of Benchmarks for Offline Reinforcement Learning [108.9599280270704]
We propose a benchmark called RL Unplugged to evaluate and compare offline RL methods.
RL Unplugged includes data from a diverse range of domains including games and simulated motor control problems.
We will release data for all our tasks and open-source all algorithms presented in this paper.
arXiv Detail & Related papers (2020-06-24T17:14:51Z)