Abstract: Offline reinforcement learning (RL) aims to learn an optimal policy from a batch of previously collected data, without any extra interaction with the environment during training. By avoiding hazardous executions in the environment, offline RL can greatly broaden the scope of RL applications.
However, current offline RL benchmarks commonly have a large reality gap. They
involve large datasets collected by highly exploratory policies, and a trained
policy is directly evaluated in the environment. In real-world situations, by contrast, running a highly exploratory policy is prohibited to ensure system safety, the available data is commonly very limited, and a trained policy should be well validated before deployment. In this paper, we present a suite of near real-world benchmarks, NewRL. NewRL contains datasets from various domains with controlled sizes, as well as extra test datasets for policy validation.
We then evaluate existing offline RL algorithms on NewRL. In the experiments,
we argue that the performance of a policy should also be compared with the deterministic version of the behavior policy, rather than with the dataset reward, because the deterministic behavior policy is the natural baseline in real scenarios, while the dataset is often collected with action perturbations that can degrade performance. The empirical results demonstrate that the tested
offline RL algorithms are at best only competitive with this deterministic policy on many datasets, and that offline policy evaluation hardly helps. The NewRL suite can be found at http://polixir.ai/research/newrl. We hope this work will shed some light on future research and draw more attention to the challenges of deploying RL in real-world systems.