Hyperparameter Selection for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2007.09055v1
- Date: Fri, 17 Jul 2020 15:30:38 GMT
- Title: Hyperparameter Selection for Offline Reinforcement Learning
- Authors: Tom Le Paine, Cosmin Paduraru, Andrea Michi, Caglar Gulcehre, Konrad
Zolna, Alexander Novikov, Ziyu Wang, Nando de Freitas
- Abstract summary: Offline reinforcement learning (RL purely from logged data) is an important avenue for deploying RL techniques in real-world scenarios.
Existing hyperparameter selection methods for offline RL break the offline assumption.
- Score: 61.92834684647419
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline reinforcement learning (RL purely from logged data) is an important
avenue for deploying RL techniques in real-world scenarios. However, existing
hyperparameter selection methods for offline RL break the offline assumption by
evaluating policies corresponding to each hyperparameter setting in the
environment. This online execution is often infeasible and hence undermines the
main aim of offline RL. Therefore, in this work, we focus on \textit{offline
hyperparameter selection}, i.e. methods for choosing the best policy from a set
of many policies trained using different hyperparameters, given only logged
data. Through large-scale empirical evaluation we show that: 1) offline RL
algorithms are not robust to hyperparameter choices, 2) factors such as the
offline RL algorithm and method for estimating Q values can have a big impact
on hyperparameter selection, and 3) when we control those factors carefully, we
can reliably rank policies across hyperparameter choices, and therefore choose
policies which are close to the best policy in the set. Overall, our results
present an optimistic view that offline hyperparameter selection is within
reach, even in challenging tasks with pixel observations, high dimensional
action spaces, and long horizon.
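The selection step the abstract argues for can be made concrete in a few lines. Below is a minimal sketch in Python, assuming each candidate policy (one per hyperparameter setting) comes paired with an off-policy Q-value estimate fit to the logged data (for example, in the style of fitted Q evaluation); the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def ope_score(policy, q_estimate, logged_states):
    """Average estimated Q-value of the actions the candidate policy
    would take at states drawn from the logged dataset."""
    actions = [policy(s) for s in logged_states]
    values = [q_estimate(s, a) for s, a in zip(logged_states, actions)]
    return float(np.mean(values))

def select_policy(policies, q_estimates, logged_states):
    """Rank candidate policies (one per hyperparameter setting) by their
    offline scores and return the highest-ranked one, with no environment
    interaction at any point."""
    scores = [ope_score(pi, q, logged_states)
              for pi, q in zip(policies, q_estimates)]
    best = int(np.argmax(scores))
    return policies[best], scores
```

Under this reading, finding 3) of the abstract says that when the offline RL algorithm and the method for estimating Q values are chosen carefully, the ranking induced by such scores tracks the true ranking closely enough to pick a near-best policy.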
Related papers
- AutoRL Hyperparameter Landscapes [69.15927869840918]
Reinforcement Learning (RL) has been shown to be capable of producing impressive results, but its use is limited by the impact of its hyperparameters on performance.
We propose an approach to build and analyze these hyperparameter landscapes not just for one point in time but at multiple points in time throughout training.
This supports the theory that hyperparameters should be dynamically adjusted during training, and shows the potential for further insights into AutoRL problems that can be gained through landscape analyses.
arXiv Detail & Related papers (2023-04-05T12:14:41Z)
- On Instance-Dependent Bounds for Offline Reinforcement Learning with Linear Function Approximation [80.86358123230757]
We present an algorithm called Bootstrapped and Constrained Pessimistic Value Iteration (BCP-VI).
Under a partial data coverage assumption, BCP-VI yields a fast rate of $\tilde{\mathcal{O}}(\frac{1}{K})$ for offline RL when there is a positive gap in the optimal Q-value functions.
These are the first $\tilde{\mathcal{O}}(\frac{1}{K})$ bound and absolute zero sub-optimality bound respectively for offline RL with linear function approximation from adaptive data.
arXiv Detail & Related papers (2022-11-23T18:50:44Z)
- Data-Efficient Pipeline for Offline Reinforcement Learning with Limited Data [28.846826115837825]
Offline reinforcement learning can be used to improve future performance by leveraging historical data.
We introduce a task- and method-agnostic pipeline for automatically training, comparing, selecting, and deploying the best policy.
We show it can have a substantial impact when the dataset is small.
arXiv Detail & Related papers (2022-10-16T21:24:53Z)
- Offline RL Policies Should be Trained to be Adaptive [89.8580376798065]
We show that acting optimally in offline RL in a Bayesian sense involves solving an implicit POMDP.
As a result, optimal policies for offline RL must be adaptive, depending not just on the current state but on all the transitions seen so far during evaluation.
We present a model-free algorithm for approximating this optimal adaptive policy, and demonstrate the efficacy of learning such adaptive policies in offline RL benchmarks.
arXiv Detail & Related papers (2022-07-05T17:58:33Z)
- No More Pesky Hyperparameters: Offline Hyperparameter Tuning for RL [28.31529154045046]
We propose a new approach to tune hyperparameters from offline logs of data.
We first learn a model of the environment from the offline data, which we call a calibration model, and then simulate learning in the calibration model.
We empirically investigate the method in a variety of settings to identify when it is effective and when it fails.
arXiv Detail & Related papers (2022-05-18T04:26:23Z)
- A Theoretical Framework of Almost Hyperparameter-free Hyperparameter Selection Methods for Offline Policy Evaluation [2.741266294612776]
Offline policy evaluation (OPE) is a core technology for data-driven decision optimization without environment simulators.
We introduce a new approximate hyperparameter selection (AHS) framework for OPE, which defines a notion of optimality (called selection criteria) in a quantitative and interpretable manner.
We derive four AHS methods each of which has different characteristics such as convergence rate and time complexity.
arXiv Detail & Related papers (2022-01-07T02:23:09Z)
- Pessimistic Model Selection for Offline Deep Reinforcement Learning [56.282483586473816]
Deep Reinforcement Learning (DRL) has demonstrated great potential in solving sequential decision making problems in many applications.
One main barrier is the over-fitting issue that leads to poor generalizability of the policy learned by DRL.
We propose a pessimistic model selection (PMS) approach for offline DRL with a theoretical guarantee.
arXiv Detail & Related papers (2021-11-29T06:29:49Z)
- OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z)
- POPO: Pessimistic Offline Policy Optimization [6.122342691982727]
We study why off-policy RL methods fail to learn in the offline setting from the value-function perspective.
We propose Pessimistic Offline Policy Optimization (POPO), which learns a pessimistic value function to obtain a strong policy; a generic sketch of this kind of pessimistic target is given after this list.
We find that POPO performs surprisingly well and scales to tasks with high-dimensional state and action spaces.
arXiv Detail & Related papers (2020-12-26T06:24:34Z)
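Several of the entries above (POPO, Pessimistic Model Selection) lean on the same idea of pessimism: penalizing value estimates on state-action pairs that the logged data covers poorly. The sketch below illustrates one common way to realize this, a bootstrapped TD target penalized by the disagreement of a critic ensemble; the ensemble-based penalty and all names used are illustrative assumptions, not the exact construction of any paper listed here.

```python
import torch

def pessimistic_td_target(q_ensemble, reward, next_state, next_action,
                          discount=0.99, beta=1.0):
    """Bootstrapped TD target penalized by critic-ensemble disagreement.

    q_ensemble: list of critics, each mapping (state, action) -> value tensor.
    beta: pessimism coefficient; larger values penalize more heavily the
    state-action pairs on which the ensemble disagrees, i.e. those the
    logged data supports weakly.
    """
    with torch.no_grad():
        next_qs = torch.stack([q(next_state, next_action) for q in q_ensemble])
        mean_q = next_qs.mean(dim=0)
        std_q = next_qs.std(dim=0)
        # Lower-confidence-bound style target: back up a pessimistic value
        # wherever the ensemble members disagree.
        return reward + discount * (mean_q - beta * std_q)
```

The exact form of the penalty and where it enters training differ across these methods; the shared structure is only the lower-confidence-bound target.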
This list is automatically generated from the titles and abstracts of the papers on this site.