Towards Hyperparameter-free Policy Selection for Offline Reinforcement
Learning
- URL: http://arxiv.org/abs/2110.14000v2
- Date: Thu, 28 Oct 2021 02:06:13 GMT
- Title: Towards Hyperparameter-free Policy Selection for Offline Reinforcement
Learning
- Authors: Siyuan Zhang, Nan Jiang
- Abstract summary: We show how to select between policies and value functions produced by different training algorithms in offline reinforcement learning.
We design hyperparameter-free algorithms based on BVFT [XJ21], a recent theoretical advance in value-function selection, and demonstrate their effectiveness in discrete-action benchmarks such as Atari.
- Score: 10.457660611114457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How to select between policies and value functions produced by different
training algorithms in offline reinforcement learning (RL) -- which is crucial
for hyperparameter tuning -- is an important open question. Existing
approaches based on off-policy evaluation (OPE) often require additional
function approximation and hence hyperparameters, creating a chicken-and-egg
situation. In this paper, we design hyperparameter-free algorithms for policy
selection based on BVFT [XJ21], a recent theoretical advance in value-function
selection, and demonstrate their effectiveness in discrete-action benchmarks
such as Atari. To address performance degradation due to poor critics in
continuous-action domains, we further combine BVFT with OPE to get the best of
both worlds, and obtain a hyperparameter-tuning method for Q-function based OPE
with theoretical guarantees as a side product.
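For intuition, the BVFT (batch value-function tournament) procedure behind these algorithms roughly scores each candidate Q-function by its worst-case projected Bellman error over data partitions induced by pairing it with every other candidate, and selects the candidate with the smallest score. The Python sketch below is a heavily simplified illustration under assumptions not spelled out in the abstract (discrete actions, a single discretization resolution, candidates given as callables q(s) returning a vector of action values); it is not the authors' reference implementation.

```python
# Minimal sketch of BVFT-style policy/value-function selection from logged data.
# Illustrative simplification only: single resolution, discrete actions,
# hashable states; `candidates`, `dataset`, `resolution` are assumed names.
import numpy as np

def pairwise_loss(q1, q2, dataset, gamma, resolution):
    """Projected Bellman error of q1 on the partition induced by (q1, q2).

    dataset: iterable of transitions (s, a, r, s_next, done);
    q(s) returns a vector of action values.
    """
    # Group transitions into cells on which both candidates are (roughly)
    # constant; this induces BVFT's piecewise-constant function class.
    cells = {}
    for (s, a, r, s_next, done) in dataset:
        key = (int(np.floor(q1(s)[a] / resolution)),
               int(np.floor(q2(s)[a] / resolution)))
        cells.setdefault(key, []).append((s, a, r, s_next, done))

    # Within each cell, compare q1 against the cell-averaged Bellman backup
    # (the projection of the backup onto the piecewise-constant class).
    sq_err, n = 0.0, 0
    for group in cells.values():
        backups = [r + gamma * (0.0 if done else np.max(q1(s_next)))
                   for (_, _, r, s_next, done) in group]
        target = np.mean(backups)
        for (s, a, *_rest) in group:
            sq_err += (q1(s)[a] - target) ** 2
            n += 1
    return np.sqrt(sq_err / max(n, 1))

def bvft_select(candidates, dataset, gamma=0.99, resolution=0.1):
    """Index of the candidate with the smallest worst-case pairwise loss
    (the 'tournament' score)."""
    if len(candidates) < 2:
        return 0  # nothing to compare against
    scores = []
    for i, qi in enumerate(candidates):
        score = max(pairwise_loss(qi, qj, dataset, gamma, resolution)
                    for j, qj in enumerate(candidates) if j != i)
        scores.append(score)
    return int(np.argmin(scores))
```

In the continuous-action setting discussed in the abstract, such a BVFT score is not used on its own but combined with an OPE estimate, which is the "best of both worlds" method the paper proposes.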
Related papers
- On the consistency of hyper-parameter selection in value-based deep reinforcement learning [13.133865673667394]
This paper conducts an empirical study focusing on the reliability of hyper-parameter selection for value-based deep reinforcement learning agents.
Our findings help establish which hyper-parameters are most critical to tune, and help clarify which tunings remain consistent across different training regimes.
arXiv Detail & Related papers (2024-06-25T13:06:09Z)
- Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF [80.32171988565999]
We introduce a unified approach to online and offline RLHF -- value-incentivized preference optimization (VPO).
VPO regularizes the maximum-likelihood estimate of the reward function with the corresponding value function.
Experiments on text summarization and dialog verify the practicality and effectiveness of VPO; a schematic form of the regularized objective is sketched below.
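The following is only an assumed paraphrase of that regularization, not the paper's exact objective or sign convention:

```latex
% Schematic only: \ell_{\mathrm{ML}} is the preference (Bradley-Terry) log-likelihood
% of the reward model, V^{*}_{r} the optimal value under reward r, \lambda > 0 a
% regularization weight; the sign is chosen to encode optimism (online RLHF) or
% pessimism (offline RLHF).
\hat{r} \;=\; \arg\max_{r}\;\; \ell_{\mathrm{ML}}(r) \;\pm\; \lambda\,\mathbb{E}_{x}\!\left[V^{*}_{r}(x)\right]
```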
arXiv Detail & Related papers (2024-05-29T17:51:42Z)
- Stepsize Learning for Policy Gradient Methods in Contextual Markov Decision Processes [35.889129338603446]
Policy-based algorithms are among the most widely adopted techniques in model-free RL.
They tend to struggle when asked to accomplish a series of heterogeneous tasks.
We introduce a new formulation, known as meta-MDP, that can be used to solve any hyperparameter selection problem in RL.
arXiv Detail & Related papers (2023-06-13T12:58:12Z)
- Proximal Point Imitation Learning [48.50107891696562]
We develop new algorithms with rigorous efficiency guarantees for infinite horizon imitation learning.
We leverage classical tools from optimization, in particular, the proximal-point method (PPM) and dual smoothing.
We achieve convincing empirical performance for both linear and neural network function approximation.
arXiv Detail & Related papers (2022-09-22T12:40:21Z)
- Towards Learning Universal Hyperparameter Optimizers with Transformers [57.35920571605559]
We introduce the OptFormer, the first text-based Transformer HPO framework that provides a universal end-to-end interface for jointly learning policy and function prediction.
Our experiments demonstrate that the OptFormer can imitate at least 7 different HPO algorithms, which can be further improved via its function uncertainty estimates.
arXiv Detail & Related papers (2022-05-26T12:51:32Z)
- Bellman Residual Orthogonalization for Offline Reinforcement Learning [53.17258888552998]
We introduce a new reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along a test function space.
We exploit this principle to derive confidence intervals for off-policy evaluation, as well as to optimize over policies within a prescribed policy class; the test-function condition is sketched below.
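A weak-form (Galerkin-style) reading of that principle, written under assumed notation rather than the paper's exact formulation:

```latex
% Q is a candidate value function for target policy \pi, \mathcal{F} a chosen
% test-function class, \gamma the discount, and the expectation is over logged
% transitions (s, a, r, s').
\mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}\!\left[ f(s,a)\,\big(Q(s,a) - r
  - \gamma\,\mathbb{E}_{a'\sim\pi(\cdot\mid s')}[Q(s',a')]\big)\right] = 0
\qquad \text{for all } f \in \mathcal{F}.
```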
arXiv Detail & Related papers (2022-03-24T01:04:17Z)
- A Theoretical Framework of Almost Hyperparameter-free Hyperparameter Selection Methods for Offline Policy Evaluation [2.741266294612776]
Offline policy evaluation (OPE) is a key component of offline reinforcement learning, a core technology for data-driven decision optimization without environment simulators.
We introduce a new approximate hyperparameter selection (AHS) framework for OPE, which defines a notion of optimality (called selection criteria) in a quantitative and interpretable manner.
We derive four AHS methods, each of which has different characteristics such as convergence rate and time complexity.
arXiv Detail & Related papers (2022-01-07T02:23:09Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
- Hyperparameter Selection for Offline Reinforcement Learning [61.92834684647419]
Offline reinforcement learning (RL purely from logged data) is an important avenue for deploying RL techniques in real-world scenarios.
Existing hyperparameter selection methods for offline RL break the offline assumption.
arXiv Detail & Related papers (2020-07-17T15:30:38Z)
- Online Hyper-parameter Tuning in Off-policy Learning via Evolutionary Strategies [41.13416324282365]
We propose a framework which entails the application of Evolutionary Strategies to online hyper-parameter tuning in off-policy learning; a generic ES loop is sketched after this entry.
Our formulation draws close connections to meta-gradients and leverages the strengths of black-box optimization with relatively low-dimensional search spaces.
arXiv Detail & Related papers (2020-06-13T03:54:26Z)
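To make the black-box, low-dimensional search concrete, here is a generic evolutionary-strategies loop for tuning a small hyperparameter vector. Everything here (the `evaluate` callback, antithetic Gaussian perturbations, rank normalization) is an illustrative assumption, not the paper's implementation.

```python
# Generic ES loop for low-dimensional hyperparameter tuning.
# `evaluate(theta)` is a hypothetical black box returning the performance of an
# off-policy learner run with hyperparameters `theta`.
import numpy as np

def es_tune(evaluate, theta0, sigma=0.1, lr=0.05, population=16, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        # Antithetic Gaussian perturbations around the current hyperparameters.
        eps = rng.standard_normal((population, theta.size))
        eps = np.concatenate([eps, -eps])
        scores = np.array([evaluate(theta + sigma * e) for e in eps])
        # Rank-normalize scores so the update is invariant to their scale.
        ranks = scores.argsort().argsort().astype(float)
        weights = (ranks - ranks.mean()) / (ranks.std() + 1e-8)
        # Stochastic gradient-ascent-style step on the perturbation directions.
        theta = theta + lr / (2 * population * sigma) * (weights @ eps)
    return theta
```

A caller would pass `evaluate` as a function that trains or updates the off-policy learner with the given hyperparameters and returns a scalar performance estimate.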