Marginalized Operators for Off-policy Reinforcement Learning
- URL: http://arxiv.org/abs/2203.16177v1
- Date: Wed, 30 Mar 2022 09:59:59 GMT
- Title: Marginalized Operators for Off-policy Reinforcement Learning
- Authors: Yunhao Tang, Mark Rowland, Rémi Munos, Michal Valko
- Abstract summary: Marginalized operators strictly generalize generic multi-step operators, such as Retrace, as special cases.
We show that the estimates for marginalized operators can be computed in a scalable way, which also generalizes prior results on marginalized importance sampling as special cases.
- Score: 53.37381513736073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose marginalized operators, a new class of off-policy
evaluation operators for reinforcement learning. Marginalized operators
strictly generalize generic multi-step operators, such as Retrace, as special
cases. Marginalized operators also suggest a form of sample-based estimates
with potential variance reduction, compared to sample-based estimates of the
original multi-step operators. We show that the estimates for marginalized
operators can be computed in a scalable way, which also generalizes prior
results on marginalized importance sampling as special cases. Finally, we
empirically demonstrate that marginalized operators provide performance gains
to off-policy evaluation and downstream policy optimization algorithms.
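The abstract leaves the estimators implicit. As a rough, non-authoritative sketch of the contrast it points to, the snippet below compares an ordinary per-decision importance-sampling return estimate, whose correction is a product of per-step likelihood ratios (multi-step operators such as Retrace clip these ratios), with a marginalized importance-sampling estimate that weights each transition by a single occupancy ratio w(s, a) ~ d_pi(s, a) / d_mu(s, a). The trajectory format, the ratio function w, and all names are assumptions for illustration, not the paper's estimators.

```python
import numpy as np

def per_decision_is_estimate(trajs, pi, mu, gamma=0.99):
    """Per-decision importance sampling: the off-policy correction is a running
    product of likelihood ratios pi(a|s)/mu(a|s), whose variance can grow
    multiplicatively with the horizon."""
    values = []
    for states, actions, rewards in trajs:
        rho, total = 1.0, 0.0
        for t, (s, a, r) in enumerate(zip(states, actions, rewards)):
            rho *= pi(a, s) / mu(a, s)      # product of per-step ratios
            total += (gamma ** t) * rho * r
        values.append(total)
    return float(np.mean(values))

def marginalized_is_estimate(states, actions, rewards, w, gamma=0.99):
    """Marginalized importance sampling: each logged transition is weighted by a
    single density ratio w(s, a) ~ d_pi(s, a) / d_mu(s, a) over the normalized
    discounted occupancy, replacing the product over time steps."""
    return float(np.mean(w(states, actions) * rewards) / (1.0 - gamma))
```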
Related papers
- Consistent Long-Term Forecasting of Ergodic Dynamical Systems [25.46655692714755]
We study the evolution of distributions under the action of an ergodic dynamical system.
By employing tools from Koopman and transfer operator theory one can evolve any initial distribution of the state forward in time.
We introduce a learning paradigm that neatly combines classical techniques of eigenvalue deflation from operator theory and feature centering from statistics.
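Purely as a finite-dimensional illustration of evolving a distribution with a transfer operator (not the paper's method), the sketch below pushes an initial distribution through a made-up Markov transition matrix and separates off the invariant component, which is the part that eigenvalue deflation isolates so that only the decaying remainder has to be forecast.

```python
import numpy as np

# Hypothetical 3-state transfer operator (row-stochastic transition matrix).
P = np.array([[0.90, 0.08, 0.02],
              [0.10, 0.80, 0.10],
              [0.05, 0.15, 0.80]])
mu0 = np.array([1.0, 0.0, 0.0])            # initial distribution

# Invariant distribution: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi_inf = np.real(evecs[:, np.argmax(np.real(evals))])
pi_inf /= pi_inf.sum()

def evolve(mu, n):
    """Push the distribution forward n steps: mu_n = mu_0 @ P^n."""
    return mu @ np.linalg.matrix_power(P, n)

# "Deflation" idea: subtract the invariant part and track the decaying remainder.
mu_50 = evolve(mu0, 50)
print(mu_50, np.linalg.norm(mu_50 - pi_inf))   # remainder -> 0 by ergodicity
```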
arXiv Detail & Related papers (2023-12-20T21:12:19Z)
- Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
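The bound itself cannot be reconstructed from the summary; as a small illustrative stand-in for the quantity it depends on, the snippet below estimates the local Lipschitz constant of a learned predictor around a data point by random perturbations. The predictor f, the radius, and the sample count are assumptions, not the paper's definition.

```python
import numpy as np

def local_lipschitz(f, x, radius=0.1, n_samples=256, seed=0):
    """Monte-Carlo estimate of the local Lipschitz constant of f around x:
    the largest observed |f(x + d) - f(x)| / ||d|| over random perturbations
    with ||d|| <= radius.  f is assumed to map a batch of points of shape
    (n, dim) to scalar predictions of shape (n,)."""
    rng = np.random.default_rng(seed)
    d = rng.normal(size=(n_samples, x.size))
    d *= radius * rng.uniform(size=(n_samples, 1)) / np.linalg.norm(d, axis=1, keepdims=True)
    diffs = np.abs(f(x[None, :] + d) - f(x[None, :]))
    return float(np.max(diffs / np.linalg.norm(d, axis=1)))
```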
arXiv Detail & Related papers (2022-11-02T16:39:42Z)
- Off-policy evaluation for learning-to-rank via interpolating the item-position model and the position-based model [83.83064559894989]
A critical need for industrial recommender systems is the ability to evaluate recommendation policies offline, before deploying them to production.
We develop a new estimator that mitigates the problems of the two most popular off-policy estimators for rankings.
In particular, the new estimator, called INTERPOL, addresses the bias of a potentially misspecified position-based model.
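INTERPOL's exact form is not given in the summary. As a hedged sketch of the standard position-based-model (PBM) importance weighting it builds on, the snippet below reweights logged clicks by the ratio of examination probabilities at the new and logged ranks; the examination probabilities, log format, and names are illustrative assumptions.

```python
def pbm_ips_value(logs, exam_prob, new_rank):
    """Position-based-model IPS estimate of a new ranking's expected clicks.

    logs: iterable of (query, item, logged_rank, click) from the logging policy.
    exam_prob: exam_prob[k] = estimated probability that rank k is examined
               (ranks are 0-based indices).
    new_rank: dict mapping (query, item) -> rank under the new ranking policy.
    Under the PBM, P(click) = exam_prob[rank] * relevance(item), so a logged
    click is reweighted by exam_prob[new rank] / exam_prob[logged rank]."""
    total = 0.0
    queries = set()
    for query, item, k_logged, click in logs:
        queries.add(query)
        k_new = new_rank.get((query, item))
        if k_new is None:                    # item not shown by the new ranking
            continue
        total += click * exam_prob[k_new] / exam_prob[k_logged]
    return total / max(1, len(queries))
```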
arXiv Detail & Related papers (2022-10-15T17:22:30Z)
- Learning Dynamical Systems via Koopman Operator Regression in Reproducing Kernel Hilbert Spaces [52.35063796758121]
We formalize a framework to learn the Koopman operator from finite data trajectories of the dynamical system.
We link the risk with the estimation of the spectral decomposition of the Koopman operator.
Our results suggest that reduced-rank regression (RRR) may outperform other widely used estimators.
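A minimal sketch of the kind of estimator under discussion, under stated assumptions: an extended-DMD-style least-squares fit of the Koopman operator on a user-supplied feature map, with an optional rank truncation standing in (crudely) for reduced-rank regression. This is not the paper's RKHS estimator, and the feature map and regularization are placeholders.

```python
import numpy as np

def fit_koopman(X, Y, feature_map, rank=None, reg=1e-6):
    """Least-squares Koopman estimate K with feature_map(Y) ~= feature_map(X) @ K,
    where rows of X are states x_t and rows of Y the successors x_{t+1}.
    If rank is given, K is truncated to that rank as a crude stand-in for
    reduced-rank regression."""
    Phi_x, Phi_y = feature_map(X), feature_map(Y)
    G = Phi_x.T @ Phi_x + reg * np.eye(Phi_x.shape[1])   # regularized Gram matrix
    K = np.linalg.solve(G, Phi_x.T @ Phi_y)              # ridge solution
    if rank is not None:
        U, s, Vt = np.linalg.svd(K)
        K = (U[:, :rank] * s[:rank]) @ Vt[:rank]         # rank-truncated operator
    return K

def predict_features(K, feature_map, X, steps=1):
    """Push features forward: feature_map(x_{t+steps}) ~= feature_map(x_t) @ K^steps."""
    return feature_map(X) @ np.linalg.matrix_power(K, steps)
```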
arXiv Detail & Related papers (2022-05-27T14:57:48Z)
- Surprise Minimization Revision Operators [7.99536002595393]
We propose a measure of surprise, dubbed relative surprise, in which surprise is computed with respect to the prior belief.
We characterize the surprise minimization revision operator thus defined using a set of intuitive postulates in the AGM mould.
arXiv Detail & Related papers (2021-11-21T20:38:50Z)
- Operator Augmentation for Model-based Policy Evaluation [1.503974529275767]
In model-based reinforcement learning, the transition matrix and reward vector are often estimated from random samples subject to noise.
We introduce an operator augmentation method for reducing the error introduced by the estimated model.
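For context only: the plug-in baseline whose error the paper targets solves the Bellman equation with the estimated model, V_hat = (I - gamma * P_hat)^{-1} r_hat. The sketch below shows that baseline; the augmentation itself, which modifies the inverted operator, is not reproduced here.

```python
import numpy as np

def plugin_policy_evaluation(P_hat, r_hat, gamma=0.95):
    """Plug-in model-based policy evaluation: V_hat = (I - gamma * P_hat)^{-1} r_hat,
    where P_hat is the estimated transition matrix under the evaluated policy and
    r_hat the estimated reward vector.  Noise in P_hat and r_hat propagates into
    V_hat; reducing that error is what operator augmentation is about."""
    n = P_hat.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_hat, r_hat)
```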
arXiv Detail & Related papers (2021-10-25T05:58:49Z)
- Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation [53.83642844626703]
We provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation.
Our framework interprets a number of prior approaches as special cases and elucidates the bias and variance trade-off of Hessian estimates.
arXiv Detail & Related papers (2021-06-24T15:58:01Z)
- Variance Penalized On-Policy and Off-Policy Actor-Critic [60.06593931848165]
We propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return.
Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.
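A hedged sketch of the criterion named above: expected return minus a variance penalty, J = E[G] - psi * Var[G]. Only the criterion is evaluated from sampled returns; the actor-critic machinery that optimizes it is not shown, and psi and the example returns are made up.

```python
import numpy as np

def variance_penalized_objective(returns, psi=0.5):
    """Mean-variance criterion J = E[G] - psi * Var[G] from sampled episode
    returns; larger psi trades expected return for lower return variance."""
    returns = np.asarray(returns, dtype=float)
    return returns.mean() - psi * returns.var()

# Two hypothetical policies with equal mean return but different spread.
print(variance_penalized_objective([10, 10, 10, 10]))   # 10.0
print(variance_penalized_objective([0, 20, 0, 20]))     # 10.0 - 0.5 * 100 = -40.0
```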
arXiv Detail & Related papers (2021-02-03T10:06:16Z)
- The Expected Jacobian Outerproduct: Theory and Empirics [3.172761915061083]
We show that the expected Jacobian outerproduct (EJOP) can be used as a metric to yield improvements in real-world non-parametric classification tasks.
We also show that the estimated EJOP can be used as a metric to yield improvements in metric learning tasks.
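A rough sketch of the quantity in the title, under stated assumptions: the expected Jacobian outer product E[grad f(x) grad f(x)^T], estimated here by central finite differences of a fitted scalar predictor over a sample and then used as a Mahalanobis-style metric. The predictor, data, and step size are placeholders, not the paper's estimator.

```python
import numpy as np

def estimate_ejop(f, X, eps=1e-3):
    """Finite-difference estimate of E[grad f(x) grad f(x)^T] over the rows of X,
    for a scalar-valued predictor f."""
    n, d = X.shape
    M = np.zeros((d, d))
    for x in X:
        grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(d)])            # central differences
        M += np.outer(grad, grad)
    return M / n

def ejop_distance(M, x, y):
    """Distance induced by the (PSD) EJOP estimate: sqrt((x - y)^T M (x - y)),
    which stretches directions along which f varies most."""
    diff = x - y
    return float(np.sqrt(diff @ M @ diff))
```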
arXiv Detail & Related papers (2020-06-05T16:42:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and accepts no responsibility for any consequences of its use.