Unifying Gradient Estimators for Meta-Reinforcement Learning via
Off-Policy Evaluation
- URL: http://arxiv.org/abs/2106.13125v1
- Date: Thu, 24 Jun 2021 15:58:01 GMT
- Title: Unifying Gradient Estimators for Meta-Reinforcement Learning via
Off-Policy Evaluation
- Authors: Yunhao Tang, Tadashi Kozuno, Mark Rowland, Rémi Munos, Michal Valko
- Abstract summary: We provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation.
Our framework interprets a number of prior approaches as special cases and elucidates the bias and variance trade-off of Hessian estimates.
- Score: 53.83642844626703
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model-agnostic meta-reinforcement learning requires estimating the Hessian
matrix of value functions. This is challenging from an implementation
perspective, as repeatedly differentiating policy gradient estimates may lead
to biased Hessian estimates. In this work, we provide a unifying framework for
estimating higher-order derivatives of value functions, based on off-policy
evaluation. Our framework interprets a number of prior approaches as special
cases and elucidates the bias and variance trade-off of Hessian estimates. This
framework also opens the door to a new family of estimates, which can be easily
implemented with auto-differentiation libraries, and lead to performance gains
in practice.
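The bias from repeated differentiation that the abstract mentions can be illustrated on a toy one-step bandit. The sketch below is an illustration under stated assumptions (softmax bandit, JAX autodiff, illustrative function names), not the estimator family from the paper: twice differentiating the usual REINFORCE surrogate drops the score-function outer-product term of the Hessian, whereas an importance-weighted surrogate (the off-policy-evaluation view of the same samples) equals V(theta) in expectation for every theta, so all of its autodiff derivatives are unbiased.

```python
# Minimal, illustrative sketch: compare Hessian estimates obtained by
# twice differentiating two surrogates of the same sampled data.
import jax
import jax.numpy as jnp

rewards = jnp.array([1.0, 0.5, -0.2])        # toy per-action rewards

def policy(theta):
    return jax.nn.softmax(theta)             # pi_theta over 3 actions

def true_value(theta):
    return jnp.dot(policy(theta), rewards)   # exact V(theta)

theta = jnp.array([0.3, -0.1, 0.2])
behavior = policy(theta)                     # behavior policy mu, held fixed below
actions = jax.random.choice(jax.random.PRNGKey(0), 3,
                            shape=(100_000,), p=behavior)

def reinforce_surrogate(theta):
    # E_mu[log pi_theta(a) r(a)]: unbiased gradient, but a biased Hessian.
    return jnp.mean(jnp.log(policy(theta))[actions] * rewards[actions])

def ope_surrogate(theta):
    # E_mu[(pi_theta(a) / mu(a)) r(a)] = V(theta) for all theta, so its
    # gradient and Hessian are both unbiased (mu is a fixed constant here).
    w = policy(theta)[actions] / behavior[actions]
    return jnp.mean(w * rewards[actions])

H_true = jax.hessian(true_value)(theta)
H_pg   = jax.hessian(reinforce_surrogate)(theta)
H_ope  = jax.hessian(ope_surrogate)(theta)

print(jnp.max(jnp.abs(H_pg - H_true)))   # systematic bias, does not vanish with N
print(jnp.max(jnp.abs(H_ope - H_true)))  # only Monte Carlo noise
```

The gap in the first print reflects the missing score-function outer-product term; the paper's framework generalizes this off-policy-evaluation view to multi-step MDPs and to the bias and variance trade-offs of different Hessian estimators.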
Related papers
- Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy.
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of the proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z) - Off-policy evaluation for learning-to-rank via interpolating the
item-position model and the position-based model [83.83064559894989]
A critical need for industrial recommender systems is the ability to evaluate recommendation policies offline, before deploying them to production.
We develop a new estimator that mitigates the problems of the two most popular off-policy estimators for rankings.
In particular, the new estimator, called INTERPOL, addresses the bias of a potentially misspecified position-based model.
arXiv Detail & Related papers (2022-10-15T17:22:30Z) - An Analysis of Measure-Valued Derivatives for Policy Gradients [37.241788708646574]
We study a different type of gradient estimator: the Measure-Valued Derivative.
This estimator is unbiased, has low variance, and can be used with differentiable and non-differentiable function approximators.
We show that it can reach comparable performance with methods based on the likelihood-ratio or reparametrization tricks.
arXiv Detail & Related papers (2022-03-08T08:26:31Z) - Off-Policy Evaluation for Large Action Spaces via Embeddings [36.42838320396534]
Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems.
Existing OPE estimators degrade severely when the number of actions is large.
We propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space.
arXiv Detail & Related papers (2022-02-13T14:00:09Z) - Estimation Error Correction in Deep Reinforcement Learning for
Deterministic Actor-Critic Methods [0.0]
In value-based deep reinforcement learning methods, approximation of value functions induces overestimation bias and leads to suboptimal policies.
We show that, in deep actor-critic methods that aim to overcome the overestimation bias, high variance in the reinforcement signals received by the agent induces a significant underestimation bias.
To minimize this underestimation, we introduce a novel, parameter-free deep Q-learning variant.
arXiv Detail & Related papers (2021-09-22T13:49:35Z) - Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z) - Taylor Expansion of Discount Factors [56.46324239692532]
In practical reinforcement learning (RL), the discount factor used for estimating value functions often differs from that used for defining the evaluation objective.
In this work, we study the effect that this discrepancy of discount factors has during learning, and discover a family of objectives that interpolate value functions of two distinct discount factors.
arXiv Detail & Related papers (2021-06-11T05:02:17Z) - Bootstrapping Statistical Inference for Off-Policy Evaluation [43.79456564713911]
We study the use of bootstrapping in off-policy evaluation (OPE).
We propose a bootstrapping FQE method for inferring the distribution of the policy evaluation error and show that this method is efficient and consistent for off-policy statistical inference.
We evaluate the bootstrapping method in classical RL environments for confidence interval estimation, estimating the variance of an off-policy evaluator, and estimating the correlation between multiple off-policy evaluators (a minimal sketch of this bootstrap-the-estimator idea is given after this list).
arXiv Detail & Related papers (2021-02-06T16:45:33Z) - Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z) - Off-Policy Evaluation via the Regularized Lagrangian [110.28927184857478]
The recently proposed distribution correction estimation (DICE) family of estimators has advanced the state of the art in off-policy evaluation from behavior-agnostic data.
In this paper, we unify these estimators as regularized Lagrangians of the same linear program.
We find that dual solutions offer greater flexibility in navigating the tradeoff between stability and estimation bias, and generally provide superior estimates in practice.
arXiv Detail & Related papers (2020-07-07T13:45:56Z)
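As a companion to the "Bootstrapping Statistical Inference for Off-Policy Evaluation" entry above, the following is a minimal sketch of the bootstrap-the-estimator idea: refit a simple tabular fitted-Q evaluation (FQE) on episode-level resamples of the logged data and read off a percentile interval. The toy MDP, the tabular FQE, and the interval construction are illustrative assumptions, not the paper's algorithm.

```python
# Illustrative only: bootstrap a tabular FQE estimate of a target policy's
# value from logged data; the environment and policies are assumptions.
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GAMMA, HORIZON = 3, 2, 0.9, 20

P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))  # P[s, a] -> dist over s'
REW = rng.normal(size=(N_STATES, N_ACTIONS))                      # deterministic rewards

def collect_episodes(n_episodes):
    """Log (s, a, r, s') tuples under a uniformly random behavior policy."""
    episodes = []
    for _ in range(n_episodes):
        s, traj = 0, []
        for _ in range(HORIZON):
            a = int(rng.integers(N_ACTIONS))
            s2 = int(rng.choice(N_STATES, p=P[s, a]))
            traj.append((s, a, REW[s, a], s2))
            s = s2
        episodes.append(traj)
    return episodes

def fqe_value(episodes, target_policy, n_iters=100):
    """Tabular FQE: regress Q towards r + gamma * Q(s', pi(s')); return V_pi at s0 = 0."""
    data = [t for ep in episodes for t in ep]
    S  = np.array([t[0] for t in data]); A  = np.array([t[1] for t in data])
    R  = np.array([t[2] for t in data]); S2 = np.array([t[3] for t in data])
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(n_iters):
        y = R + GAMMA * Q[S2, target_policy[S2]]     # Bellman targets under pi
        sums, counts = np.zeros_like(Q), np.zeros_like(Q)
        np.add.at(sums, (S, A), y)
        np.add.at(counts, (S, A), 1.0)
        Q = sums / np.maximum(counts, 1.0)           # mean target per (s, a)
    return Q[0, target_policy[0]]

episodes = collect_episodes(200)
pi = np.array([0, 1, 0])                             # fixed deterministic target policy

point = fqe_value(episodes, pi)
boot = [fqe_value([episodes[i] for i in rng.integers(len(episodes), size=len(episodes))], pi)
        for _ in range(200)]                         # resample whole episodes
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"FQE estimate {point:.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```

Resampling whole episodes rather than individual transitions keeps within-trajectory correlations intact, which is a common choice for episodic logged data.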
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.