Off-Policy Evaluation via the Regularized Lagrangian
- URL: http://arxiv.org/abs/2007.03438v2
- Date: Fri, 24 Jul 2020 21:32:33 GMT
- Title: Off-Policy Evaluation via the Regularized Lagrangian
- Authors: Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, Dale Schuurmans
- Abstract summary: The recently proposed distribution correction estimation (DICE) family of estimators has advanced the state of the art in off-policy evaluation from behavior-agnostic data.
In this paper, we unify these estimators as regularized Lagrangians of the same linear program.
We find that dual solutions offer greater flexibility in navigating the tradeoff between stability and estimation bias, and generally provide superior estimates in practice.
- Score: 110.28927184857478
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recently proposed distribution correction estimation (DICE) family of
estimators has advanced the state of the art in off-policy evaluation from
behavior-agnostic data. While these estimators all perform some form of
stationary distribution correction, they arise from different derivations and
objective functions. In this paper, we unify these estimators as regularized
Lagrangians of the same linear program. The unification allows us to expand the
space of DICE estimators to new alternatives that demonstrate improved
performance. More importantly, by analyzing the expanded space of estimators
both mathematically and empirically we find that dual solutions offer greater
flexibility in navigating the tradeoff between optimization stability and
estimation bias, and generally provide superior estimates in practice.
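As orientation for the unification described above, here is a minimal sketch, in our own notation rather than the paper's, of the Q-LP for the policy value and of the Lagrangian whose regularized variants give rise to the DICE estimators.

```latex
% Policy value rho(pi) as the optimum of a linear program over Q(s,a):
\rho(\pi) = \min_{Q}\; (1-\gamma)\, \mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi}\big[Q(s_0,a_0)\big]
\quad \text{s.t.} \quad
Q(s,a) \ge R(s,a) + \gamma\, \mathbb{E}_{s' \sim T(\cdot \mid s,a),\, a' \sim \pi}\big[Q(s',a')\big] \;\; \forall (s,a).

% Lagrangian with dual variable zeta(s,a), interpretable as the stationary
% distribution correction d^pi(s,a) / d^D(s,a) over the off-policy data d^D:
L(Q,\zeta) = (1-\gamma)\, \mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi}\big[Q(s_0,a_0)\big]
+ \mathbb{E}_{(s,a,r,s') \sim d^D}\Big[\zeta(s,a)\big(r + \gamma\, \mathbb{E}_{a' \sim \pi}\big[Q(s',a')\big] - Q(s,a)\big)\Big].
```

Roughly speaking, adding convex regularizers to Q and/or the dual variable in this saddle point recovers the different DICE objectives, and reading the policy value off the dual variable (a correction-weighted average of logged rewards) rather than off Q is the "dual solution" route the abstract finds generally superior.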
Related papers
- Primal-Dual Spectral Representation for Off-policy Evaluation [39.24759979398673]
Off-policy evaluation (OPE) is one of the most fundamental problems in reinforcement learning (RL).
We show that our algorithm, SpectralDICE, is both computationally and sample efficient; its performance is supported by a rigorous theoretical sample complexity guarantee and a thorough empirical evaluation on various benchmarks.
arXiv Detail & Related papers (2024-10-23T03:38:31Z)
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z)
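To make the idea of a baseline correction concrete, here is a minimal sketch of a baseline-corrected inverse-propensity-scoring (IPS) estimator; the variable names and toy data are ours, and the paper's closed-form variance-optimal baseline is not reproduced here.

```python
import numpy as np

def ips_with_baseline(rewards, target_probs, logging_probs, baseline=0.0):
    """Baseline-corrected IPS estimate of a target policy's expected reward
    from logged bandit feedback.  Subtracting a baseline b and adding it back
    keeps the estimate unbiased (E[w] = 1 under standard support assumptions)
    while a well-chosen b reduces variance.
    """
    w = target_probs / logging_probs          # importance weights pi_e / pi_0
    return float(np.mean(w * (rewards - baseline)) + baseline)

# Toy usage with synthetic logged data (purely illustrative).
rng = np.random.default_rng(0)
n = 10_000
logging_probs = rng.uniform(0.2, 0.8, size=n)
target_probs = np.clip(logging_probs + rng.normal(0, 0.1, size=n), 0.05, 0.95)
rewards = rng.binomial(1, 0.5, size=n).astype(float)

print(ips_with_baseline(rewards, target_probs, logging_probs, baseline=0.0))
print(ips_with_baseline(rewards, target_probs, logging_probs, baseline=rewards.mean()))
```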
- Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling [30.835774920236872]
We study off-policy evaluation of contextual bandit policies for large discrete action spaces.
We propose a new estimator, called OffCEM, that is based on the conjunct effect model (CEM), a novel decomposition of the causal effect into a cluster effect and a residual effect.
Experiments demonstrate that OffCEM provides substantial improvements in OPE especially in the presence of many actions.
arXiv Detail & Related papers (2023-05-14T04:16:40Z)
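The decomposition into a cluster effect and a residual effect can be sketched as follows; this is a simplified illustration in our own notation (cluster-level importance weights plus a reward regression for the residual), not the authors' implementation.

```python
import numpy as np

def conjunct_effect_estimate(x_idx, actions, rewards, clusters, pi_e, pi_0, f_hat):
    """Sketch of an OffCEM-style estimator: importance weights are computed at
    the level of action clusters (the cluster effect), while a regression
    model f_hat accounts for within-cluster differences (the residual effect).

    x_idx    : (n,) context index of each logged round.
    actions  : (n,) logged action indices.
    clusters : (n_actions,) cluster id of every action.
    pi_e, pi_0, f_hat : (n_contexts, n_actions) target/logging action
                        probabilities and predicted rewards.
    """
    n_clusters = clusters.max() + 1
    onehot = np.eye(n_clusters)[clusters]                 # (n_actions, n_clusters)
    p_c_e = pi_e @ onehot                                 # cluster marginals under pi_e
    p_c_0 = pi_0 @ onehot                                 # cluster marginals under pi_0
    c_logged = clusters[actions]
    w = p_c_e[x_idx, c_logged] / p_c_0[x_idx, c_logged]   # cluster-level weights
    residual = rewards - f_hat[x_idx, actions]            # residual-effect term
    direct = (pi_e[x_idx] * f_hat[x_idx]).sum(axis=1)     # model-based term
    return float(np.mean(w * residual + direct))
```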
- Off-policy evaluation for learning-to-rank via interpolating the item-position model and the position-based model [83.83064559894989]
A critical need for industrial recommender systems is the ability to evaluate recommendation policies offline, before deploying them to production.
We develop a new estimator that mitigates the problems of the two most popular off-policy estimators for rankings.
In particular, the new estimator, called INTERPOL, addresses the bias of a potentially misspecified position-based model.
arXiv Detail & Related papers (2022-10-15T17:22:30Z)
- Off-Policy Evaluation for Large Action Spaces via Embeddings [36.42838320396534]
Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems.
Existing OPE estimators degrade severely when the number of actions is large.
We propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space.
arXiv Detail & Related papers (2022-02-13T14:00:09Z)
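A stripped-down sketch of the marginalized-importance-weight idea: weight each logged reward by the ratio of the observed action embedding's probability under the target and logging policies, instead of the much noisier per-action ratio. The helper below assumes these marginal embedding probabilities are already available; names and numbers are illustrative only.

```python
import numpy as np

def mips_style_estimate(rewards, p_e_target, p_e_logging):
    """Off-policy value estimate using marginalized importance weights: each
    logged round provides the probability of the observed action embedding
    under the target and logging policies, and the estimator is a plain
    weighted average of rewards.
    """
    w = p_e_target / p_e_logging     # marginal weights pi_e(e|x) / pi_0(e|x)
    return float(np.mean(w * rewards))

# Toy usage (synthetic numbers, purely illustrative).
rewards = np.array([1.0, 0.0, 1.0, 1.0])
p_e_target = np.array([0.30, 0.10, 0.25, 0.40])
p_e_logging = np.array([0.20, 0.20, 0.25, 0.50])
print(mips_style_estimate(rewards, p_e_target, p_e_logging))
```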
- Learning to Estimate Without Bias [57.82628598276623]
The Gauss-Markov theorem states that the weighted least squares estimator is the linear minimum variance unbiased estimator (MVUE) in linear models.
In this paper, we take a first step towards extending this result to nonlinear settings via deep learning with bias constraints.
A second motivation for the resulting bias-constrained estimator (BCE) is in applications where multiple estimates of the same unknown are averaged for improved performance.
arXiv Detail & Related papers (2021-10-24T10:23:51Z)
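As a rough illustration of what training with bias constraints can mean, here is a minimal sketch of a loss that adds a squared-average-error penalty per group of samples sharing the same underlying parameter; this is our own construction in the spirit of the paper, not its exact objective.

```python
import numpy as np

def bias_constrained_loss(pred, theta_true, group_ids, lam=1.0):
    """Mean-squared error plus a penalty on the squared *mean* error within
    each group of samples generated from the same underlying parameter value.
    Driving the per-group mean error to zero pushes the learned estimator
    toward (empirical) unbiasedness, while the MSE term keeps variance low.
    """
    err = pred - theta_true
    mse = float(np.mean(err ** 2))
    groups = np.unique(group_ids)
    sq_bias = float(np.mean([np.mean(err[group_ids == g]) ** 2 for g in groups]))
    return mse + lam * sq_bias
```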
- Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation [53.83642844626703]
We provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation.
Our framework interprets a number of prior approaches as special cases and elucidates the bias and variance trade-off of Hessian estimates.
arXiv Detail & Related papers (2021-06-24T15:58:01Z)
- GenDICE: Generalized Offline Estimation of Stationary Values [108.17309783125398]
We show that effective estimation can still be achieved in important applications.
Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions.
The resulting algorithm, GenDICE, is straightforward and effective.
arXiv Detail & Related papers (2020-02-21T00:27:52Z)
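Once such a stationary distribution correction ratio has been estimated, using it is straightforward; the snippet below is a minimal sketch of that final step only (it does not implement the GenDICE optimization itself), with names of our choosing.

```python
import numpy as np

def ratio_weighted_value(ratios, rewards):
    """Given estimated correction ratios tau(s, a) ~ d_pi(s, a) / d_D(s, a)
    on logged transitions, the target policy's average reward is the
    ratio-weighted average of logged rewards.  As a sanity check, the ratios
    should average to roughly 1 under the logging distribution.
    """
    normalization = float(np.mean(ratios))     # should be close to 1.0
    value = float(np.mean(ratios * rewards))
    return value, normalization
```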