SOPE: Spectrum of Off-Policy Estimators
- URL: http://arxiv.org/abs/2111.03936v1
- Date: Sat, 6 Nov 2021 18:29:21 GMT
- Title: SOPE: Spectrum of Off-Policy Estimators
- Authors: Christina J. Yuan, Yash Chandak, Stephen Giguere, Philip S. Thomas,
Scott Niekum
- Abstract summary: We show the existence of a spectrum of estimators whose endpoints are SIS and IS.
We provide empirical evidence that estimators in this spectrum can be used to trade off between the bias and variance of IS and SIS.
- Score: 40.15700429288981
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many sequential decision making problems are high-stakes and require
off-policy evaluation (OPE) of a new policy using historical data collected
using some other policy. One of the most common OPE techniques that provides
unbiased estimates is trajectory based importance sampling (IS). However, due
to the high variance of trajectory IS estimates, importance sampling methods
based on state-action visitation distributions (SIS) have recently been
adopted. Unfortunately, while SIS often provides lower variance estimates for
long horizons, estimating the state-action distribution ratios can be
challenging and lead to biased estimates. In this paper, we present a new
perspective on this bias-variance trade-off and show the existence of a
spectrum of estimators whose endpoints are SIS and IS. Additionally, we
establish a spectrum for doubly-robust and weighted versions of these
estimators. We provide empirical evidence that estimators in this spectrum can
be used to trade off between the bias and variance of IS and SIS and can
achieve lower mean-squared error than both IS and SIS.
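To make the interpolation described in the abstract concrete, here is a minimal sketch of a spectrum estimator of this kind (not the authors' implementation): a parameter n controls how many of the most recent steps are corrected with exact per-step importance weights, while earlier steps are absorbed into an estimated state-action visitation ratio. The trajectory format, the policy functions `pi` and `beta`, and the density-ratio estimate `d_ratio` are illustrative assumptions.

```python
# Minimal sketch of a spectrum estimator interpolating between SIS and IS.
# Assumptions (illustrative, not the paper's code): each trajectory is a list
# of (state, action, reward) tuples; pi(a, s) and beta(a, s) return action
# probabilities under the evaluation and behavior policies; d_ratio(s, a) is
# an estimate of the state-action visitation ratio d^pi(s, a) / d^beta(s, a).

def spectrum_estimate(trajectories, pi, beta, d_ratio, n, gamma=1.0):
    """n = 0 recovers SIS; n >= horizon recovers per-decision IS."""
    total = 0.0
    for traj in trajectories:
        for t, (_, _, r_t) in enumerate(traj):
            start = t - n
            if start < 0:
                # Not enough history for a visitation correction: fall back to
                # the full importance-sampling product over steps 0..t.
                weight, start = 1.0, -1
            else:
                s0, a0, _ = traj[start]
                weight = d_ratio(s0, a0)
            # Exact per-step importance weights for the most recent n steps.
            for k in range(start + 1, t + 1):
                s_k, a_k, _ = traj[k]
                weight *= pi(a_k, s_k) / beta(a_k, s_k)
            total += (gamma ** t) * weight * r_t
    return total / len(trajectories)
```

Intermediate values of n trade the bias of the estimated visitation ratio against the variance of long importance-weight products, which is the bias-variance trade-off the abstract describes.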
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Scaling Marginalized Importance Sampling to High-Dimensional State-Spaces via State Abstraction [5.150752343250592]
We consider the problem of off-policy evaluation in reinforcement learning (RL).
We propose to improve the accuracy of OPE estimators by projecting the high-dimensional state-space into a low-dimensional state-space.
arXiv Detail & Related papers (2022-12-14T20:07:33Z)
- Excess risk analysis for epistemic uncertainty with application to variational inference [110.4676591819618]
We present a novel EU analysis in the frequentist setting, where data is generated from an unknown distribution.
We show a relation between the generalization ability and the widely used EU measurements, such as the variance and entropy of the predictive distribution.
We propose new variational inference that directly controls the prediction and EU evaluation performances based on the PAC-Bayesian theory.
arXiv Detail & Related papers (2022-06-02T12:12:24Z)
- Off-Policy Evaluation for Large Action Spaces via Embeddings [36.42838320396534]
Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems.
Existing OPE estimators degrade severely when the number of actions is large.
We propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space.
arXiv Detail & Related papers (2022-02-13T14:00:09Z)
- State Relevance for Off-Policy Evaluation [29.891687579606277]
We introduce Omitting-States-Irrelevant-to-Return Importance Sampling (OSIRIS), an estimator which reduces variance by strategically omitting likelihood ratios associated with certain states.
We formalize the conditions under which OSIRIS is unbiased and has lower variance than ordinary importance sampling.
arXiv Detail & Related papers (2021-09-13T20:40:55Z)
- Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits [5.144809478361604]
We improve the doubly robust (DR) estimator by adaptively weighting observations to control its variance.
We provide empirical evidence for our estimator's improved accuracy and inferential properties relative to existing alternatives.
arXiv Detail & Related papers (2021-06-03T17:54:44Z)
- Universal Off-Policy Evaluation [64.02853483874334]
We take the first steps towards a universal off-policy estimator (UnO).
We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns.
arXiv Detail & Related papers (2021-04-26T18:54:31Z)
- Optimal Off-Policy Evaluation from Multiple Logging Policies [77.62012545592233]
We study off-policy evaluation from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling.
We find the OPE estimator for multiple loggers with minimum variance for any instance, i.e., the efficient one.
arXiv Detail & Related papers (2020-10-21T13:43:48Z)
- GenDICE: Generalized Offline Estimation of Stationary Values [108.17309783125398]
We show that effective estimation can still be achieved in important applications.
Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions.
The resulting algorithm, GenDICE, is straightforward and effective.
arXiv Detail & Related papers (2020-02-21T00:27:52Z)
- The Counterfactual $\chi$-GAN [20.42556178617068]
Causal inference often relies on the counterfactual framework, which requires that treatment assignment is independent of the outcome.
This work proposes a generative adversarial network (GAN)-based model called the Counterfactual $\chi$-GAN (cGAN).
arXiv Detail & Related papers (2020-01-09T17:23:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of any of the information presented and is not responsible for any consequences.