Adaptive Estimator Selection for Off-Policy Evaluation
- URL: http://arxiv.org/abs/2002.07729v2
- Date: Mon, 24 Aug 2020 14:54:22 GMT
- Title: Adaptive Estimator Selection for Off-Policy Evaluation
- Authors: Yi Su, Pavithra Srinath, Akshay Krishnamurthy
- Abstract summary: We develop a generic data-driven method for estimator selection in off-policy policy evaluation settings.
We establish a strong performance guarantee for the method, showing that it is competitive with the oracle estimator, up to a constant factor.
- Score: 48.66170976187225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We develop a generic data-driven method for estimator selection in off-policy
policy evaluation settings. We establish a strong performance guarantee for the
method, showing that it is competitive with the oracle estimator, up to a
constant factor. Via in-depth case studies in contextual bandits and
reinforcement learning, we demonstrate the generality and applicability of the
method. We also perform comprehensive experiments, demonstrating the empirical
efficacy of our approach and comparing with related approaches. In both case
studies, our method compares favorably with existing methods.
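The abstract does not spell out the selection procedure, but a Lepski-style intersection-of-intervals rule is one way to realize oracle-competitive estimator selection. The sketch below is a minimal illustration under assumed orderings and interval widths, not the paper's exact algorithm; the function name and inputs are hypothetical.

```python
import numpy as np

def select_estimator(estimates, widths):
    """Lepski-style selection among candidate OPE estimates.

    Assumes candidates are ordered so that `widths` (confidence-interval
    half-widths, a proxy for deviation) are non-increasing while bias may
    grow -- an assumption of this sketch, not something checked here.
    Returns the index of the last candidate whose interval still overlaps
    the intervals of every earlier candidate.
    """
    estimates = np.asarray(estimates, dtype=float)
    widths = np.asarray(widths, dtype=float)
    lo, hi = estimates - widths, estimates + widths
    chosen = 0
    for i in range(1, len(estimates)):
        # advance to lower-variance candidates only while interval i
        # intersects all earlier intervals (Lepski-style stopping rule)
        if all(lo[i] <= hi[j] and hi[i] >= lo[j] for j in range(i)):
            chosen = i
        else:
            break
    return chosen
```

In practice the widths would come from variance or deviation bounds for each candidate estimator, and any guarantee depends on how those bounds are constructed.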
Related papers
- Online Estimation and Inference for Robust Policy Evaluation in Reinforcement Learning [7.875680651592574]
We develop an online robust policy evaluation procedure and establish the limiting distribution of our estimator based on its Bahadur representation.
This paper bridges the gap between robust statistics and statistical inference in reinforcement learning, offering a more versatile and reliable approach to policy evaluation.
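The summary leaves the robust estimator abstract; as a generic illustration of what "robust" can mean here, the sketch below performs an online Huber-type mean update on observed returns, which bounds the influence of outliers. This is a stand-in, not the paper's estimator; `delta` and `lr` are arbitrary choices.

```python
def huber_online_mean(returns, delta=1.0, lr=0.1):
    """Online M-estimation of a mean with a Huber influence function.

    Each observed return moves the estimate by at most lr * delta, so
    heavy-tailed or corrupted rewards have bounded influence.
    """
    theta = 0.0
    for r in returns:
        residual = r - theta
        # Huber influence: linear near zero, clipped beyond +/- delta
        theta += lr * max(-delta, min(delta, residual))
    return theta
```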
arXiv Detail & Related papers (2023-10-04T04:57:35Z)
- Counterfactual Learning with General Data-generating Policies [3.441021278275805]
We develop an OPE method for a class of full-support and deficient-support logging policies in contextual-bandit settings.
We prove that our method's prediction converges in probability to the true performance of a counterfactual policy as the sample size increases.
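As background, the vanilla inverse-propensity-scoring (IPS) estimator that such work generalizes is sketched below; plain IPS needs full support (every action the target policy takes must have positive logging probability), which is exactly the restriction the paper relaxes. Names and signatures here are illustrative.

```python
import numpy as np

def ips_estimate(contexts, actions, rewards, logging_probs, target_prob):
    """Vanilla IPS value estimate of a target policy from logged bandit data.

    target_prob(context, action) -> probability of the logged action under
    the target policy; logging_probs are the logger's propensities.
    Assumes full support: logging_probs > 0 wherever the target acts.
    """
    weights = np.array([target_prob(x, a) for x, a in zip(contexts, actions)])
    weights = weights / np.asarray(logging_probs, dtype=float)
    return float(np.mean(weights * np.asarray(rewards, dtype=float)))
```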
arXiv Detail & Related papers (2022-12-04T21:07:46Z)
- Efficient Real-world Testing of Causal Decision Making via Bayesian Experimental Design for Contextual Optimisation [12.37745209793872]
We introduce a model-agnostic framework for gathering data to evaluate and improve contextual decision making.
Our method is used for the data-efficient evaluation of the regret of past treatment assignments.
arXiv Detail & Related papers (2022-07-12T01:20:11Z)
- Towards Better Understanding Attribution Methods [77.1487219861185]
Post-hoc attribution methods have been proposed to identify image regions most influential to the models' decisions.
We propose three novel evaluation schemes to more reliably measure the faithfulness of those methods.
We also propose a post-processing smoothing step that significantly improves the performance of some attribution methods.
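The smoothing step is described only at a high level; one simple instantiation is a Gaussian blur of the attribution map, as below. The kernel width `sigma` is an arbitrary choice, not a value from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_attribution(attribution_map, sigma=2.0):
    """Blur a 2-D attribution/saliency map to suppress pixel-level noise."""
    return gaussian_filter(np.asarray(attribution_map, dtype=float), sigma=sigma)
```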
arXiv Detail & Related papers (2022-05-20T20:50:17Z)
- Safe Exploration for Efficient Policy Evaluation and Comparison [20.97686379166058]
We study efficient and safe data collection for bandit policy evaluation.
For each variant, we analyze its statistical properties, derive the corresponding exploration policy, and design an efficient algorithm for computing it.
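The paper derives its exploration policies from its own objectives; as a rough illustration of the kind of derivation involved, the sketch below implements the classical variance-minimizing logging distribution for IPS evaluation of a single target policy, mu(a|x) proportional to pi(a|x) * sqrt(E[r^2|x,a]), with a crude mixing knob standing in for a safety constraint. All names and the mixing scheme are assumptions.

```python
import numpy as np

def logging_distribution(target_probs, second_moments, safety_mix=0.5):
    """Variance-minimizing logger for IPS evaluation, mixed with the target.

    target_probs[a]    -- pi(a | x) for a fixed context x
    second_moments[a]  -- estimate of E[r^2 | x, a]
    Returns a distribution over actions for data collection.
    """
    pi = np.asarray(target_probs, dtype=float)
    raw = pi * np.sqrt(np.asarray(second_moments, dtype=float))
    mu = raw / raw.sum() if raw.sum() > 0 else pi
    # convex mixture with the target policy as a crude safety constraint
    return safety_mix * pi + (1.0 - safety_mix) * mu
```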
arXiv Detail & Related papers (2022-02-26T21:41:44Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
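A rough sketch of the "upper confidence bound of the critics" idea: score actions by the ensemble mean plus a multiple of the ensemble spread and let the exploration policy prefer high-scoring actions. The bonus coefficient and the softmax form are assumptions, and the paper trains a separate exploration policy rather than computing it in closed form.

```python
import numpy as np

def optimistic_scores(critic_q_values, bonus=1.0):
    """critic_q_values: shape (n_critics, n_actions) for one state.

    Returns per-action UCB scores: ensemble mean + bonus * ensemble std.
    """
    q = np.asarray(critic_q_values, dtype=float)
    return q.mean(axis=0) + bonus * q.std(axis=0)

def exploration_policy(critic_q_values, temperature=1.0, bonus=1.0):
    """Softmax over the optimistic scores (illustrative, not the paper's)."""
    scores = optimistic_scores(critic_q_values, bonus) / temperature
    scores -= scores.max()  # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()
```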
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
- Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with state-of-the-art methods on locomotion tasks in terms of both sample efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
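One generic way to turn logged data into both pessimistic and optimistic cumulative-reward estimates is to bootstrap an off-policy estimator and read off interval endpoints; the sketch below does this for importance-weighted returns and is not the paper's specific framework. All names are illustrative.

```python
import numpy as np

def bootstrap_ope_interval(weighted_returns, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the mean importance-weighted return.

    weighted_returns[i] is the importance-weighted return of trajectory i.
    Returns (pessimistic, point, optimistic) estimates.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(weighted_returns, dtype=float)
    means = np.array([rng.choice(data, size=data.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(data.mean()), float(hi)
```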
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
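The recipe can be read as: regress the logged action on (state, observed return) with ordinary supervised learning, then act by conditioning on an ambitious target return. The sketch below uses a generic scikit-learn classifier as a stand-in for the policy network; every name and the choice of model are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_reward_conditioned_policy(states, actions, returns):
    """Supervised 'policy': predict the logged action from (state, return)."""
    X = np.column_stack([np.asarray(states, dtype=float),
                         np.asarray(returns, dtype=float)])
    return LogisticRegression(max_iter=1000).fit(X, np.asarray(actions))

def act(policy, state, target_return):
    """Act by conditioning on a high target return at test time."""
    x = np.append(np.asarray(state, dtype=float), target_return).reshape(1, -1)
    return policy.predict(x)[0]
```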
arXiv Detail & Related papers (2019-12-31T18:07:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.