Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning
- URL: http://arxiv.org/abs/2202.09667v1
- Date: Sat, 19 Feb 2022 20:00:44 GMT
- Title: Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning
- Authors: Nathan Kallus, Xiaojie Mao, Kaiwen Wang, Zhengyuan Zhou
- Abstract summary: Off-policy evaluation and learning (OPE/L) use offline observational data to make better decisions.
Recent work proposed distributionally robust OPE/L (DROPE/L) to remedy this, but the proposal relies on inverse-propensity weighting.
We propose the first DR algorithms for DROPE/L with KL-divergence uncertainty sets.
- Score: 59.02006924867438
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy evaluation and learning (OPE/L) use offline observational data to
make better decisions, which is crucial in applications where experimentation
is necessarily limited. OPE/L is nonetheless sensitive to discrepancies between
the data-generating environment and that where policies are deployed. Recent
work proposed distributionally robust OPE/L (DROPE/L) to remedy this, but the
proposal relies on inverse-propensity weighting, whose regret rates may
deteriorate if propensities are estimated and whose variance is suboptimal even
if not. For vanilla OPE/L, this is solved by doubly robust (DR) methods, but
they do not naturally extend to the more complex DROPE/L, which involves a
worst-case expectation. In this paper, we propose the first DR algorithms for
DROPE/L with KL-divergence uncertainty sets. For evaluation, we propose
Localized Doubly Robust DROPE (LDR$^2$OPE) and prove its semiparametric
efficiency under weak product rates conditions. Notably, thanks to a
localization technique, LDR$^2$OPE only requires fitting a small number of
regressions, just like DR methods for vanilla OPE. For learning, we propose
Continuum Doubly Robust DROPL (CDR$^2$OPL) and show that, under a product rate
condition involving a continuum of regressions, it enjoys a fast regret rate of
$\mathcal{O}(N^{-1/2})$ even when unknown propensities are nonparametrically
estimated. We further extend our results to general $f$-divergence uncertainty
sets. We illustrate the advantage of our algorithms in simulations.
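As a point of reference for the dual formulation underlying DROPE/L, the sketch below estimates the KL-robust value of a target policy from logged bandit data using the standard dual of the KL-constrained worst case, with the inner expectation estimated by plain inverse-propensity weighting. It is a minimal illustration under simplifying assumptions, not the paper's LDR$^2$OPE or CDR$^2$OPL estimators, and the input names (`rewards`, `behavior_prop`, `target_prop`, `delta`) are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

def kl_dro_value_ipw(rewards, behavior_prop, target_prop, delta):
    """Worst-case value of a target policy over a KL ball of radius `delta`
    around the data-generating distribution, from logged bandit data.

    Uses the standard duality
        inf_{Q: KL(Q||P) <= delta} E_Q[R]
            = sup_{alpha > 0} -alpha * log E_P[exp(-R/alpha)] - alpha * delta,
    with the inner expectation under the target policy estimated by plain
    inverse-propensity weighting (not the paper's doubly robust construction).
    """
    w = target_prop / behavior_prop   # importance weights pi(a|x) / pi_0(a|x)
    n = len(rewards)

    def neg_dual(alpha):
        # numerically stable log of the IPW estimate of E_P[exp(-R/alpha)]
        log_mgf = logsumexp(-rewards / alpha, b=w) - np.log(n)
        return alpha * log_mgf + alpha * delta   # negative dual objective

    # one-dimensional search over the dual variable alpha
    res = minimize_scalar(neg_dual, bounds=(1e-3, 1e3), method="bounded")
    return -res.fun
```

Roughly speaking, the paper's doubly robust estimators replace the single IPW average inside this dual with scores built from fitted regressions, which is how they attain the semiparametric efficiency and $\mathcal{O}(N^{-1/2})$ regret guarantees quoted in the abstract.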
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
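For context on the entry above, a textbook (single-step) form of the importance-sampled policy gradient that such active-sampling methods seek to make low-variance is
$$
\nabla_\theta J(\theta) \;\approx\; \frac{1}{N}\sum_{i=1}^{N} \frac{\pi_\theta(a_i \mid s_i)}{\beta(a_i \mid s_i)}\, \nabla_\theta \log \pi_\theta(a_i \mid s_i)\, R_i,
$$
where $\beta$ is the behavior policy used to collect the samples, $R_i$ is the observed return, and trajectory-level weights are products of per-step ratios. The cited paper's contribution is the choice of $\beta$ that reduces the variance of this estimator.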
- Double Cross-fit Doubly Robust Estimators: Beyond Series Regression [13.595329873577839]
Doubly robust estimators with cross-fitting have gained popularity in causal inference due to their favorable structure-agnostic error guarantees.
"double cross-fit doubly robust" (DCDR) estimators can be constructed by splitting the training data and undersmoothing nuisance function estimators on independent samples.
We show that an undersmoothed DCDR estimator satisfies a slower-than-$sqrtn$ central limit, and that inference is possible even in the non-$sqrtn$ regime.
arXiv Detail & Related papers (2024-03-22T12:59:03Z)
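As a companion to the cross-fitting entry above, here is a minimal sketch of a generic cross-fit doubly robust (AIPW) estimator of an average treatment effect. It illustrates only the basic cross-fit recipe; the DCDR estimator additionally splits the nuisance training data itself and undersmooths, which this sketch does not attempt. The function names and the random-forest nuisance models are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

def cross_fit_dr_ate(X, A, Y, n_splits=2, seed=0):
    """Cross-fit AIPW estimate of E[Y(1) - Y(0)] from numpy arrays (X, A, Y).

    Propensity e(x) and outcome regressions mu_1(x), mu_0(x) are fit on folds
    that exclude the evaluation points, then combined in the usual doubly
    robust score.
    """
    psi = np.zeros(len(Y))
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        prop = RandomForestClassifier(random_state=seed).fit(X[train], A[train])
        mu1 = RandomForestRegressor(random_state=seed).fit(
            X[train][A[train] == 1], Y[train][A[train] == 1])
        mu0 = RandomForestRegressor(random_state=seed).fit(
            X[train][A[train] == 0], Y[train][A[train] == 0])
        e = np.clip(prop.predict_proba(X[test])[:, 1], 0.01, 0.99)  # guard against tiny propensities
        m1, m0 = mu1.predict(X[test]), mu0.predict(X[test])
        a, y = A[test], Y[test]
        psi[test] = (m1 - m0
                     + a * (y - m1) / e
                     - (1 - a) * (y - m0) / (1 - e))
    return psi.mean()
```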
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC), that can be applied to either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Doubly Robust Proximal Causal Learning for Continuous Treatments [56.05592840537398]
We propose a kernel-based doubly robust causal learning estimator for continuous treatments.
We show that its oracle form is a consistent approximation of the influence function.
We then provide a comprehensive convergence analysis in terms of the mean square error.
arXiv Detail & Related papers (2023-09-22T12:18:53Z)
- Off-Policy Risk Assessment in Markov Decision Processes [15.225153671736201]
We develop the first doubly robust (DR) estimator for the CDF of returns in Markov decision processes (MDPs).
This estimator enjoys significantly less variance and, when the model is well specified, achieves the Cramér-Rao variance lower bound.
We derive the first minimax lower bounds for off-policy CDF and risk estimation, which match our error bounds up to a constant factor.
arXiv Detail & Related papers (2022-09-21T15:40:59Z)
- StableDR: Stabilized Doubly Robust Learning for Recommendation on Data Missing Not at Random [16.700598755439685]
We show that doubly robust (DR) methods are unstable, with unbounded bias, variance, and generalization bounds when propensities are extremely small.
We propose a stabilized doubly robust (StableDR) learning approach with a weaker reliance on extrapolation.
In addition, we propose a novel learning approach for StableDR that updates the imputation, propensity, and prediction models cyclically.
arXiv Detail & Related papers (2022-05-10T07:04:53Z)
- Doubly-Robust Estimation for Unbiased Learning-to-Rank from Position-Biased Click Feedback [13.579420996461439]
We introduce a novel DR estimator that uses the expectation of treatment per rank instead of IPS estimation.
Our results indicate that it requires several orders of magnitude fewer datapoints to converge to optimal performance.
arXiv Detail & Related papers (2022-03-31T15:38:25Z)
- Enhanced Doubly Robust Learning for Debiasing Post-click Conversion Rate Estimation [29.27760413892272]
Post-click conversion, a strong signal of user preference, is valuable for building recommender systems.
Currently, most existing methods utilize counterfactual learning to debias recommender systems.
We propose a novel double learning approach for the MRDR estimator, which converts the error imputation into a general CVR estimation task.
arXiv Detail & Related papers (2021-05-28T06:59:49Z)
- Continuous Doubly Constrained Batch Reinforcement Learning [93.23842221189658]
We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment.
The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that were insufficiently represented in the training data.
We propose to mitigate this issue via two straightforward penalties: a policy constraint that limits divergence from the behavior policy and a value constraint that discourages overly optimistic estimates.
arXiv Detail & Related papers (2021-02-18T08:54:14Z)
- Large-Scale Methods for Distributionally Robust Optimization [53.98643772533416]
We prove that our algorithms require a number of gradient evaluations independent of training set size and number of parameters.
Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods.
arXiv Detail & Related papers (2020-10-12T17:41:44Z)