Off-Policy Evaluation with Policy-Dependent Optimization Response
- URL: http://arxiv.org/abs/2202.12958v1
- Date: Fri, 25 Feb 2022 20:25:37 GMT
- Title: Off-Policy Evaluation with Policy-Dependent Optimization Response
- Authors: Wenshuo Guo, Michael I. Jordan, Angela Zhou
- Abstract summary: We develop a new framework for off-policy evaluation with a \textit{policy-dependent} linear optimization response.
We construct unbiased estimators for the policy-dependent estimand by a perturbation method.
We provide a general algorithm for optimizing causal interventions.
- Score: 90.28758112893054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The intersection of causal inference and machine learning for decision-making
is rapidly expanding, but the default decision criterion remains an
\textit{average} of individual causal outcomes across a population. In
practice, various operational restrictions ensure that a decision-maker's
utility is not realized as an \textit{average} but rather as an \textit{output}
of a downstream decision-making problem (such as matching, assignment, network
flow, minimizing predictive risk). In this work, we develop a new framework for
off-policy evaluation with a \textit{policy-dependent} linear optimization
response: causal outcomes introduce stochasticity in objective function
coefficients. In this framework, a decision-maker's utility depends on the
policy-dependent optimization, which introduces a fundamental challenge of
\textit{optimization} bias even for the case of policy evaluation. We construct
unbiased estimators for the policy-dependent estimand by a perturbation method.
We also discuss the asymptotic variance properties for a set of plug-in
regression estimators adjusted to be compatible with that perturbation method.
Lastly, attaining unbiased policy evaluation allows for policy optimization,
and we provide a general algorithm for optimizing causal interventions. We
corroborate our theoretical results with numerical simulations.
Related papers
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z)
- Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of Neu (2015) yields performance guarantees that are superior in nearly all possible terms to all previous results.
arXiv Detail & Related papers (2023-09-27T16:42:10Z)
- Policy Gradient Algorithms Implicitly Optimize by Continuation [7.351769270728942]
We argue that exploration in policy-gradient algorithms consists in a continuation of the return of the policy at hand, and that the variance of policies should be history-dependent functions adapted to avoid local extrema rather than to maximize the return.
arXiv Detail & Related papers (2023-05-11T14:50:20Z)
- Randomized Policy Optimization for Optimal Stopping [0.0]
We propose a new methodology for optimal stopping based on randomized linear policies.
We show that our approach can substantially outperform state-of-the-art methods.
arXiv Detail & Related papers (2022-03-25T04:33:15Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- On the Optimality of Batch Policy Optimization Algorithms [106.89498352537682]
Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment.
We show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral.
We introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.
arXiv Detail & Related papers (2021-04-06T05:23:20Z)
- Chance Constrained Policy Optimization for Process Control and Optimization [1.4908563154226955]
Chemical process optimization and control are affected by 1) plant-model mismatch, 2) process disturbances, and 3) constraints for safe operation.
We propose a chance constrained policy optimization algorithm which guarantees the satisfaction of joint chance constraints with a high probability.
arXiv Detail & Related papers (2020-07-30T14:20:35Z)
- Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis [102.29671176698373]
We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$-error under a generative model.
We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms.
arXiv Detail & Related papers (2020-03-16T17:15:28Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.