Local Metric Learning for Off-Policy Evaluation in Contextual Bandits
with Continuous Actions
- URL: http://arxiv.org/abs/2210.13373v2
- Date: Tue, 25 Oct 2022 20:33:22 GMT
- Title: Local Metric Learning for Off-Policy Evaluation in Contextual Bandits
with Continuous Actions
- Authors: Haanvid Lee, Jongmin Lee, Yunseon Choi, Wonseok Jeon, Byung-Jun Lee,
Yung-Kyun Noh, Kee-Eung Kim
- Abstract summary: We consider local kernel metric learning for off-policy evaluation (OPE) of deterministic policies in contextual bandits with continuous action spaces.
We present an analytic solution for the optimal metric, based on the analysis of bias and variance.
- Score: 33.96450847451234
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We consider local kernel metric learning for off-policy evaluation (OPE) of
deterministic policies in contextual bandits with continuous action spaces. Our
work is motivated by practical scenarios where the target policy needs to be
deterministic due to domain requirements, such as prescription of treatment
dosage and duration in medicine. Although importance sampling (IS) provides a
basic principle for OPE, it is ill-posed for the deterministic target policy
with continuous actions. Our main idea is to relax the target policy and pose
the problem as kernel-based estimation, where we learn the kernel metric in
order to minimize the overall mean squared error (MSE). We present an analytic
solution for the optimal metric, based on the analysis of bias and variance.
Whereas prior work has been limited to scalar action spaces or kernel bandwidth
selection, our method goes a step further, handling vector-valued action spaces
and optimizing the full kernel metric. We show that our estimator is consistent
and that it significantly reduces the MSE compared to baseline OPE methods in
experiments on various domains.
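To make the idea concrete, here is a minimal sketch of a kernel-relaxed importance-sampling estimator for a deterministic target policy. This is not the authors' implementation: the Gaussian kernel form, the fixed bandwidth h, and names such as behavior_pdf are illustrative assumptions, and the metric A is supplied by the caller (the paper derives the optimal A analytically).

```python
import numpy as np

def kernel_ope_estimate(contexts, actions, rewards, behavior_pdf,
                        target_policy, A, h=0.5):
    """Kernel-relaxed IS estimate of a deterministic policy's value.

    contexts:      (n, dx) logged contexts
    actions:       (n, da) logged actions drawn from the behavior policy
    rewards:       (n,)    logged rewards
    behavior_pdf:  (n,)    behavior density of each logged action
    target_policy: maps a context to a deterministic action of shape (da,)
    A:             (da, da) positive-definite kernel metric
    h:             kernel bandwidth
    """
    diff = actions - np.array([target_policy(x) for x in contexts])
    da = actions.shape[1]
    # Mahalanobis distance of each logged action from the target action
    d2 = np.einsum("ni,ij,nj->n", diff, A, diff)
    # Gaussian kernel with metric A, normalized to integrate to 1 over actions
    k = (np.sqrt(np.linalg.det(A)) / (2 * np.pi * h**2) ** (da / 2)
         * np.exp(-d2 / (2 * h**2)))
    # the kernel relaxes the Dirac delta of the deterministic target policy
    return np.mean(k / behavior_pdf * rewards)
```

With A proportional to the identity this reduces to ordinary bandwidth selection; the paper's analytic solution instead shapes A to minimize the overall MSE.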
Related papers
- Kernel Metric Learning for In-Sample Off-Policy Evaluation of Deterministic RL Policies [24.706986328622193]
We consider off-policy evaluation of deterministic target policies for reinforcement learning.
We learn the kernel metrics that minimize the overall mean squared error of the estimated temporal difference update vector of an action value function.
We derive the bias and variance of the estimation error due to this relaxation and provide analytic solutions for the optimal kernel metric.
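A minimal sketch of the kernel-relaxed TD idea described above, assuming a linear Q-function; the helper names (feat, mu_pdf) and the isotropic Gaussian kernel are illustrative assumptions, not the paper's learned metric.

```python
import numpy as np

def kernel_td_update(w, feat, transition, pi, mu_pdf,
                     gamma=0.99, h=0.5, alpha=0.1):
    """One kernel-relaxed TD(0) step for evaluating a deterministic policy pi.

    w:          weights of a linear Q-function q(s, a) = w @ feat(s, a)
    transition: (s, a, r, s_next, a_next), with a_next logged under behavior mu
    mu_pdf:     behavior density mu(a_next | s_next)
    """
    s, a, r, s_next, a_next = transition
    u = (a_next - pi(s_next)) / h
    # isotropic Gaussian kernel relaxing the Dirac delta at pi(s_next)
    k = np.exp(-0.5 * u @ u) / ((2 * np.pi) ** (len(u) / 2) * h ** len(u))
    rho = k / mu_pdf(a_next, s_next)   # kernel-relaxed importance weight
    td_error = r + gamma * rho * (w @ feat(s_next, a_next)) - w @ feat(s, a)
    return w + alpha * td_error * feat(s, a)
```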
arXiv Detail & Related papers (2024-05-29T06:17:33Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
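As a toy illustration of this practice (not the paper's algorithm), the sketch below learns a Gaussian policy on a one-dimensional bandit via REINFORCE and then deploys only its mean; sigma plays the role of the exploration level that the paper proposes to tune.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(a):
    # toy continuous-action objective peaking at a = 1.0
    return -(a - 1.0) ** 2 + 0.1 * rng.normal()

mu, sigma, lr, baseline = 0.0, 0.5, 0.02, 0.0
for _ in range(5000):
    a = mu + sigma * rng.normal()          # explore with a stochastic policy
    r = reward(a)
    baseline += 0.05 * (r - baseline)      # running baseline reduces variance
    # REINFORCE gradient of expected reward w.r.t. the Gaussian mean
    mu += lr * (r - baseline) * (a - mu) / sigma**2

deployed_action = mu                       # deploy the deterministic version
```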
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Off-Policy Evaluation with Policy-Dependent Optimization Response [90.28758112893054]
We develop a new framework for off-policy evaluation with a policy-dependent linear optimization response.
We construct unbiased estimators for the policy-dependent estimand by a perturbation method.
We provide a general algorithm for optimizing causal interventions.
arXiv Detail & Related papers (2022-02-25T20:25:37Z)
- Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes [65.91730154730905]
In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors.
Here we tackle this by considering off-policy evaluation in a partially observed Markov decision process (POMDP).
We extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible.
arXiv Detail & Related papers (2021-10-28T17:46:14Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
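A rough sketch of one variance-weighted fitted Q-evaluation step with linear features; this is a generic reconstruction under assumed inputs (feature matrices Phi and Phi_next, per-sample variance estimates var), not the authors' exact VA-OPE algorithm.

```python
import numpy as np

def variance_weighted_fqe(Phi, rewards, Phi_next, var,
                          gamma=0.99, n_iters=50, reg=1e-3):
    """Fitted Q-evaluation: regress Bellman targets onto linear features,
    reweighting each sample's Bellman residual by its inverse variance.

    Phi:      (n, d) features of logged (state, action) pairs
    Phi_next: (n, d) features of the next state with the target policy's action
    var:      (n,)   estimated variance of the value at each sample
    """
    n, d = Phi.shape
    w = np.zeros(d)
    W = 1.0 / np.maximum(var, 1e-6)         # inverse-variance weights
    for _ in range(n_iters):
        y = rewards + gamma * Phi_next @ w  # Bellman targets under current w
        # weighted ridge regression: argmin_w sum_i W_i (y_i - phi_i @ w)^2
        G = Phi.T @ (W[:, None] * Phi) + reg * np.eye(d)
        w = np.linalg.solve(G, Phi.T @ (W * y))
    return w
```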
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- Robust Batch Policy Learning in Markov Decision Processes [0.0]
We study the offline, data-driven sequential decision-making problem in the framework of a Markov decision process (MDP).
We propose to evaluate each policy by a set of average rewards with respect to distributions centered at the policy-induced stationary distribution.
arXiv Detail & Related papers (2020-11-09T04:41:21Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
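A minimal sketch of one such kernelized doubly robust value estimator: the direct term evaluates a fitted outcome model at the target action, and a kernel-weighted residual corrects it. The Gaussian kernel and the names q_hat and behavior_pdf are assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

def dr_kernel_value(contexts, actions, rewards, behavior_pdf,
                    target_policy, q_hat, h=0.5):
    """Doubly robust value estimate for a deterministic continuous-action policy.

    q_hat(x, a) is a fitted outcome model; behavior_pdf holds the behavior
    density of each logged action. The kernel stands in for the density
    ratio, which does not exist for a deterministic target policy.
    """
    pi_a = np.array([target_policy(x) for x in contexts])
    da = actions.shape[1]
    u = (actions - pi_a) / h
    # isotropic Gaussian kernel centered at the target policy's action
    k = np.exp(-0.5 * np.sum(u * u, axis=1)) / ((2 * np.pi) ** (da / 2) * h ** da)
    direct = np.array([q_hat(x, a) for x, a in zip(contexts, pi_a)])
    resid = rewards - np.array([q_hat(x, a) for x, a in zip(contexts, actions)])
    return np.mean(direct + (k / behavior_pdf) * resid)
```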
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Statistical Inference of the Value Function for Reinforcement Learning in Infinite Horizon Settings [0.0]
We construct confidence intervals (CIs) for a policy's value in infinite horizon settings where the number of decision points diverges to infinity.
We show that the proposed CI achieves nominal coverage even in cases where the optimal policy is not unique.
We apply the proposed method to a dataset from mobile health studies and find that reinforcement learning algorithms could help improve patients' health status.
arXiv Detail & Related papers (2020-01-13T19:42:40Z)