Related papers: Conformal Off-Policy Evaluation in Markov Decision Processes

Conformal Off-Policy Evaluation in Markov Decision Processes

URL: http://arxiv.org/abs/2304.02574v2
Date: Tue, 19 Sep 2023 14:40:48 GMT
Title: Conformal Off-Policy Evaluation in Markov Decision Processes
Authors: Daniele Foffano, Alessio Russo, Alexandre Proutiere
Abstract summary: Reinforcement Learning aims at identifying and evaluating efficient control policies from data. Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees. We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
Score: 53.786439742572995
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning aims at identifying and evaluating efficient control policies from data. In many real-world applications, the learner is not allowed to experiment and cannot gather data in an online manner (this is the case when experimenting is expensive, risky or unethical). For such applications, the reward of a given policy (the target policy) must be estimated using historical data gathered under a different policy (the behavior policy). Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees. We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty. The main challenge in OPE stems from the distribution shift due to the discrepancies between the target and the behavior policies. We propose and empirically evaluate different ways to deal with this shift. Some of these methods yield conformalized intervals with reduced length compared to existing approaches, while maintaining the same certainty level.

Related papers

Policy Dispersion in Non-Markovian Environment [53.05904889617441]
This paper tries to learn the diverse policies from the history of state-action pairs under a non-Markovian environment. We first adopt a transformer-based method to learn policy embeddings. Then, we stack the policy embeddings to construct a dispersion matrix to induce a set of diverse policies.
arXiv Detail & Related papers (2023-02-28T11:58:39Z)
Counterfactual Learning with General Data-generating Policies [3.441021278275805]
We develop an OPE method for a class of full support and deficient support logging policies in contextual-bandit settings. We prove that our method's prediction converges in probability to the true performance of a counterfactual policy as the sample size increases.
arXiv Detail & Related papers (2022-12-04T21:07:46Z)
Identification of Subgroups With Similar Benefits in Off-Policy Policy Evaluation [60.71312668265873]
We develop a method to balance the need for personalization with confident predictions. We show that our method can be used to form accurate predictions of heterogeneous treatment effects.
arXiv Detail & Related papers (2021-11-28T23:19:12Z)
Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy. We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance. Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
arXiv Detail & Related papers (2021-07-03T07:01:23Z)
Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy [8.807587076209566]
The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data obtained via a behavior policy. Because the contextual bandit updates the policy based on past observations, the samples are not independent and identically distributed. This paper tackles this problem by constructing an estimator from a martingale difference sequence (MDS) for the dependent samples.
arXiv Detail & Related papers (2020-10-23T15:22:57Z)
A Practical Guide of Off-Policy Evaluation for Bandit Problems [13.607327477092877]
Off-policy evaluation (OPE) is the problem of estimating the value of a target policy from samples obtained via different policies. We propose a meta-algorithm based on existing OPE estimators. We investigate the proposed concepts using synthetic and open real-world datasets in experiments.
arXiv Detail & Related papers (2020-10-23T15:11:19Z)
Distributionally Robust Batch Contextual Bandits [20.667213458836734]
Policy learning using historical observational data is an important problem that has found widespread applications. Existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment. In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data.
arXiv Detail & Related papers (2020-06-10T03:11:40Z)
Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous. In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist. We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation. Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.