Off-Policy Risk Assessment in Contextual Bandits
- URL: http://arxiv.org/abs/2104.08977v1
- Date: Sun, 18 Apr 2021 23:27:40 GMT
- Title: Off-Policy Risk Assessment in Contextual Bandits
- Authors: Audrey Huang, Liu Leqi, Zachary C. Lipton, Kamyar Azizzadenesheli
- Abstract summary: We introduce the class of Lipschitz risk functionals, which subsumes many common functionals.
For Lipschitz risk functionals, the error in off-policy risk estimation is bounded by the error in off-policy estimation of the cumulative distribution function (CDF) of rewards.
We propose Off-Policy Risk Assessment (OPRA), an algorithm that estimates the target policy's CDF of rewards and generates a plug-in estimate of the risk.
- Score: 32.97618081988295
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To evaluate prospective contextual bandit policies when experimentation is
not possible, practitioners often rely on off-policy evaluation, using data
collected under a behavioral policy. While off-policy evaluation studies
typically focus on the expected return, practitioners often care about other
functionals of the reward distribution (e.g., to express aversion to risk). In
this paper, we first introduce the class of Lipschitz risk functionals, which
subsumes many common functionals, including variance, mean-variance, and
conditional value-at-risk (CVaR). For Lipschitz risk functionals, the error in
off-policy risk estimation is bounded by the error in off-policy estimation of
the cumulative distribution function (CDF) of rewards. Second, we propose
Off-Policy Risk Assessment (OPRA), an algorithm that (i) estimates the target
policy's CDF of rewards; and (ii) generates a plug-in estimate of the risk.
Given a collection of Lipschitz risk functionals, OPRA provides estimates for
each with corresponding error bounds that hold simultaneously. We analyze both
importance sampling and variance-reduced doubly robust estimators of the CDF.
Our primary theoretical contributions are (i) the first concentration
inequalities for both types of CDF estimators and (ii) guarantees on our
Lipschitz risk functional estimates, which converge at a rate of O(1/\sqrt{n}).
For practitioners, OPRA offers a practical solution for providing
high-confidence assessments of policies using a collection of relevant metrics.
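The recipe in the abstract can be illustrated in a few lines. The sketch below is not the authors' code: it assumes logged tuples with recorded behavior propensities mu(a_i|x_i), target propensities pi(a_i|x_i), and rewards supported on a known grid (e.g., [0, 1]); it uses the basic importance-sampling form of the CDF estimate, and the function names, the clipping, and the monotonization step are illustrative choices.

```python
import numpy as np

def is_cdf_estimate(rewards, target_probs, behavior_probs, grid):
    """Importance-sampling estimate of the target policy's reward CDF,
    F_hat(t) = (1/n) * sum_i w_i * 1{r_i <= t}, with w_i = pi(a_i|x_i) / mu(a_i|x_i),
    evaluated at every threshold t in `grid`."""
    r = np.asarray(rewards, dtype=float)
    w = np.asarray(target_probs, dtype=float) / np.asarray(behavior_probs, dtype=float)
    grid = np.asarray(grid, dtype=float)
    raw = np.mean(w[:, None] * (r[:, None] <= grid[None, :]), axis=0)
    # Project onto valid CDFs: clip to [0, 1] and enforce monotonicity in t.
    return np.maximum.accumulate(np.clip(raw, 0.0, 1.0))

def plugin_risks(grid, cdf, alpha=0.1, n_quantiles=1000):
    """Plug-in estimates of several Lipschitz risk functionals from a single CDF
    estimate: mean, variance, and CVaR_alpha (expected reward over the worst
    alpha-fraction of outcomes)."""
    grid = np.asarray(grid, dtype=float)
    pmf = np.diff(np.concatenate(([0.0], cdf)))      # probability mass on each grid point
    mean = float(np.sum(grid * pmf))
    var = float(np.sum((grid - mean) ** 2 * pmf))
    # CVaR_alpha = (1/alpha) * integral_0^alpha Q(u) du, with Q(u) = min{t : F(t) >= u},
    # approximated by averaging the generalized-inverse quantiles on a fine grid in (0, alpha).
    us = alpha * (np.arange(n_quantiles) + 0.5) / n_quantiles
    idx = np.minimum(np.searchsorted(cdf, us, side="left"), len(grid) - 1)
    cvar = float(np.mean(grid[idx]))
    return {"mean": mean, "variance": var, f"cvar_{alpha}": cvar}

# Example usage on logged bandit data with rewards in [0, 1]:
# grid = np.linspace(0.0, 1.0, 101)
# cdf = is_cdf_estimate(rewards, target_probs, behavior_probs, grid)
# risks = plugin_risks(grid, cdf, alpha=0.05)
```

Because the paper's error bounds hold simultaneously over the whole CDF, a single estimate like `cdf` above can back plug-in estimates of many Lipschitz risk functionals at once. A doubly-robust-style variant of the CDF step, in the spirit of the DR estimators mentioned in the abstract and in the MDP paper listed below, is sketched after the related-papers list.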
Related papers
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z) - Policy Evaluation in Distributional LQR [70.63903506291383]
We provide a closed-form expression of the distribution of the random return.
We show that this distribution can be approximated by a finite number of random variables.
Using the approximate return distribution, we propose a zeroth-order policy gradient algorithm for risk-averse LQR.
arXiv Detail & Related papers (2023-03-23T20:27:40Z) - Improved Policy Evaluation for Randomized Trials of Algorithmic Resource
Allocation [54.72195809248172]
We present a new estimator leveraging a novel concept that involves retrospectively reshuffling participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z) - Off-Policy Risk Assessment in Markov Decision Processes [15.225153671736201]
We develop the first doubly robust (DR) estimator for the CDF of returns in Markov decision processes (MDPs).
This estimator enjoys significantly less variance and, when the model is well specified, achieves the Cramer-Rao variance lower bound.
We derive the first minimax lower bounds for off-policy CDF and risk estimation, which match our error bounds up to a constant factor.
arXiv Detail & Related papers (2022-09-21T15:40:59Z) - A Risk-Sensitive Approach to Policy Optimization [21.684251937825234]
Standard deep reinforcement learning (DRL) aims to maximize expected reward, considering collected experiences equally in formulating a policy.
We propose a more direct approach whereby risk-sensitive objectives, specified in terms of the cumulative distribution function (CDF) of the distribution of full-episode rewards, are optimized.
We demonstrate that the use of moderately "pessimistic" risk profiles, which emphasize scenarios where the agent performs poorly, leads to enhanced exploration and a continual focus on addressing deficiencies.
arXiv Detail & Related papers (2022-08-19T00:55:05Z) - Supervised Learning with General Risk Functionals [28.918233583859134]
Standard uniform convergence results bound the generalization gap of the expected loss over a hypothesis class.
We establish the first uniform convergence results for estimating the CDF of the loss distribution, yielding guarantees that hold simultaneously both over all Hölder risk functionals and over all hypotheses.
arXiv Detail & Related papers (2022-06-27T22:11:05Z) - Risk averse non-stationary multi-armed bandits [0.0]
This paper tackles the risk averse multi-armed bandits problem when incurred losses are non-stationary.
Two estimation methods are proposed for this objective function in the presence of non-stationary losses.
Such estimates can then be embedded into classic arm selection methods such as epsilon-greedy policies.
arXiv Detail & Related papers (2021-09-28T18:34:54Z) - Off-Policy Evaluation of Slate Policies under Bayes Risk [70.10677881866047]
We study the problem of off-policy evaluation for slate bandits, for the typical case in which the logging policy factorizes over the slots of the slate.
We show that the risk improvement over PI grows linearly with the number of slots, and linearly with the gap between the arithmetic and the harmonic mean of a set of slot-level divergences.
arXiv Detail & Related papers (2021-01-05T20:07:56Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with a variance risk criterion.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.