Off-Policy Risk Assessment in Markov Decision Processes
- URL: http://arxiv.org/abs/2209.10444v1
- Date: Wed, 21 Sep 2022 15:40:59 GMT
- Title: Off-Policy Risk Assessment in Markov Decision Processes
- Authors: Audrey Huang, Liu Leqi, Zachary Chase Lipton, Kamyar Azizzadenesheli
- Abstract summary: We develop the first doubly robust (DR) estimator for the CDF of returns in Markov decision processes (MDPs)
This estimator enjoys significantly less variance and, when the model is well specified, achieves the Cramer-Rao variance lower bound.
We derive the first minimax lower bounds for off-policy CDF and risk estimation, which match our error bounds up to a constant factor.
- Score: 15.225153671736201
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Addressing such diverse ends as safety, alignment with human preferences, and
the efficiency of learning, a growing line of reinforcement learning research
focuses on risk functionals that depend on the entire distribution of returns.
Recent work on \emph{off-policy risk assessment} (OPRA) for contextual bandits
introduced consistent estimators for the target policy's CDF of returns along
with finite sample guarantees that extend to (and hold simultaneously over) all
risk functionals. In this paper, we lift OPRA to Markov decision processes (MDPs), where
importance sampling (IS) CDF estimators suffer high variance on longer
trajectories due to small effective sample size. To mitigate these problems, we
incorporate model-based estimation to develop the first doubly robust (DR)
estimator for the CDF of returns in MDPs. This estimator enjoys significantly
less variance and, when the model is well specified, achieves the Cramer-Rao
variance lower bound. Moreover, for many risk functionals, the downstream
estimates enjoy both lower bias and lower variance. Additionally, we derive the
first minimax lower bounds for off-policy CDF and risk estimation, which match
our error bounds up to a constant factor. Finally, we demonstrate the precision
of our DR CDF estimates experimentally on several different environments.
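As a rough illustration of the estimation problem the paper starts from (plain importance sampling, not the proposed doubly robust estimator), the sketch below builds a trajectory-importance-weighted empirical CDF of returns from logged data. The data layout, function names, and the self-normalization step are assumptions made for illustration.

```python
import numpy as np

def is_weighted_cdf(trajectories, target_policy, behavior_policy, gamma=0.99):
    """Importance-sampling sketch of the target policy's CDF of returns.

    `trajectories` is assumed to be a list of trajectories, each a list of
    (state, action, reward) tuples logged under the behavior policy; the
    policies are assumed to map (state, action) to a probability. This is
    plain (self-normalized) IS, not the paper's doubly robust estimator.
    """
    returns, weights = [], []
    for traj in trajectories:
        w, g = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            w *= target_policy(s, a) / behavior_policy(s, a)  # per-step likelihood ratio
            g += (gamma ** t) * r                             # discounted return
        returns.append(g)
        weights.append(w)
    returns, weights = np.asarray(returns), np.asarray(weights)
    weights = weights / weights.sum()  # self-normalization (weighted IS)

    def cdf(x):
        # Estimated P(return <= x) under the target policy.
        return float(weights[returns <= x].sum())

    return cdf
```

On long horizons the product of per-step likelihood ratios concentrates the weight on a few trajectories, which is the small-effective-sample-size problem described above; the paper's DR estimator combines such an IS term with a model-based CDF estimate to reduce that variance.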
Related papers
- Data-driven decision-making under uncertainty with entropic risk measure [5.407319151576265]
The entropic risk measure is widely used in high-stakes decision making to account for tail risks associated with an uncertain loss.
To debias the empirical entropic risk estimator, we propose a strongly consistent bootstrapping procedure.
We show that cross validation methods can result in significantly higher out-of-sample risk for the insurer if the bias in validation performance is not corrected for.
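For reference, the entropic risk of a loss X at risk-aversion level theta > 0 is (1/theta) * log E[exp(theta * X)]. Below is a minimal sketch of the naive plug-in estimator whose small-sample bias the proposed bootstrapping procedure targets; the function name and the log-sum-exp implementation are illustrative assumptions, not the paper's debiased estimator.

```python
import numpy as np
from scipy.special import logsumexp

def empirical_entropic_risk(losses, theta):
    """Naive plug-in estimate of (1/theta) * log E[exp(theta * X)].

    Uses log-sum-exp for numerical stability. This simple estimator is known
    to be biased in small samples; the cited paper proposes a bootstrapping
    procedure to debias it (not implemented here).
    """
    losses = np.asarray(losses, dtype=float)
    return (logsumexp(theta * losses) - np.log(losses.size)) / theta
```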
arXiv Detail & Related papers (2024-09-30T04:02:52Z)
- Robust Risk-Sensitive Reinforcement Learning with Conditional Value-at-Risk [23.63388546004777]
We analyze the robustness of CVaR-based risk-sensitive RL under Robust Markov Decision Processes.
Motivated by the existence of decision-dependent uncertainty in real-world problems, we study problems with state-action-dependent ambiguity sets.
arXiv Detail & Related papers (2024-05-02T20:28:49Z)
- Provable Risk-Sensitive Distributional Reinforcement Learning with General Function Approximation [54.61816424792866]
We introduce a general framework on Risk-Sensitive Distributional Reinforcement Learning (RS-DisRL), with static Lipschitz Risk Measures (LRM) and general function approximation.
We design two innovative meta-algorithms: RS-DisRL-M, a model-based strategy for model-based function approximation, and RS-DisRL-V, a model-free approach for general value function approximation.
arXiv Detail & Related papers (2024-02-28T08:43:18Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
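As a generic illustration of the idea (not the specific equation proposed in the cited paper), an uncertainty Bellman equation propagates a local uncertainty term through a Bellman-style recursion, so the uncertainty at a state-action pair accumulates discounted uncertainty from its successors. The tabular fixed-point iteration below is a sketch under that assumed form.

```python
import numpy as np

def solve_uncertainty_bellman(P, pi, local_uncertainty, gamma=0.99, iters=1000):
    """Fixed-point iteration for a generic uncertainty Bellman recursion,
        U(s, a) = u(s, a) + gamma**2 * E_{s'~P, a'~pi}[U(s', a')],
    with transition array P of shape (S, A, S), policy pi of shape (S, A),
    and local uncertainty u of shape (S, A). Generic sketch only; not the
    cited paper's exact UBE.
    """
    U = np.zeros_like(np.asarray(local_uncertainty, dtype=float))
    for _ in range(iters):
        V_u = (pi * U).sum(axis=1)                       # expected uncertainty per state
        U = local_uncertainty + gamma ** 2 * (P @ V_u)   # back up through the dynamics
    return U
```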
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Safe Deployment for Counterfactual Learning to Rank with Exposure-Based Risk Minimization [63.93275508300137]
We introduce a novel risk-aware Counterfactual Learning To Rank method with theoretical guarantees for safe deployment.
Our experimental results demonstrate the efficacy of our proposed method, which is effective at avoiding initial periods of bad performance when little data is available.
arXiv Detail & Related papers (2023-04-26T15:54:23Z)
- Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation [54.72195809248172]
We present a new estimator leveraging our proposed novel concept, that involves retrospective reshuffling of participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z)
- Risk-Averse MDPs under Reward Ambiguity [9.929659318167731]
We propose a distributionally robust return-risk model for Markov decision processes (MDPs) under risk and reward ambiguity.
A scalable first-order algorithm is designed to solve large-scale problems.
arXiv Detail & Related papers (2023-01-03T11:06:30Z)
- Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning [59.02006924867438]
Off-policy evaluation and learning (OPE/L) use offline observational data to make better decisions, but are sensitive to distribution shifts between the logging and deployment environments.
Recent work proposed distributionally robust OPE/L (DROPE/L) to remedy this, but the proposal relies on inverse-propensity weighting.
We propose the first DR algorithms for DROPE/L with KL-divergence uncertainty sets.
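For context, the standard (non-distributionally-robust) doubly robust estimator for contextual-bandit OPE combines a reward-regression term with an inverse-propensity correction; the sketch below shows that textbook form under assumed inputs, not the KL-uncertainty-set algorithms proposed in the cited paper.

```python
import numpy as np

def doubly_robust_value(logged, action_space, target_policy, behavior_policy, reward_model):
    """Textbook doubly robust off-policy value estimate for a contextual bandit.

    `logged` is assumed to be a list of (context, action, reward) tuples; the
    policies map (context, action) to a probability and reward_model(context,
    action) predicts the expected reward. The estimate is consistent if either
    the behavior propensities or the reward model is well specified.
    """
    estimates = []
    for x, a, r in logged:
        # Direct-method term: model-based value of the target policy at x.
        dm = sum(target_policy(x, b) * reward_model(x, b) for b in action_space)
        # Importance-weighted correction of the model's error on the logged action.
        w = target_policy(x, a) / behavior_policy(x, a)
        estimates.append(dm + w * (r - reward_model(x, a)))
    return float(np.mean(estimates))
```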
arXiv Detail & Related papers (2022-02-19T20:00:44Z)
- Off-Policy Risk Assessment in Contextual Bandits [32.97618081988295]
We introduce the class of Lipschitz risk functionals, which subsumes many common functionals.
For Lipschitz risk functionals, the error in off-policy estimation is bounded by the error in off-policy estimation of the cumulative distribution function (CDF) of rewards.
We propose Off-Policy Risk Assessment (OPRA), an algorithm that estimates the target policy's CDF of rewards and generates a plug-in estimate of the risk.
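To make the plug-in step concrete: once a CDF estimate of rewards is available, many risk functionals can be read off it directly. The sketch below computes VaR and CVaR from a weighted empirical CDF represented as sample points with probabilities; the representation, names, and the simple discrete tail approximation are assumptions for illustration, not OPRA's implementation.

```python
import numpy as np

def plug_in_var_cvar(values, probs, alpha=0.95):
    """Plug-in VaR/CVaR at level alpha from a weighted empirical CDF given by
    sample points `values` with probabilities `probs` (treating larger values
    as losses; for rewards, apply to the negated samples). Simple discrete
    approximation that ignores the atom-splitting correction at the quantile.
    """
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    p = np.asarray(probs, dtype=float)[order]
    cdf = np.cumsum(p)
    k = min(int(np.searchsorted(cdf, alpha)), v.size - 1)  # smallest value with CDF >= alpha
    var = float(v[k])
    tail = p[k:]
    cvar = float((tail * v[k:]).sum() / tail.sum())         # mean of the upper tail
    return var, cvar
```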
arXiv Detail & Related papers (2021-04-18T23:27:40Z)
- Bias-Corrected Peaks-Over-Threshold Estimation of the CVaR [2.552459629685159]
Conditional value-at-risk (CVaR) is a useful risk measure in fields such as machine learning, finance, insurance, energy, etc.
When measuring very extreme risk, the commonly used CVaR estimation method of sample averaging does not work well.
To mitigate this problem, the CVaR can be estimated by extrapolating above a lower threshold than the VaR.
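A minimal sketch of the generic peaks-over-threshold (POT) idea behind this line of work, not the paper's bias-corrected estimator: fit a generalized Pareto distribution (GPD) to exceedances above a threshold below the VaR, then use the fitted tail in closed form. The threshold choice and the use of scipy's maximum-likelihood fit are illustrative assumptions.

```python
import numpy as np
from scipy.stats import genpareto

def pot_cvar(losses, alpha=0.99, threshold_quantile=0.9):
    """CVaR estimate via peaks-over-threshold extrapolation (generic sketch).

    Fits a GPD to exceedances above a threshold chosen below the VaR and uses
    the standard GPD closed forms for VaR and CVaR (valid for shape xi < 1).
    """
    losses = np.asarray(losses, dtype=float)
    u = np.quantile(losses, threshold_quantile)        # threshold below the VaR
    exceedances = losses[losses > u] - u
    zeta_u = exceedances.size / losses.size            # empirical P(X > u)
    xi, _, sigma = genpareto.fit(exceedances, floc=0)  # MLE with location fixed at 0
    var = u + (sigma / xi) * (((1 - alpha) / zeta_u) ** (-xi) - 1)
    cvar = var / (1 - xi) + (sigma - xi * u) / (1 - xi)
    return float(var), float(cvar)

def sample_average_cvar(losses, alpha=0.99):
    """Naive sample-average CVaR; unreliable when alpha is very close to 1
    because few samples fall beyond the VaR."""
    losses = np.asarray(losses, dtype=float)
    var = np.quantile(losses, alpha)
    return float(losses[losses >= var].mean())
```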
arXiv Detail & Related papers (2021-03-08T20:29:06Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.