Efficient and Sharp Off-Policy Evaluation in Robust Markov Decision Processes
- URL: http://arxiv.org/abs/2404.00099v1
- Date: Fri, 29 Mar 2024 18:11:49 GMT
- Title: Efficient and Sharp Off-Policy Evaluation in Robust Markov Decision Processes
- Authors: Andrew Bennett, Nathan Kallus, Miruna Oprescu, Wen Sun, Kaiwen Wang,
- Abstract summary: We study evaluating a policy under best- and worst-case perturbations to a Markov decision process (MDP)
This is an important problem when there is the possibility of a shift between historical and future environments.
We propose a perturbation model that can modify transition kernel densities up to a given multiplicative factor or its reciprocal.
- Score: 44.974100402600165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study evaluating a policy under best- and worst-case perturbations to a Markov decision process (MDP), given transition observations from the original MDP, whether under the same or different policy. This is an important problem when there is the possibility of a shift between historical and future environments, due to e.g. unmeasured confounding, distributional shift, or an adversarial environment. We propose a perturbation model that can modify transition kernel densities up to a given multiplicative factor or its reciprocal, which extends the classic marginal sensitivity model (MSM) for single time step decision making to infinite-horizon RL. We characterize the sharp bounds on policy value under this model, that is, the tightest possible bounds given by the transition observations from the original MDP, and we study the estimation of these bounds from such transition observations. We develop an estimator with several appealing guarantees: it is semiparametrically efficient, and remains so even when certain necessary nuisance functions such as worst-case Q-functions are estimated at slow nonparametric rates. It is also asymptotically normal, enabling easy statistical inference using Wald confidence intervals. In addition, when certain nuisances are estimated inconsistently we still estimate a valid, albeit possibly not sharp bounds on the policy value. We validate these properties in numeric simulations. The combination of accounting for environment shifts from train to test (robustness), being insensitive to nuisance-function estimation (orthogonality), and accounting for having only finite samples to learn from (inference) together leads to credible and reliable policy evaluation.
Related papers
- Deterministic Uncertainty Propagation for Improved Model-Based Offline Reinforcement Learning [12.490614705930676]
Current approaches to model-based offline Reinforcement Learning (RL) often incorporate uncertainty-based reward penalization.
We argue that this penalization introduces excessive conservatism, potentially resulting in suboptimal policies through underestimation.
We identify as an important cause of over-penalization the lack of a reliable uncertainty estimator capable of propagating uncertainties in the Bellman operator.
arXiv Detail & Related papers (2024-06-06T13:58:41Z) - Relaxed Quantile Regression: Prediction Intervals for Asymmetric Noise [51.87307904567702]
Quantile regression is a leading approach for obtaining such intervals via the empirical estimation of quantiles in the distribution of outputs.
We propose Relaxed Quantile Regression (RQR), a direct alternative to quantile regression based interval construction that removes this arbitrary constraint.
We demonstrate that this added flexibility results in intervals with an improvement in desirable qualities.
arXiv Detail & Related papers (2024-06-05T13:36:38Z) - Off-Policy Evaluation in Markov Decision Processes under Weak
Distributional Overlap [5.0401589279256065]
We re-visit the task of off-policy evaluation in Markov decision processes (MDPs) under a weaker notion of distributional overlap.
We introduce a class of truncated doubly robust (TDR) estimators which we find to perform well in this setting.
arXiv Detail & Related papers (2024-02-13T03:55:56Z) - Distributional Shift-Aware Off-Policy Interval Estimation: A Unified
Error Quantification Framework [8.572441599469597]
We study high-confidence off-policy evaluation in the context of infinite-horizon Markov decision processes.
The objective is to establish a confidence interval (CI) for the target policy value using only offline data pre-collected from unknown behavior policies.
We show that our algorithm is sample-efficient, error-robust, and provably convergent even in non-linear function approximation settings.
arXiv Detail & Related papers (2023-09-23T06:35:44Z) - The Implicit Delta Method [61.36121543728134]
In this paper, we propose an alternative, the implicit delta method, which works by infinitesimally regularizing the training loss of uncertainty.
We show that the change in the evaluation due to regularization is consistent for the variance of the evaluation estimator, even when the infinitesimal change is approximated by a finite difference.
arXiv Detail & Related papers (2022-11-11T19:34:17Z) - Robust Anytime Learning of Markov Decision Processes [8.799182983019557]
In data-driven applications, deriving precise probabilities from limited data introduces statistical errors.
Uncertain MDPs (uMDPs) do not require precise probabilities but instead use so-called uncertainty sets in the transitions.
We propose a robust anytime-learning approach that combines a dedicated Bayesian inference scheme with the computation of robust policies.
arXiv Detail & Related papers (2022-05-31T14:29:55Z) - Sampling-Based Robust Control of Autonomous Systems with Non-Gaussian
Noise [59.47042225257565]
We present a novel planning method that does not rely on any explicit representation of the noise distributions.
First, we abstract the continuous system into a discrete-state model that captures noise by probabilistic transitions between states.
We capture these bounds in the transition probability intervals of a so-called interval Markov decision process (iMDP)
arXiv Detail & Related papers (2021-10-25T06:18:55Z) - Near Optimality of Finite Memory Feedback Policies in Partially Observed
Markov Decision Processes [0.0]
We study a planning problem for POMDPs where the system dynamics and measurement channel model is assumed to be known.
We find optimal policies for the approximate belief model under mild non-linear filter stability conditions.
We also establish a rate of convergence result which relates the finite window memory size and the approximation error bound.
arXiv Detail & Related papers (2020-10-15T00:37:51Z) - GenDICE: Generalized Offline Estimation of Stationary Values [108.17309783125398]
We show that effective estimation can still be achieved in important applications.
Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions.
The resulting algorithm, GenDICE, is straightforward and effective.
arXiv Detail & Related papers (2020-02-21T00:27:52Z) - Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement
Learning [70.01650994156797]
Off- evaluation of sequential decision policies from observational data is necessary in batch reinforcement learning such as education healthcare.
We develop an approach that estimates the bounds of a given policy.
We prove convergence to the sharp bounds as we collect more confounded data.
arXiv Detail & Related papers (2020-02-11T16:18:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.