Probabilistic Offline Policy Ranking with Approximate Bayesian
Computation
- URL: http://arxiv.org/abs/2312.11551v1
- Date: Sun, 17 Dec 2023 05:22:44 GMT
- Title: Probabilistic Offline Policy Ranking with Approximate Bayesian
Computation
- Authors: Longchao Da, Porter Jenkins, Trevor Schwantes, Jeffrey Dotson, Hua Wei
- Abstract summary: It is essential to compare and rank candidate policies offline before real-world deployment for safety and reliability.
We present Probabilistic Offline Policy Ranking (POPR), a framework to address OPR problems.
POPR does not rely on value estimation, and the derived performance posterior can be used to distinguish candidates in worst, best, and average cases.
- Score: 4.919605764492689
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In practice, it is essential to compare and rank candidate policies offline
before real-world deployment for safety and reliability. Prior work seeks to
solve this offline policy ranking (OPR) problem through value-based methods,
such as off-policy evaluation (OPE). However, these methods fail to analyze
performance in special cases (e.g., worst or best cases) because they lack a
holistic characterization of policy performance. Estimating precise policy
values is even more difficult when rewards are sparse and not fully accessible.
In this paper, we present Probabilistic Offline Policy Ranking
(POPR), a framework to address OPR problems by leveraging expert data to
characterize the probability of a candidate policy behaving like experts, and
approximating its entire performance posterior distribution to help with
ranking. POPR does not rely on value estimation, and the derived performance
posterior can be used to distinguish candidates in worst-, best-, and
average-case settings. To estimate the posterior, we propose POPR-EABC, an
Energy-based Approximate Bayesian Computation (ABC) method that performs
likelihood-free inference. POPR-EABC reduces the heuristic nature of ABC with a
smooth energy function and improves sampling efficiency with a
pseudo-likelihood. We
empirically demonstrate that POPR-EABC is adequate for evaluating policies in
both discrete and continuous action spaces across various experiment
environments, and facilitates probabilistic comparisons of candidate policies
before deployment.
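As a rough illustration of the approach described in the abstract, the sketch below shows one minimal, hypothetical way an energy-based ABC loop could produce a performance posterior for a single candidate policy: the candidate's action probabilities on expert demonstrations are summarized by a smooth energy, turned into a pseudo-likelihood, and sampled with a simple Metropolis-Hastings loop. All names, the uniform prior, and the Gaussian-kernel pseudo-likelihood are illustrative assumptions, not the paper's exact POPR-EABC algorithm.

```python
import numpy as np

# Hypothetical sketch of energy-based ABC for offline policy ranking.
# theta in (0, 1) plays the role of "how expert-like the candidate is";
# its posterior is approximated via a pseudo-likelihood, not true rewards.

def energy(candidate_probs):
    """Smooth energy: low when the candidate assigns high probability to the
    actions the expert actually took (illustrative choice)."""
    return float(np.mean(1.0 - candidate_probs))

def pseudo_likelihood(theta, e, temperature=0.1):
    """Gaussian-kernel pseudo-likelihood linking theta to the observed energy."""
    return np.exp(-((1.0 - theta) - e) ** 2 / temperature)

def performance_posterior(candidate_probs_on_expert_actions, n_samples=5000,
                          proposal_std=0.05, seed=0):
    """Posterior samples of theta for one candidate, given its probabilities
    pi_candidate(a_expert | s_expert) on the expert demonstration data."""
    rng = np.random.default_rng(seed)
    e = energy(candidate_probs_on_expert_actions)
    theta, samples = 0.5, []
    for _ in range(n_samples):                      # simplified MH loop
        prop = np.clip(theta + proposal_std * rng.standard_normal(), 1e-3, 1 - 1e-3)
        if rng.random() < pseudo_likelihood(prop, e) / pseudo_likelihood(theta, e):
            theta = prop                            # uniform prior cancels out
        samples.append(theta)
    return np.asarray(samples)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic candidates: one mimics the expert closely, one does not.
    strong = performance_posterior(rng.uniform(0.6, 0.95, size=500), seed=1)
    weak = performance_posterior(rng.uniform(0.1, 0.50, size=500), seed=2)
    for name, post in [("strong", strong), ("weak", weak)]:
        # The mean ranks average-case behavior; low/high quantiles rank
        # worst-case / best-case behavior, as described in the abstract.
        print(name, post.mean(), np.quantile(post, 0.05), np.quantile(post, 0.95))
```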
Related papers
- OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators [13.408838970377035]
Offline policy evaluation (OPE) allows us to evaluate and estimate a new sequential decision-making policy's performance.
We propose a new algorithm that adaptively blends a set of OPE estimators given a dataset without relying on an explicit selection using a statistical procedure.
Our work contributes to improving ease of use for a general-purpose, estimator-agnostic, off-policy evaluation framework for offline RL.
arXiv Detail & Related papers (2024-05-27T23:51:20Z)
- Hyperparameter Optimization Can Even be Harmful in Off-Policy Learning and How to Deal with It [20.312864152544954]
We show that naively applying an unbiased estimator of the generalization performance as a surrogate objective in HPO can cause an unexpected failure.
We propose simple and computationally efficient corrections to the typical HPO procedure to deal with the aforementioned issues simultaneously.
arXiv Detail & Related papers (2024-04-23T14:34:16Z)
- Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of Neu (2015) yields performance guarantees that are superior in nearly all possible terms to all previous results.
arXiv Detail & Related papers (2023-09-27T16:42:10Z)
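Since the "implicit exploration" (IX) idea mentioned above amounts to a small change in the importance weights, here is a minimal sketch of an IX-style value estimate for logged contextual-bandit data. The variable names and the default smoothing parameter are illustrative assumptions; this is not the paper's exact estimator or its guarantees.

```python
import numpy as np

def ix_policy_value(probs_target, probs_behavior, rewards, gamma=0.1):
    """IX-style off-policy value estimate from logged bandit data.

    probs_target:   pi_target(a_i | x_i) for each logged action a_i
    probs_behavior: pi_behavior(a_i | x_i) for the same logged actions
    rewards:        observed rewards r_i
    gamma:          smoothing added to the denominator; gamma = 0 recovers
                    plain inverse-propensity weighting (illustrative default).
    """
    weights = probs_target / (probs_behavior + gamma)
    return float(np.mean(weights * rewards))

# Illustrative usage with synthetic logged data.
rng = np.random.default_rng(0)
p_b = rng.uniform(0.1, 0.9, size=1000)   # behavior propensities of logged actions
p_t = rng.uniform(0.0, 1.0, size=1000)   # target-policy probabilities of those actions
r = rng.binomial(1, 0.5, size=1000).astype(float)
print(ix_policy_value(p_t, p_b, r))
```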
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experimental results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Offline Policy Evaluation and Optimization under Confounding [35.778917456294046]
We map out the landscape of offline policy evaluation for confounded MDPs.
We characterize settings where consistent value estimates are provably not achievable.
We present new algorithms for offline policy improvement and prove local convergence guarantees.
arXiv Detail & Related papers (2022-11-29T20:45:08Z)
- Conformal Off-Policy Prediction in Contextual Bandits [54.67508891852636]
Conformal off-policy prediction can output reliable predictive intervals for the outcome under a new target policy.
We provide theoretical finite-sample guarantees without making any additional assumptions beyond the standard contextual bandit setup.
arXiv Detail & Related papers (2022-06-09T10:39:33Z)
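To make the notion of reliable predictive intervals above concrete, the following is a minimal sketch of standard split conformal prediction on a held-out calibration set. Adapting it to outcomes under a new target policy, as the paper does, requires additional reweighting that is omitted here; all names are illustrative assumptions.

```python
import numpy as np

def split_conformal_interval(cal_preds, cal_outcomes, test_pred, alpha=0.1):
    """Split conformal interval with finite-sample ~(1 - alpha) coverage.

    cal_preds / cal_outcomes: predictions and observed outcomes on a held-out
    calibration set; test_pred: the model's prediction for a new point.
    """
    scores = np.abs(cal_outcomes - cal_preds)               # nonconformity scores
    n = len(scores)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
    q = np.quantile(scores, q_level, method="higher")
    return test_pred - q, test_pred + q

# Illustrative usage with synthetic calibration data.
rng = np.random.default_rng(0)
y_cal = rng.normal(size=500)
yhat_cal = y_cal + rng.normal(scale=0.3, size=500)
print(split_conformal_interval(yhat_cal, y_cal, test_pred=1.2))
```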
- You May Not Need Ratio Clipping in PPO [117.03368180633463]
Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data.
Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples.
We show in this paper that such ratio clipping may not be a good option as it can fail to effectively bound the probability ratios.
We also show that the proposed alternative, ESPO, can be easily scaled up to distributed training with many workers, delivering strong performance as well.
arXiv Detail & Related papers (2022-01-31T20:26:56Z)
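For reference, the ratio clipping discussed in the entry above is PPO's clipped surrogate objective. The sketch below shows the standard per-sample clipped loss (with illustrative array names), not the alternative method proposed in the paper; as the summary notes, clipping the objective does not by itself keep the ratios bounded.

```python
import numpy as np

def ppo_clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current policy and the policy that collected the samples.
    advantages: estimated advantages for those actions.
    """
    rho = np.exp(logp_new - logp_old)                    # probability ratios
    unclipped = rho * advantages
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * advantages
    # Clipping affects the gradient, but rho itself can still drift far
    # outside [1 - eps, 1 + eps] over multiple optimization epochs.
    return float(np.mean(np.minimum(unclipped, clipped)))
```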
- Accelerated Policy Evaluation: Learning Adversarial Environments with Adaptive Importance Sampling [19.81658135871748]
A biased or inaccurate policy evaluation in a safety-critical system could potentially cause unexpected catastrophic failures.
We propose the Accelerated Policy Evaluation (APE) method, which simultaneously uncovers rare events and estimates the rare event probability.
APE is scalable to large discrete or continuous spaces by incorporating function approximators.
arXiv Detail & Related papers (2021-06-19T20:03:26Z)
- Offline Policy Selection under Uncertainty [113.57441913299868]
We consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset.
Access to the full distribution over one's belief of the policy value enables more flexible selection algorithms under a wider range of downstream evaluation metrics.
We show how the proposed BayesDICE method may be used to rank policies with respect to arbitrary downstream policy selection metrics.
arXiv Detail & Related papers (2020-12-12T23:09:21Z)
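The benefit of having a full belief distribution over policy values, as described in the entry above, is easy to illustrate: given posterior samples of each candidate's value, any downstream selection metric can be computed directly. The sketch below is a generic illustration of that idea (with assumed names and metrics), not the BayesDICE algorithm itself.

```python
import numpy as np

def rank_policies(posterior_samples, metric="mean"):
    """Rank candidate policies using posterior samples of their values.

    posterior_samples: dict mapping policy name -> 1-D array of sampled values
                       (equal length, aligned draws across policies).
    metric: "mean", "worst_case" (5% quantile), or "prob_best".
    """
    names = list(posterior_samples)
    if metric == "mean":
        scores = {k: float(np.mean(v)) for k, v in posterior_samples.items()}
    elif metric == "worst_case":
        scores = {k: float(np.quantile(v, 0.05)) for k, v in posterior_samples.items()}
    elif metric == "prob_best":
        draws = np.stack([posterior_samples[k] for k in names])
        best = np.argmax(draws, axis=0)      # which policy wins each draw
        scores = {k: float(np.mean(best == i)) for i, k in enumerate(names)}
    else:
        raise ValueError(f"unknown metric: {metric}")
    return sorted(names, key=lambda k: scores[k], reverse=True)

# Illustrative usage with synthetic posteriors.
rng = np.random.default_rng(0)
posteriors = {"pi_a": rng.normal(1.0, 0.2, 2000), "pi_b": rng.normal(0.9, 0.5, 2000)}
print(rank_policies(posteriors, metric="worst_case"))
```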
- Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification to the Bellman optimality and evaluation back-ups, making the update more conservative, can yield much stronger guarantees.
arXiv Detail & Related papers (2020-07-16T09:25:54Z)
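One simple way to realize the more conservative back-up mentioned above is to restrict the Bellman maximization to actions that are sufficiently supported in the batch data. The sketch below is a generic, illustrative version of that idea for tabular Q-learning; the support threshold, fallback value, and names are assumptions, not the paper's exact algorithm or guarantees.

```python
import numpy as np

def conservative_backup(q_values, rewards, next_states, behavior_counts,
                        gamma=0.99, min_count=10, pessimistic_value=0.0):
    """One step of a support-constrained (pessimistic) Bellman backup.

    q_values:        array [n_states, n_actions] of current Q estimates.
    behavior_counts: array [n_states, n_actions] counting how often each
                     state-action pair appears in the batch; poorly supported
                     actions are excluded from the max (illustrative mechanism).
    """
    targets = np.empty(len(rewards))
    for i, (r, s_next) in enumerate(zip(rewards, next_states)):
        supported = behavior_counts[s_next] >= min_count
        if supported.any():
            best = np.max(q_values[s_next][supported])
        else:
            best = pessimistic_value   # no well-supported action: be pessimistic
        targets[i] = r + gamma * best
    return targets
```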
- Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning [80.42316902296832]
We study the efficient off-policy evaluation of natural policies, which are defined in terms of deviations from the behavior policy.
This is a departure from the off-policy evaluation literature, where most work considers the evaluation of explicitly specified policies.
arXiv Detail & Related papers (2020-06-06T15:08:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.