Accelerated Policy Evaluation: Learning Adversarial Environments with
Adaptive Importance Sampling
- URL: http://arxiv.org/abs/2106.10566v1
- Date: Sat, 19 Jun 2021 20:03:26 GMT
- Title: Accelerated Policy Evaluation: Learning Adversarial Environments with
Adaptive Importance Sampling
- Authors: Mengdi Xu, Peide Huang, Fengpei Li, Jiacheng Zhu, Xuewei Qi, Kentaro
Oguchi, Zhiyuan Huang, Henry Lam, Ding Zhao
- Abstract summary: A biased or inaccurate policy evaluation in a safety-critical system could potentially cause unexpected catastrophic failures.
We propose the Accelerated Policy Evaluation (APE) method, which simultaneously uncovers rare events and estimates the rare event probability.
APE is scalable to large discrete or continuous spaces by incorporating function approximators.
- Score: 19.81658135871748
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The evaluation of rare but high-stakes events remains one of the main
difficulties in obtaining reliable policies from intelligent agents, especially
in large or continuous state/action spaces, where limited scalability forces a
prohibitively large number of testing iterations. On the other
hand, a biased or inaccurate policy evaluation in a safety-critical system
could potentially cause unexpected catastrophic failures during deployment. In
this paper, we propose the Accelerated Policy Evaluation (APE) method, which
simultaneously uncovers rare events and estimates the rare event probability in
Markov decision processes. APE treats the environment's nature as an
adversarial agent and, through adaptive importance sampling, learns toward the
zero-variance sampling distribution for policy evaluation. Moreover, APE scales
to large discrete or continuous spaces by incorporating function approximators.
We investigate the convergence properties of the proposed algorithms under
suitable regularity conditions. Our empirical studies show that APE estimates
the rare event probability with smaller variance while using orders of
magnitude fewer samples than baseline methods in both multi-agent and
single-agent environments.
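The mechanism underlying APE, sampling trajectories from a learned, failure-seeking proposal distribution and correcting the estimate with likelihood ratios, can be illustrated with a minimal adaptive importance sampling sketch. The Gaussian random-walk "environment", the tilting parameter, and the cross-entropy-style update below are toy assumptions for illustration, not the APE algorithm from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
HORIZON = 10           # steps per episode (toy assumption)
FAIL_THRESHOLD = 15.0  # an episode "fails" if the cumulative state exceeds this

def rollout_batch(shift, n):
    """Simulate n episodes with the environment noise tilted from N(0,1) to N(shift,1)."""
    noises = rng.normal(shift, 1.0, size=(n, HORIZON))
    finals = noises.sum(axis=1)
    # per-episode log likelihood ratio: nominal N(0,1) density over tilted N(shift,1) density
    log_lr = np.sum(-0.5 * noises**2 + 0.5 * (noises - shift) ** 2, axis=1)
    return finals, np.exp(log_lr), noises.mean(axis=1)

def adaptive_is_estimate(iterations=10, batch=5000, elite_frac=0.1):
    """Cross-entropy-style adaptive importance sampling for a rare failure event."""
    shift = 0.0  # start from the nominal (unbiased) environment
    for _ in range(iterations):
        finals, lr, mean_noise = rollout_batch(shift, batch)
        # intermediate level: the true threshold once reachable, else the elite quantile
        level = min(FAIL_THRESHOLD, np.quantile(finals, 1.0 - elite_frac))
        elite = finals >= level
        # likelihood-ratio-weighted update tilts the proposal toward failing behaviour
        shift = float(np.average(mean_noise[elite], weights=lr[elite]))
    finals, lr, _ = rollout_batch(shift, batch)
    p_hat = float(np.mean(lr * (finals > FAIL_THRESHOLD)))  # unbiased IS estimate
    return p_hat, shift

p_hat, shift = adaptive_is_estimate()
print(f"estimated failure probability ~ {p_hat:.2e} (final noise tilt {shift:.2f})")
```

Under the tilted proposal a large fraction of sampled episodes reach the failure region, whereas the nominal environment would produce roughly one failure per million rollouts, which is why the likelihood-ratio-corrected estimate has far lower variance at the same sample budget.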
Related papers
- Statistical Analysis of Policy Space Compression Problem [54.1754937830779]
Policy search methods are crucial in reinforcement learning, offering a framework to address continuous state-action and partially observable problems.
Reducing the policy space through policy compression emerges as a powerful, reward-free approach to accelerate the learning process.
This technique condenses the policy space into a smaller, representative set while maintaining most of the original effectiveness.
arXiv Detail & Related papers (2024-11-15T02:46:55Z)
- Certifiably Robust Policies for Uncertain Parametric Environments [57.2416302384766]
We propose a framework based on parametric Markov decision processes (MDPs) with unknown distributions over parameters.
We learn and analyse interval MDPs (IMDPs) for a set of unknown sample environments induced by parameters.
We show that our approach produces tight bounds on a policy's performance with high confidence.
arXiv Detail & Related papers (2024-08-06T10:48:15Z)
- Probabilistic Offline Policy Ranking with Approximate Bayesian Computation [4.919605764492689]
It is essential to compare and rank candidate policies offline before real-world deployment for safety and reliability.
We present Probabilistic Offline Policy Ranking (POPR), a framework to address offline policy ranking (OPR) problems.
POPR does not rely on value estimation, and the derived performance posterior can be used to distinguish candidates in worst-, best-, and average-case settings.
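For intuition, approximate Bayesian computation can turn a handful of logged episode returns into a posterior over a policy's performance without an explicit likelihood. The rejection sampler below is a generic ABC sketch on made-up return data (the uniform prior and Gaussian noise model are assumptions), not the POPR procedure itself.

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical normalized returns of a few logged episodes under one candidate policy
observed_returns = np.array([0.62, 0.71, 0.55, 0.68, 0.60])

def abc_performance_posterior(n_draws=20000, tol=0.03):
    """ABC rejection sampling: keep candidate mean returns whose simulated
    episodes match the observed summary statistic (the sample mean) within tol."""
    accepted = []
    for _ in range(n_draws):
        mu = rng.uniform(0.0, 1.0)                             # prior over mean return
        sim = rng.normal(mu, 0.1, size=observed_returns.size)  # assumed return noise model
        if abs(sim.mean() - observed_returns.mean()) < tol:
            accepted.append(mu)
    return np.array(accepted)

posterior = abc_performance_posterior()
# worst-, average- and best-case summaries of the performance posterior
print(np.quantile(posterior, [0.05, 0.5, 0.95]))
```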
arXiv Detail & Related papers (2023-12-17T05:22:44Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy.
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z)
- A Deep Reinforcement Learning Approach to Rare Event Estimation [30.670114229970526]
An important step in the design of autonomous systems is to evaluate the probability that a failure will occur.
In safety-critical domains, the failure probability is extremely small, so evaluating a policy through Monte Carlo sampling is inefficient.
We develop two adaptive importance sampling algorithms that can efficiently estimate the probability of rare events for sequential decision making systems.
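The inefficiency of crude Monte Carlo follows from its relative error: the standard error of a Bernoulli(p) frequency estimate is sqrt(p(1-p)/n), so reaching a given relative accuracy requires n on the order of 1/(p * rel_err^2). A quick back-of-the-envelope check, where the target probability and accuracy are illustrative:

```python
def mc_samples_needed(p, rel_err):
    """Rollouts needed so that a crude Monte Carlo estimate of a failure
    probability p has relative standard error rel_err."""
    return (1.0 - p) / (p * rel_err**2)

# e.g. a one-in-a-million failure estimated to 10% relative error
print(f"{mc_samples_needed(1e-6, 0.10):.2e} rollouts")  # ~1.0e+08
```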
arXiv Detail & Related papers (2022-11-22T18:29:14Z)
- A Risk-Sensitive Approach to Policy Optimization [21.684251937825234]
Standard deep reinforcement learning (DRL) aims to maximize expected reward, considering collected experiences equally in formulating a policy.
We propose a more direct approach whereby risk-sensitive objectives, specified in terms of the cumulative distribution function (CDF) of the distribution of full-episode rewards, are optimized.
We demonstrate that the use of moderately "pessimistic" risk profiles, which emphasize scenarios where the agent performs poorly, leads to enhanced exploration and a continual focus on addressing deficiencies.
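Such a CDF-based objective can be expressed as a rank-weighted average of full-episode returns, with a "pessimistic" profile putting more weight on the worst outcomes. The linear weight profile and the toy returns below are illustrative choices, not the parameterization used in the paper.

```python
import numpy as np

def cdf_weighted_objective(returns, pessimism=1.0):
    """Rank-weighted mean of episode returns.

    pessimism = 0 recovers the ordinary mean; larger values put more weight
    on the low (poorly performing) end of the return distribution.
    """
    returns = np.sort(np.asarray(returns, dtype=float))
    n = returns.size
    cdf = (np.arange(n) + 0.5) / n                 # empirical CDF position of each return
    weights = 1.0 + pessimism * (1.0 - 2.0 * cdf)  # linear profile, heavier on low returns
    weights = np.clip(weights, 0.0, None)
    return float(np.average(returns, weights=weights))

episode_returns = [3.0, -1.0, 0.5, 2.0, -4.0]      # hypothetical full-episode rewards
print(cdf_weighted_objective(episode_returns, pessimism=0.0))  # plain mean: 0.1
print(cdf_weighted_objective(episode_returns, pessimism=1.0))  # pessimistic, lower value
```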
arXiv Detail & Related papers (2022-08-19T00:55:05Z)
- Post-Contextual-Bandit Inference [57.88785630755165]
Contextual bandit algorithms are increasingly replacing non-adaptive A/B tests in e-commerce, healthcare, and policymaking.
They can both improve outcomes for study participants and increase the chance of identifying good or even the best policies.
To support credible inference on novel interventions at the end of the study, we still want to construct valid confidence intervals on average treatment effects, subgroup effects, or the value of new policies.
arXiv Detail & Related papers (2021-06-01T12:01:51Z)
- Quantifying Uncertainty in Deep Spatiotemporal Forecasting [67.77102283276409]
We describe two types of forecasting problems: regular grid-based and graph-based.
We analyze UQ methods from both the Bayesian and the frequentist points of view, casting them in a unified framework via statistical decision theory.
Through extensive experiments on real-world road network traffic, epidemics, and air quality forecasting tasks, we reveal the statistical-computational trade-offs for different UQ methods.
arXiv Detail & Related papers (2021-05-25T14:35:46Z)
- Minimax Off-Policy Evaluation for Multi-Armed Bandits [58.7013651350436]
We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards.
We develop minimax rate-optimal procedures under three settings.
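The basic estimator studied in this setting is importance-weighted off-policy evaluation, which reweights logged rewards by the ratio of target to logging probabilities. The sketch below uses made-up logging probabilities and Bernoulli rewards purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 3                                    # number of arms
logging_policy = np.array([0.5, 0.3, 0.2])
target_policy = np.array([0.1, 0.2, 0.7])
true_means = np.array([0.2, 0.5, 0.8])   # hypothetical bounded rewards in [0, 1]

# logged data collected under the logging policy
n = 10000
actions = rng.choice(K, size=n, p=logging_policy)
rewards = rng.binomial(1, true_means[actions])

# importance-weighted (IPW) off-policy estimate of the target policy's value
weights = target_policy[actions] / logging_policy[actions]
v_ipw = np.mean(weights * rewards)
print(f"IPW estimate {v_ipw:.3f} vs true value {np.dot(target_policy, true_means):.3f}")
```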
arXiv Detail & Related papers (2021-01-19T18:55:29Z)
- Conformal Inference of Counterfactuals and Individual Treatment Effects [6.810856082577402]
We propose a conformal inference-based approach that can produce reliable interval estimates for counterfactuals and individual treatment effects.
Existing methods suffer from a significant coverage deficit even in simple models.
arXiv Detail & Related papers (2020-06-11T01:03:32Z)
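The split conformal recipe underlying such approaches is simple: calibrate a quantile of held-out residuals and widen every prediction by that amount, which yields finite-sample coverage regardless of the fitted model. The sketch below applies the generic recipe to synthetic regression data; it is not the authors' counterfactual procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=500)

# split into a proper training set and a calibration set
X_tr, y_tr = X[:250], y[:250]
X_cal, y_cal = X[250:], y[250:]
beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)   # fit any predictive model

# conformal quantile of absolute calibration residuals (target 90% coverage)
alpha = 0.1
scores = np.abs(y_cal - X_cal @ beta)
level = np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores)
q = np.quantile(scores, level)

x_new = rng.normal(size=2)
pred = x_new @ beta
print(f"90% prediction interval: [{pred - q:.2f}, {pred + q:.2f}]")
```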
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.