Deep Reinforcement Learning at the Edge of the Statistical Precipice
- URL: http://arxiv.org/abs/2108.13264v1
- Date: Mon, 30 Aug 2021 14:23:48 GMT
- Title: Deep Reinforcement Learning at the Edge of the Statistical Precipice
- Authors: Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville,
Marc G. Bellemare
- Abstract summary: We argue that reliable evaluation in the few run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field.
We advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results.
- Score: 31.178451465925555
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep reinforcement learning (RL) algorithms are predominantly evaluated by
comparing their relative performance on a large suite of tasks. Most published
results on deep RL benchmarks compare point estimates of aggregate performance
such as mean and median scores across tasks, ignoring the statistical
uncertainty implied by the use of a finite number of training runs. Beginning
with the Arcade Learning Environment (ALE), the shift towards
computationally-demanding benchmarks has led to the practice of evaluating only
a small number of runs per task, exacerbating the statistical uncertainty in
point estimates. In this paper, we argue that reliable evaluation in the few
run deep RL regime cannot ignore the uncertainty in results without running the
risk of slowing down progress in the field. We illustrate this point using a
case study on the Atari 100k benchmark, where we find substantial discrepancies
between conclusions drawn from point estimates alone versus a more thorough
statistical analysis. With the aim of increasing the field's confidence in
reported results with a handful of runs, we advocate for reporting interval
estimates of aggregate performance and propose performance profiles to account
for the variability in results, as well as present more robust and efficient
aggregate metrics, such as interquartile mean scores, to achieve small
uncertainty in results. Using such statistical tools, we scrutinize performance
evaluations of existing algorithms on other widely used RL benchmarks including
the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies
in prior comparisons. Our findings call for a change in how we evaluate
performance in deep RL, for which we present a more rigorous evaluation
methodology, accompanied with an open-source library rliable, to prevent
unreliable results from stagnating the field.
Related papers
- Active Evaluation Acquisition for Efficient LLM Benchmarking [18.85604491151409]
We investigate strategies to improve evaluation efficiency by selecting a subset of examples from each benchmark using a learned policy.
Our approach models the dependencies across test examples, allowing accurate prediction of the evaluation outcomes for the remaining examples.
Empirical results demonstrate that our approach significantly reduces the number of evaluation prompts required.
arXiv Detail & Related papers (2024-10-08T12:08:46Z) - Assessing the Impact of Distribution Shift on Reinforcement Learning
Performance [0.0]
Reinforcement learning (RL) faces its own set of unique challenges.
Comparison of point estimates, and plots that show successful convergence to the optimal policy during training, may obfuscate overfitting or dependence on the experimental setup.
We propose a set of evaluation methods that measure the robustness of RL algorithms under distribution shifts.
arXiv Detail & Related papers (2024-02-05T23:50:55Z) - Efficient Benchmarking of Language Models [22.696230279151166]
We present the problem of Efficient Benchmarking, namely, intelligently reducing the costs of LM evaluation without compromising reliability.
Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability trade-off.
We propose an evaluation algorithm, that, when applied to the HELM benchmark, leads to dramatic cost savings with minimal loss of benchmark reliability.
arXiv Detail & Related papers (2023-08-22T17:59:30Z) - Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose a Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z) - Improved Policy Evaluation for Randomized Trials of Algorithmic Resource
Allocation [54.72195809248172]
We present a new estimator leveraging our proposed novel concept, that involves retrospective reshuffling of participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z) - Out-of-Distribution Detection with Hilbert-Schmidt Independence
Optimization [114.43504951058796]
Outlier detection tasks have been playing a critical role in AI safety.
Deep neural network classifiers usually tend to incorrectly classify out-of-distribution (OOD) inputs into in-distribution classes with high confidence.
We propose an alternative probabilistic paradigm that is both practically useful and theoretically viable for the OOD detection tasks.
arXiv Detail & Related papers (2022-09-26T15:59:55Z) - RIFLE: Imputation and Robust Inference from Low Order Marginals [10.082738539201804]
We develop a statistical inference framework for regression and classification in the presence of missing data without imputation.
Our framework, RIFLE, estimates low-order moments of the underlying data distribution with corresponding confidence intervals to learn a distributionally robust model.
Our experiments demonstrate that RIFLE outperforms other benchmark algorithms when the percentage of missing values is high and/or when the number of data points is relatively small.
arXiv Detail & Related papers (2021-09-01T23:17:30Z) - Doing Great at Estimating CATE? On the Neglected Assumptions in
Benchmark Comparisons of Treatment Effect Estimators [91.3755431537592]
We show that even in arguably the simplest setting, estimation under ignorability assumptions can be misleading.
We consider two popular machine learning benchmark datasets for evaluation of heterogeneous treatment effect estimators.
We highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others.
arXiv Detail & Related papers (2021-07-28T13:21:27Z) - Performance Evaluation of Adversarial Attacks: Discrepancies and
Solutions [51.8695223602729]
adversarial attack methods have been developed to challenge the robustness of machine learning models.
We propose a Piece-wise Sampling Curving (PSC) toolkit to effectively address the discrepancy.
PSC toolkit offers options for balancing the computational cost and evaluation effectiveness.
arXiv Detail & Related papers (2021-04-22T14:36:51Z) - CoinDICE: Off-Policy Confidence Interval Estimation [107.86876722777535]
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning.
We show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
arXiv Detail & Related papers (2020-10-22T12:39:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.