Empirical Design in Reinforcement Learning
- URL: http://arxiv.org/abs/2304.01315v2
- Date: Tue, 29 Oct 2024 17:44:21 GMT
- Title: Empirical Design in Reinforcement Learning
- Authors: Andrew Patterson, Samuel Neumann, Martha White, Adam White
- Abstract summary: It is now common to benchmark agents with millions of parameters against dozens of tasks, each using the equivalent of 30 days of experience.
The scale of these experiments often conflicts with the need for proper statistical evidence, especially when comparing algorithms.
This manuscript represents both a call to action and a comprehensive resource for how to do good experiments in reinforcement learning.
- Score: 23.873958977534993
- Abstract: Empirical design in reinforcement learning is no small task. Running good experiments requires attention to detail and at times significant computational resources. While compute resources available per dollar have continued to grow rapidly, so has the scale of typical experiments in reinforcement learning. It is now common to benchmark agents with millions of parameters against dozens of tasks, each using the equivalent of 30 days of experience. The scale of these experiments often conflicts with the need for proper statistical evidence, especially when comparing algorithms. Recent studies have highlighted how popular algorithms are sensitive to hyper-parameter settings and implementation details, and that common empirical practice leads to weak statistical evidence (Machado et al., 2018; Henderson et al., 2018). Here we take this one step further. This manuscript represents both a call to action and a comprehensive resource for how to do good experiments in reinforcement learning. In particular, we cover: the statistical assumptions underlying common performance measures, how to properly characterize performance variation and stability, hypothesis testing, special considerations for comparing multiple agents, baseline and illustrative example construction, and how to deal with hyper-parameters and experimenter bias. Throughout, we highlight common mistakes found in the literature and the statistical consequences of those mistakes in example experiments. The objective of this document is to provide answers on how we can use our unprecedented compute to do good science in reinforcement learning, and how to stay alert to potential pitfalls in our empirical design.
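To make the abstract's point about statistical evidence concrete, the sketch below (not code from the manuscript; all returns are placeholder data) compares two hypothetical agents across independent seeds using a percentile-bootstrap confidence interval on their difference in mean return.

```python
# Minimal sketch, not from the paper: comparing two agents across random seeds
# with a percentile-bootstrap confidence interval on the difference in mean
# return. The arrays below are placeholders for per-seed results you would
# collect from your own experiments.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical final returns for 30 independent runs (seeds) of each agent.
agent_a = rng.normal(loc=200.0, scale=40.0, size=30)
agent_b = rng.normal(loc=215.0, scale=40.0, size=30)

def bootstrap_ci(x, y, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for the difference in mean performance (y - x)."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        xs = rng.choice(x, size=x.size, replace=True)
        ys = rng.choice(y, size=y.size, replace=True)
        diffs[i] = ys.mean() - xs.mean()
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

lo, hi = bootstrap_ci(agent_a, agent_b)
print(f"Mean difference (B - A): {agent_b.mean() - agent_a.mean():.1f}")
print(f"95% bootstrap CI: [{lo:.1f}, {hi:.1f}]")
# If the interval excludes zero, the runs provide evidence of a real difference;
# if it straddles zero, more runs (not more tuning) are the honest next step.
```

With only a handful of seeds the interval is typically wide, which is exactly the situation in which single-run comparisons are most misleading.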
Related papers
- Towards Explainable Test Case Prioritisation with Learning-to-Rank Models [6.289767078502329]
Test case prioritisation (TCP) is a critical task in regression testing to ensure quality as software evolves.
We present and discuss scenarios that require different explanations and how the particularities of TCP could influence them.
arXiv Detail & Related papers (2024-05-22T16:11:45Z) - TESSERACT: Eliminating Experimental Bias in Malware Classification
- TESSERACT: Eliminating Experimental Bias in Malware Classification across Space and Time (Extended Version) [18.146377453918724]
Malware detectors often experience performance decay due to constantly evolving operating systems and attack methods.
This paper argues that commonly reported results are inflated due to two pervasive sources of experimental bias in the detection task.
arXiv Detail & Related papers (2024-02-02T12:27:32Z) - ASPEST: Bridging the Gap Between Active Learning and Selective
- ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z) - MetaKernel: Learning Variational Random Features with Limited Labels [120.90737681252594]
Few-shot learning deals with the fundamental and challenging problem of learning from a few annotated samples, while being able to generalize well on new tasks.
We propose meta-learning kernels with random Fourier features for few-shot learning, which we call MetaKernel.
arXiv Detail & Related papers (2021-05-08T21:24:09Z) - Demystification of Few-shot and One-shot Learning [63.58514532659252]
- Demystification of Few-shot and One-shot Learning [63.58514532659252]
Few-shot and one-shot learning have been the subject of active and intensive research in recent years.
We show that if the ambient or latent decision space of a learning machine is sufficiently high-dimensional, then a large class of objects in this space can indeed be easily learned from few examples.
arXiv Detail & Related papers (2021-04-25T14:47:05Z) - Challenges in Statistical Analysis of Data Collected by a Bandit
- Challenges in Statistical Analysis of Data Collected by a Bandit Algorithm: An Empirical Exploration in Applications to Adaptively Randomized Experiments [11.464963616709671]
Multi-armed bandit algorithms have been argued for decades to be useful for adaptively randomized experiments.
We applied the bandit algorithm Thompson Sampling (TS) to run adaptive experiments in three university classes.
We show that collecting data with TS can as much as double the False Positive Rate (FPR) and the False Negative Rate (FNR).
arXiv Detail & Related papers (2021-03-22T22:05:18Z) - Knowledge-driven Data Construction for Zero-shot Evaluation in
- Knowledge-driven Data Construction for Zero-shot Evaluation in Commonsense Question Answering [80.60605604261416]
We propose a novel neuro-symbolic framework for zero-shot question answering across commonsense tasks.
We vary the set of language models, training regimes, knowledge sources, and data generation strategies, and measure their impact across tasks.
We show that, while an individual knowledge graph is better suited for specific tasks, a global knowledge graph brings consistent gains across different tasks.
arXiv Detail & Related papers (2020-11-07T22:52:21Z) - With Little Power Comes Great Responsibility [54.96675741328462]
Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements.
Small test sets mean that most attempted comparisons to state-of-the-art models will not be adequately powered.
For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point.
arXiv Detail & Related papers (2020-10-13T18:00:02Z) - What Neural Networks Memorize and Why: Discovering the Long Tail via
- What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation [37.5845376458136]
Deep learning algorithms are well-known to have a propensity for fitting the training data very well.
Such fitting requires memorization of training data labels.
We propose a theoretical explanation for this phenomenon based on a combination of two insights.
arXiv Detail & Related papers (2020-08-09T10:12:28Z) - Showing Your Work Doesn't Always Work [73.63200097493576]
"Show Your Work: Improved Reporting of Experimental Results" advocates for reporting the expected validation effectiveness of the best-tuned model.
We analytically show that their estimator is biased and uses error-prone assumptions.
We derive an unbiased alternative and bolster our claims with empirical evidence from statistical simulation.
arXiv Detail & Related papers (2020-04-28T17:59:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.