AdaStop: adaptive statistical testing for sound comparisons of Deep RL agents
- URL: http://arxiv.org/abs/2306.10882v3
- Date: Thu, 12 Dec 2024 09:03:39 GMT
- Title: AdaStop: adaptive statistical testing for sound comparisons of Deep RL agents
- Authors: Timothée Mathieu, Riccardo Della Vecchia, Alena Shilova, Matheus Medeiros Centa, Hector Kohler, Odalric-Ambrym Maillard, Philippe Preux,
- Abstract summary: We propose a theoretically sound methodology for comparing the performance of a set of algorithms.
AdaStop is a new statistical test based on multiple group sequential tests.
- Score: 17.481638913280403
- License:
- Abstract: Recently, the scientific community has questioned the statistical reproducibility of many empirical results, especially in the field of machine learning. To contribute to the resolution of this reproducibility crisis, we propose a theoretically sound methodology for comparing the performance of a set of algorithms. We exemplify our methodology in Deep Reinforcement Learning (Deep RL). The performance of one execution of a Deep RL algorithm is a random variable. Therefore, several independent executions are needed to evaluate its performance. When comparing algorithms with random performance, a major question concerns the number of executions to perform to ensure that the result of the comparison is theoretically sound. Researchers in Deep RL often use less than 5 independent executions to compare algorithms: we claim that this is not enough in general. Moreover, when comparing more than 2 algorithms at once, we have to use a multiple tests procedure to preserve low error guarantees. We introduce AdaStop, a new statistical test based on multiple group sequential tests. When used to compare algorithms, AdaStop adapts the number of executions to stop as early as possible while ensuring that enough information has been collected to distinguish algorithms that have different score distributions. We prove theoretically that AdaStop has a low probability of making a (family-wise) error. We illustrate the effectiveness of AdaStop in various use-cases, including toy examples and Deep RL algorithms on challenging Mujoco environments. AdaStop is the first statistical test fitted to this sort of comparisons: it is both a significant contribution to statistics, and an important contribution to computational studies performed in reinforcement learning and in other domains.
Related papers
- Scaling Test-Time Compute Without Verification or RL is Suboptimal [70.28430200655919]
We show that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget.
We corroborate our theory empirically on both didactic and math reasoning problems with 3/8B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
arXiv Detail & Related papers (2025-02-17T18:43:24Z) - Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models [70.07661254213181]
We propose two principled algorithms for the test-time compute of large language models.
We prove theoretically that the failure probability of one algorithm decays to zero exponentially as its test-time compute grows.
arXiv Detail & Related papers (2024-11-29T05:29:47Z) - Batch Bayesian Optimization for Replicable Experimental Design [56.64902148159355]
Many real-world design problems evaluate multiple experimental conditions in parallel and replicate each condition multiple times due to large and heteroscedastic observation noise.
We propose the Batch Thompson Sampling for Replicable Experimental Design framework, which encompasses three algorithms.
We show the effectiveness of our algorithms in two practical real-world applications: precision agriculture and AutoML.
arXiv Detail & Related papers (2023-11-02T12:46:03Z) - Toward Cost-effective Adaptive Random Testing: An Approximate Nearest Neighbor Approach [6.431858417308008]
Adaptive Random Testing (ART) enhances the testing effectiveness (including fault-detection capability) of Random Testing (RT)
Many ART algorithms have been investigated such as Fixed-Size-Candidate-Set ART (FSCS) and Restricted Random Testing (RRT)
arXiv Detail & Related papers (2023-05-27T15:37:13Z) - Machine Learning for Online Algorithm Selection under Censored Feedback [71.6879432974126]
In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select a presumably best algorithm from a fixed set of candidate algorithms.
For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime.
In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem.
We adapt them towards runtime-oriented losses, allowing for partially censored data while keeping a space- and time-complexity independent of the time horizon.
arXiv Detail & Related papers (2021-09-13T18:10:52Z) - A Fast PC Algorithm with Reversed-order Pruning and A Parallelization
Strategy [22.31288740171446]
The PC algorithm is the state-of-the-art algorithm for causal structure discovery on observational data.
It can be computationally expensive in the worst case due to the conditional independence tests are performed.
This makes the algorithm computationally intractable when the task contains several hundred or thousand nodes.
We propose a critical observation that the conditional set rendering two nodes independent is non-unique, and including certain redundant nodes do not sacrifice result accuracy.
arXiv Detail & Related papers (2021-09-10T02:22:10Z) - Deep Reinforcement Learning at the Edge of the Statistical Precipice [31.178451465925555]
We argue that reliable evaluation in the few run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field.
We advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results.
arXiv Detail & Related papers (2021-08-30T14:23:48Z) - A Pragmatic Look at Deep Imitation Learning [0.3626013617212666]
We re-implement 6 different adversarial imitation learning algorithms.
We evaluate them on a widely-used expert trajectory dataset.
GAIL consistently performs well across a range of sample sizes.
arXiv Detail & Related papers (2021-08-04T06:33:10Z) - Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z) - Resource Allocation in Multi-armed Bandit Exploration: Overcoming
Sublinear Scaling with Adaptive Parallelism [107.48538091418412]
We study exploration in multi-armed bandits when we have access to a divisible resource that can be allocated in varying amounts to arm pulls.
We focus in particular on the allocation of distributed computing resources, where we may obtain results faster by allocating more resources per pull.
arXiv Detail & Related papers (2020-10-31T18:19:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.