Related papers: AdaStop: adaptive statistical testing for sound comparisons of Deep RL agents

AdaStop: adaptive statistical testing for sound comparisons of Deep RL agents

URL: http://arxiv.org/abs/2306.10882v3
Date: Thu, 12 Dec 2024 09:03:39 GMT
Title: AdaStop: adaptive statistical testing for sound comparisons of Deep RL agents
Authors: Timothée Mathieu, Riccardo Della Vecchia, Alena Shilova, Matheus Medeiros Centa, Hector Kohler, Odalric-Ambrym Maillard, Philippe Preux,
Abstract summary: We propose a theoretically sound methodology for comparing the performance of a set of algorithms.<n>AdaStop is a new statistical test based on multiple group sequential tests.
Score: 17.481638913280403
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, the scientific community has questioned the statistical reproducibility of many empirical results, especially in the field of machine learning. To contribute to the resolution of this reproducibility crisis, we propose a theoretically sound methodology for comparing the performance of a set of algorithms. We exemplify our methodology in Deep Reinforcement Learning (Deep RL). The performance of one execution of a Deep RL algorithm is a random variable. Therefore, several independent executions are needed to evaluate its performance. When comparing algorithms with random performance, a major question concerns the number of executions to perform to ensure that the result of the comparison is theoretically sound. Researchers in Deep RL often use less than 5 independent executions to compare algorithms: we claim that this is not enough in general. Moreover, when comparing more than 2 algorithms at once, we have to use a multiple tests procedure to preserve low error guarantees. We introduce AdaStop, a new statistical test based on multiple group sequential tests. When used to compare algorithms, AdaStop adapts the number of executions to stop as early as possible while ensuring that enough information has been collected to distinguish algorithms that have different score distributions. We prove theoretically that AdaStop has a low probability of making a (family-wise) error. We illustrate the effectiveness of AdaStop in various use-cases, including toy examples and Deep RL algorithms on challenging Mujoco environments. AdaStop is the first statistical test fitted to this sort of comparisons: it is both a significant contribution to statistics, and an important contribution to computational studies performed in reinforcement learning and in other domains.

Related papers

Is Your Imitation Learning Policy Better than Mine? Policy Comparison with Near-Optimal Stopping [17.222170618610594]
This paper proposes a novel statistical framework for rigorously comparing two policies in the small sample size regime. We show that our test achieves near-optimal stopping, letting researchers stop evaluation and make a decision in a near-minimal number of trials.
arXiv Detail & Related papers (2025-03-14T00:21:48Z)
Scaling Test-Time Compute Without Verification or RL is Suboptimal [70.28430200655919]
We show that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. We corroborate our theory empirically on both didactic and math reasoning problems with 3/8B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
arXiv Detail & Related papers (2025-02-17T18:43:24Z)
Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models [70.07661254213181]
We propose two principled algorithms for the test-time compute of large language models. We prove theoretically that the failure probability of one algorithm decays to zero exponentially as its test-time compute grows.
arXiv Detail & Related papers (2024-11-29T05:29:47Z)
Batch Bayesian Optimization for Replicable Experimental Design [56.64902148159355]
Many real-world design problems evaluate multiple experimental conditions in parallel and replicate each condition multiple times due to large and heteroscedastic observation noise. We propose the Batch Thompson Sampling for Replicable Experimental Design framework, which encompasses three algorithms. We show the effectiveness of our algorithms in two practical real-world applications: precision agriculture and AutoML.
arXiv Detail & Related papers (2023-11-02T12:46:03Z)
Toward Cost-effective Adaptive Random Testing: An Approximate Nearest Neighbor Approach [6.431858417308008]
Adaptive Random Testing (ART) enhances the testing effectiveness (including fault-detection capability) of Random Testing (RT) Many ART algorithms have been investigated such as Fixed-Size-Candidate-Set ART (FSCS) and Restricted Random Testing (RRT)
arXiv Detail & Related papers (2023-05-27T15:37:13Z)
Sequential Kernelized Independence Testing [101.22966794822084]
We design sequential kernelized independence tests inspired by kernelized dependence measures. We demonstrate the power of our approaches on both simulated and real data.
arXiv Detail & Related papers (2022-12-14T18:08:42Z)
Reliable validation of Reinforcement Learning Benchmarks [1.2031796234206134]
Reinforcement Learning (RL) is one of the most dynamic research areas in Game AI and AI as a whole. There are numerous benchmark environments whose scores are used to compare different algorithms, such as Atari. We propose improving this situation by providing access to the original experimental data to validate study results.
arXiv Detail & Related papers (2022-03-02T12:55:27Z)
Ranking with Confidence for Large Scale Comparison Data [1.2183405753834562]
In this work, we leverage a generative data model considering comparison noise to develop a fast, precise, and informative ranking from pairwise comparisons. In real data, PD-Rank requires less computational time to achieve the same Kendall algorithm than active learning methods.
arXiv Detail & Related papers (2022-02-03T16:36:37Z)
Using Sequential Statistical Tests to Improve the Performance of Random Search in hyperparameter Tuning [0.0]
Hyperparamter tuning is one of the the most time-consuming parts in machine learning. We propose a sequential testing procedure to minimize the number of resampling iterations to detect inferior parameter setting.
arXiv Detail & Related papers (2021-12-23T10:02:04Z)
Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization [50.725191156128645]
Multi-armed bandit algorithms like Thompson Sampling can be used to conduct adaptive experiments. We present simulations for 2-arm experiments that explore two algorithms that combine the benefits of uniform randomization for statistical analysis.
arXiv Detail & Related papers (2021-12-15T22:11:58Z)
Machine Learning for Online Algorithm Selection under Censored Feedback [71.6879432974126]
In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select a presumably best algorithm from a fixed set of candidate algorithms. For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime. In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem. We adapt them towards runtime-oriented losses, allowing for partially censored data while keeping a space- and time-complexity independent of the time horizon.
arXiv Detail & Related papers (2021-09-13T18:10:52Z)
A Fast PC Algorithm with Reversed-order Pruning and A Parallelization Strategy [22.31288740171446]
The PC algorithm is the state-of-the-art algorithm for causal structure discovery on observational data. It can be computationally expensive in the worst case due to the conditional independence tests are performed. This makes the algorithm computationally intractable when the task contains several hundred or thousand nodes. We propose a critical observation that the conditional set rendering two nodes independent is non-unique, and including certain redundant nodes do not sacrifice result accuracy.
arXiv Detail & Related papers (2021-09-10T02:22:10Z)
Robust Predictable Control [149.71263296079388]
We show that our method achieves much tighter compression than prior methods, achieving up to 5x higher reward than a standard information bottleneck. We also demonstrate that our method learns policies that are more robust and generalize better to new tasks.
arXiv Detail & Related papers (2021-09-07T17:29:34Z)
Deep Reinforcement Learning at the Edge of the Statistical Precipice [31.178451465925555]
We argue that reliable evaluation in the few run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results.
arXiv Detail & Related papers (2021-08-30T14:23:48Z)
A Pragmatic Look at Deep Imitation Learning [0.3626013617212666]
We re-implement 6 different adversarial imitation learning algorithms. We evaluate them on a widely-used expert trajectory dataset. GAIL consistently performs well across a range of sample sizes.
arXiv Detail & Related papers (2021-08-04T06:33:10Z)
Efficient First-Order Contextual Bandits: Prediction, Allocation, and Triangular Discrimination [82.52105963476703]
A recurring theme in statistical learning, online learning, and beyond is that faster convergence rates are possible for problems with low noise. First-order guarantees are relatively well understood in statistical and online learning. We show that the logarithmic loss and an information-theoretic quantity called the triangular discrimination play a fundamental role in obtaining first-order guarantees.
arXiv Detail & Related papers (2021-07-05T19:20:34Z)
Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment. Policy gradients for local search are often obtained from random perturbations. We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z)
DPER: Efficient Parameter Estimation for Randomly Missing Data [0.24466725954625884]
We propose novel algorithms to find the maximum likelihood estimates (MLEs) for a one-class/multiple-class randomly missing data set. Our algorithms do not require multiple iterations through the data, thus promising to be less time-consuming than other methods.
arXiv Detail & Related papers (2021-06-06T16:37:48Z)
Resource Allocation in Multi-armed Bandit Exploration: Overcoming Sublinear Scaling with Adaptive Parallelism [107.48538091418412]
We study exploration in multi-armed bandits when we have access to a divisible resource that can be allocated in varying amounts to arm pulls. We focus in particular on the allocation of distributed computing resources, where we may obtain results faster by allocating more resources per pull.
arXiv Detail & Related papers (2020-10-31T18:19:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.