Position: Benchmarking is Limited in Reinforcement Learning Research
- URL: http://arxiv.org/abs/2406.16241v1
- Date: Sun, 23 Jun 2024 23:36:26 GMT
- Title: Position: Benchmarking is Limited in Reinforcement Learning Research
- Authors: Scott M. Jordan, Adam White, Bruno Castro da Silva, Martha White, Philip S. Thomas
- Abstract summary: This work investigates the sources of increased computation costs in rigorous experiment designs.
We argue for using an additional experimentation paradigm to overcome the limitations of benchmarking.
- Score: 33.596940437995904
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Novel reinforcement learning algorithms, or improvements on existing ones, are commonly justified by evaluating their performance on benchmark environments and are compared to an ever-changing set of standard algorithms. However, despite numerous calls for improvements, experimental practices continue to produce misleading or unsupported claims. One reason for the ongoing substandard practices is that conducting rigorous benchmarking experiments requires substantial computational time. This work investigates the sources of increased computation costs in rigorous experiment designs. We show that conducting rigorous performance benchmarks will likely have computational costs that are often prohibitive. As a result, we argue for using an additional experimentation paradigm to overcome the limitations of benchmarking.
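To make the cost argument concrete, here is a rough back-of-the-envelope sketch (not a calculation from the paper): it multiplies out a hypothetical benchmarking grid and shows how confidence-interval width scales with the number of runs. Every count and run time below is an assumed placeholder, chosen only to show how quickly the combinatorics grow.

```python
# Back-of-the-envelope sketch of why rigorous RL benchmarking gets expensive.
# All counts and run times are hypothetical placeholders, not figures from the
# paper; the point is only how multiplicatively the cost grows.
import math

n_algorithms = 8      # hypothetical: algorithms being compared
n_environments = 10   # hypothetical: benchmark environments
n_hyper_configs = 30  # hypothetical: hyperparameter settings per (algorithm, environment)
n_seeds = 30          # hypothetical: independent runs per configuration
hours_per_run = 2.0   # hypothetical: wall-clock hours for one training run

total_runs = n_algorithms * n_environments * n_hyper_configs * n_seeds
total_hours = total_runs * hours_per_run
print(f"runs: {total_runs:,}, compute: {total_hours:,.0f} device-hours "
      f"(~{total_hours / 24 / 365:.1f} device-years)")

# Tightening conclusions is worse than linear in effort: the standard error of
# a mean performance estimate shrinks as 1/sqrt(n_seeds), so halving the width
# of a confidence interval requires roughly 4x as many runs.
for target_fraction in (1.0, 0.5, 0.25):
    needed = math.ceil(n_seeds / target_fraction**2)
    print(f"CI width x{target_fraction}: ~{needed} seeds per configuration")
```

Even a modest grid of algorithms, environments, hyperparameter settings, and seeds multiplies into device-years of compute, and the 1/sqrt(n) scaling of the standard error means that tightening conclusions by a factor of two roughly quadruples the required runs.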
Related papers
- A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility [29.437125712259046]
Reasoning has emerged as the next major frontier for language models (LMs).
We conduct a comprehensive empirical study and find that current mathematical reasoning benchmarks are highly sensitive to subtle implementation choices.
We propose a standardized evaluation framework with clearly defined best practices and reporting standards.
arXiv Detail & Related papers (2025-04-09T17:58:17Z) - On the Mistaken Assumption of Interchangeable Deep Reinforcement Learning Implementations [53.0667196725616]
Deep Reinforcement Learning (DRL) is a paradigm of artificial intelligence where an agent uses a neural network to learn which actions to take in a given environment.
DRL has recently gained traction for its ability to solve complex environments such as driving simulators, 3D robotic control, and multiplayer online battle arena video games.
Numerous implementations of the state-of-the-art algorithms responsible for training these agents, like the Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) algorithms, currently exist.
arXiv Detail & Related papers (2025-03-28T16:25:06Z) - Reliable and Efficient Amortized Model-based Evaluation [57.6469531082784]
The average score across a wide range of benchmarks provides a signal that helps guide the use of language models in practice.
A popular attempt to lower the cost is to compute the average score on a subset of the benchmark.
This approach often renders an unreliable measure of LM performance because the average score is often confounded with the difficulty of the questions in the benchmark subset.
We train a model that predicts question difficulty from its content, enabling a reliable measurement at a fraction of the cost (an illustrative sketch of this difficulty confound appears after this list).
arXiv Detail & Related papers (2025-03-17T16:15:02Z) - AExGym: Benchmarks and Environments for Adaptive Experimentation [7.948144726705323]
We present a benchmark for adaptive experimentation based on real-world datasets.
We highlight prominent practical challenges to operationalizing adaptivity: non-stationarity, batched/delayed feedback, multiple outcomes and objectives, and external validity.
arXiv Detail & Related papers (2024-08-08T15:32:12Z) - Adaptive Experimentation When You Can't Experiment [55.86593195947978]
This paper introduces the confounded pure exploration transductive linear bandit (CPET-LB) problem.
Online services can employ a properly randomized encouragement that incentivizes users toward a specific treatment.
arXiv Detail & Related papers (2024-06-15T20:54:48Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess model performance, a typical approach is to construct evaluation benchmarks that measure the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - Efficient Benchmarking of Language Models [22.696230279151166]
We present the problem of Efficient Benchmarking, namely, intelligently reducing the costs of LM evaluation without compromising reliability.
Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability trade-off.
We propose an evaluation algorithm that, when applied to the HELM benchmark, leads to dramatic cost savings with minimal loss of benchmark reliability.
arXiv Detail & Related papers (2023-08-22T17:59:30Z) - Re-Benchmarking Pool-Based Active Learning for Binary Classification [27.034593234956713]
Active learning is a paradigm that significantly enhances the performance of machine learning models when labeled data is costly to acquire.
While several benchmarks exist for evaluating active learning strategies, their findings exhibit some misalignment.
This discrepancy motivates us to develop a transparent and reproducible benchmark for the community.
arXiv Detail & Related papers (2023-06-15T08:47:50Z) - Task-specific experimental design for treatment effect estimation [59.879567967089145]
Large randomised controlled trials (RCTs) are the standard for causal inference.
Recent work has proposed more sample-efficient alternatives to RCTs, but these are not adaptable to the downstream application for which the causal effect is sought.
We develop a task-specific approach to experimental design and derive sampling strategies customised to particular downstream applications.
arXiv Detail & Related papers (2023-06-08T18:10:37Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z) - Adaptive Experimentation at Scale: A Computational Framework for Flexible Batches [7.390918770007728]
Motivated by practical instances involving a handful of reallocations in which outcomes are measured in batches, we develop an adaptive experimentation framework.
Our main observation is that normal approximations, which are universal in statistical inference, can also guide the design of adaptive algorithms.
arXiv Detail & Related papers (2023-03-21T04:17:03Z) - Benchopt: Reproducible, efficient and collaborative optimization benchmarks [67.29240500171532]
Benchopt is a framework to automate, reproduce and publish optimization benchmarks in machine learning.
Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing and extending experiments.
arXiv Detail & Related papers (2022-06-27T16:19:24Z)
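Regarding the "Reliable and Efficient Amortized Model-based Evaluation" entry above, the following is a minimal, self-contained sketch of the difficulty confound it describes, assuming a simulated Rasch (1-parameter IRT) setup with hypothetical numbers; it is not the paper's method, which instead trains a predictor of question difficulty from question content.

```python
# Illustrative sketch (not the cited paper's implementation) of why a raw
# average on a benchmark subset confounds model ability with item difficulty,
# and how a difficulty-aware (Rasch/IRT-style) estimate corrects for it.
# All data are simulated; all parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

n_items = 2000
difficulty = rng.normal(0.0, 1.0, n_items)  # latent item difficulties (hypothetical)
ability = 0.5                               # latent ability of the model being evaluated

# Simulate right/wrong answers under a 1-parameter logistic (Rasch) model.
p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
correct = rng.random(n_items) < p_correct

easy_subset = np.argsort(difficulty)[:250]   # a small subset of the easiest items
hard_subset = np.argsort(difficulty)[-250:]  # an equally small subset of the hardest items

def rasch_ability(correct, difficulty, grid=np.linspace(-4.0, 4.0, 801)):
    """Maximum-likelihood ability estimate, given known item difficulties."""
    p = 1.0 / (1.0 + np.exp(-(grid[:, None] - difficulty[None, :])))
    loglik = np.where(correct[None, :], np.log(p), np.log1p(-p)).sum(axis=1)
    return grid[np.argmax(loglik)]

for name, idx in [("easy subset", easy_subset), ("hard subset", hard_subset)]:
    raw = correct[idx].mean()
    adjusted = rasch_ability(correct[idx], difficulty[idx])
    print(f"{name}: raw accuracy = {raw:.2f}, difficulty-adjusted ability = {adjusted:.2f}")

# The raw accuracies differ sharply even though the same "model" answered both
# subsets; the difficulty-adjusted estimates should land much closer to each
# other and to the true ability (0.5), which is the confound the paper targets.
```

Only the confound and a difficulty-aware correction are illustrated here; the cited paper's point is that predicting item difficulty from content keeps such an adjustment reliable at a fraction of the usual evaluation cost.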
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.