Can we hop in general? A discussion of benchmark selection and design using the Hopper environment
- URL: http://arxiv.org/abs/2410.08870v2
- Date: Mon, 14 Oct 2024 13:28:20 GMT
- Title: Can we hop in general? A discussion of benchmark selection and design using the Hopper environment
- Authors: Claas A Voelcker, Marcel Hussing, Eric Eaton,
- Abstract summary: We argue that benchmarking in reinforcement learning needs to be treated as a scientific discipline itself.
Case study shows that the selection of standard benchmarking suites can drastically change how we judge performance of algorithms.
- Score: 12.18012293738896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Empirical, benchmark-driven testing is a fundamental paradigm in the current RL community. While using off-the-shelf benchmarks in reinforcement learning (RL) research is a common practice, this choice is rarely discussed. Benchmark choices are often done based on intuitive ideas like "legged robots" or "visual observations". In this paper, we argue that benchmarking in RL needs to be treated as a scientific discipline itself. To illustrate our point, we present a case study on different variants of the Hopper environment to show that the selection of standard benchmarking suites can drastically change how we judge performance of algorithms. The field does not have a cohesive notion of what the different Hopper environments are representative - they do not even seem to be representative of each other. Our experimental results suggests a larger issue in the deep RL literature: benchmark choices are neither commonly justified, nor does there exist a language that could be used to justify the selection of certain environments. This paper concludes with a discussion of the requirements for proper discussion and evaluations of benchmarks and recommends steps to start a dialogue towards this goal.
Related papers
- OGBench: Benchmarking Offline Goal-Conditioned RL [72.00291801676684]
offline goal-conditioned reinforcement learning (GCRL) is a major problem in reinforcement learning.
We propose OGBench, a new, high-quality benchmark for algorithms research in offline goal-conditioned RL.
arXiv Detail & Related papers (2024-10-26T06:06:08Z) - Do Text-to-Vis Benchmarks Test Real Use of Visualisations? [11.442971909006657]
This paper investigates whether benchmarks reflect real-world use through an empirical study comparing benchmark datasets with code from public repositories.
Our findings reveal a substantial gap, with evaluations not testing the same distribution of chart types, attributes, and actions as real-world examples.
One dataset is representative, but requires extensive modification to become a practical end-to-end benchmark.
This shows that new benchmarks are needed to support the development of systems that truly address users' visualisation needs.
arXiv Detail & Related papers (2024-07-29T06:13:28Z) - Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench [15.565644819269803]
We show how some overlooked methodological choices can significantly influence Benchmark Agreement Testing (BAT) results.
We introduce BenchBench, a python package for BAT, and release the BenchBench-leaderboard, a meta-benchmark designed to evaluate benchmarks using their peers.
arXiv Detail & Related papers (2024-07-18T17:00:23Z) - The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z) - When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards [9.751405901938895]
We show that under existing leaderboards, the relative performance of LLMs is highly sensitive to minute details.
We show that for popular multiple-choice question benchmarks (e.g., MMLU), minor perturbations to the benchmark, such as changing the order of choices or the method of answer selection, result in changes in rankings up to 8 positions.
arXiv Detail & Related papers (2024-02-01T19:12:25Z) - InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal
Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z) - Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z) - Benchopt: Reproducible, efficient and collaborative optimization
benchmarks [67.29240500171532]
Benchopt is a framework to automate, reproduce and publish optimization benchmarks in machine learning.
Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing and extending experiments.
arXiv Detail & Related papers (2022-06-27T16:19:24Z) - A Theoretically Grounded Benchmark for Evaluating Machine Commonsense [6.725087407394836]
Theoretically-answered Commonsense Reasoning (TG-CSR) is based on discriminative question answering, but with questions designed to evaluate diverse aspects of commonsense.
TG-CSR is based on a subset of commonsense categories first proposed as a viable theory of commonsense by Gordon and Hobbs.
Preliminary results suggest that the benchmark is challenging even for advanced language representation models designed for discriminative CSR question answering tasks.
arXiv Detail & Related papers (2022-03-23T04:06:01Z) - B-Pref: Benchmarking Preference-Based Reinforcement Learning [84.41494283081326]
We introduce B-Pref, a benchmark specially designed for preference-based RL.
A key challenge with such a benchmark is providing the ability to evaluate candidate algorithms quickly.
B-Pref alleviates this by simulating teachers with a wide array of irrationalities.
arXiv Detail & Related papers (2021-11-04T17:32:06Z) - Do Fine-tuned Commonsense Language Models Really Generalize? [8.591839265985412]
We study the generalization issue in detail by designing and conducting a rigorous scientific study.
We find clear evidence that fine-tuned commonsense language models still do not generalize well, even with moderate changes to the experimental setup.
arXiv Detail & Related papers (2020-11-18T08:52:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.