Mapping global dynamics of benchmark creation and saturation in
artificial intelligence
- URL: http://arxiv.org/abs/2203.04592v1
- Date: Wed, 9 Mar 2022 09:16:49 GMT
- Title: Mapping global dynamics of benchmark creation and saturation in
artificial intelligence
- Authors: Adriano Barbosa-Silva, Simon Ott, Kathrin Blagec, Jan Brauner,
Matthias Samwald
- Abstract summary: We create maps of the global dynamics of benchmark creation and saturation.
We curated data for 1688 benchmarks covering the entire domains of computer vision and natural language processing.
- Score: 5.233652342195164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Benchmarks are crucial to measuring and steering progress in artificial
intelligence (AI). However, recent studies raised concerns over the state of AI
benchmarking, reporting issues such as benchmark overfitting, benchmark
saturation and increasing centralization of benchmark dataset creation. To
facilitate monitoring of the health of the AI benchmarking ecosystem, we
introduce methodologies for creating condensed maps of the global dynamics of
benchmark creation and saturation. We curated data for 1688 benchmarks covering
the entire domains of computer vision and natural language processing, and show
that a large fraction of benchmarks quickly trended towards near-saturation,
that many benchmarks fail to find widespread utilization, and that benchmark
performance gains for different AI tasks were prone to unforeseen bursts. We
conclude that future work should focus on large-scale community collaboration
and on mapping benchmark performance gains to real-world utility and impact of
AI.
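As a rough illustration of the saturation analysis the abstract describes, the sketch below computes, for each benchmark in a small hypothetical dataset, how much of an assumed performance ceiling has been closed over time and at what yearly rate. The record layout, the benchmark names (ImageClassX, ReadCompY), the ceiling values, and the 0.95 near-saturation threshold are all illustrative assumptions, not the paper's actual methodology or curated data.

# Minimal sketch (not the paper's pipeline): estimating how quickly a benchmark
# trends toward saturation from a time series of state-of-the-art (SOTA) results.
# Assumes hypothetical records (benchmark, year, sota_score, ceiling), where
# `ceiling` is an assumed upper bound such as human performance or a perfect score.

from collections import defaultdict

# Hypothetical curated data; a real curation effort would draw on leaderboard
# sources rather than hand-written records.
RECORDS = [
    {"benchmark": "ImageClassX", "year": 2016, "sota_score": 72.0, "ceiling": 95.0},
    {"benchmark": "ImageClassX", "year": 2018, "sota_score": 86.0, "ceiling": 95.0},
    {"benchmark": "ImageClassX", "year": 2021, "sota_score": 93.5, "ceiling": 95.0},
    {"benchmark": "ReadCompY",   "year": 2018, "sota_score": 60.0, "ceiling": 90.0},
    {"benchmark": "ReadCompY",   "year": 2021, "sota_score": 66.0, "ceiling": 90.0},
]

def saturation(score: float, ceiling: float) -> float:
    """Fraction of the assumed headroom that has been closed (1.0 = saturated)."""
    return min(score / ceiling, 1.0)

def saturation_trend(records):
    """Per-benchmark saturation at first/last observation plus yearly closure rate."""
    by_bench = defaultdict(list)
    for r in records:
        by_bench[r["benchmark"]].append(r)
    trends = {}
    for name, rows in by_bench.items():
        rows.sort(key=lambda r: r["year"])
        first, last = rows[0], rows[-1]
        s0 = saturation(first["sota_score"], first["ceiling"])
        s1 = saturation(last["sota_score"], last["ceiling"])
        years = max(last["year"] - first["year"], 1)
        trends[name] = {
            "first_saturation": round(s0, 3),
            "last_saturation": round(s1, 3),
            "closure_per_year": round((s1 - s0) / years, 3),
            "near_saturated": s1 >= 0.95,  # illustrative threshold
        }
    return trends

if __name__ == "__main__":
    for name, trend in saturation_trend(RECORDS).items():
        print(name, trend)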
Related papers
- Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation [2.2241228857601727]
This paper presents an interdisciplinary meta-review of about 100 studies that discuss shortcomings in quantitative benchmarking practices.
It brings together many fine-grained issues in the design and application of benchmarks with broader sociotechnical issues.
Our review also highlights a series of systemic flaws in current practices, such as misaligned incentives, construct validity issues, unknown unknowns, and problems with the gaming of benchmark results.
arXiv Detail & Related papers (2025-02-10T15:25:06Z)
- More than Marketing? On the Information Value of AI Benchmarks for Practitioners [42.73526862595375]
In academia, public benchmarks were generally viewed as suitable measures for capturing research progress.
In product and policy, benchmarks were often found to be inadequate for informing substantive decisions.
We conclude that effective benchmarks should provide meaningful, real-world evaluations, incorporate domain expertise, and maintain transparency in scope and goals.
arXiv Detail & Related papers (2024-12-07T03:35:39Z)
- The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents.
We conduct the first large-scale, multi-benchmark web agent experiment.
Results highlight a large discrepancy between OpenAI's and Anthropic's latest models.
arXiv Detail & Related papers (2024-12-06T23:43:59Z)
- BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices [28.70453947993952]
We develop an assessment framework considering 46 best practices across an AI benchmark's lifecycle and evaluate 24 AI benchmarks against it.
We find that there exist large quality differences and that commonly used benchmarks suffer from significant issues.
arXiv Detail & Related papers (2024-11-20T02:38:24Z)
- IM-IAD: Industrial Image Anomaly Detection Benchmark in Manufacturing [88.35145788575348]
Image anomaly detection (IAD) is an emerging and vital computer vision task in industrial manufacturing.
The lack of a uniform industrial manufacturing (IM) benchmark is hindering the development and usage of IAD methods in real-world applications.
We construct a comprehensive image anomaly detection benchmark (IM-IAD), which includes 19 algorithms on seven major datasets.
arXiv Detail & Related papers (2023-01-31T01:24:45Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- Benchmarking Node Outlier Detection on Graphs [90.29966986023403]
Graph outlier detection is an emerging but crucial machine learning task with numerous applications.
We present the first comprehensive unsupervised node outlier detection benchmark for graphs called UNOD.
arXiv Detail & Related papers (2022-06-21T01:46:38Z)
- The Benchmark Lottery [114.43978017484893]
"A benchmark lottery" describes the overall fragility of the machine learning benchmarking process.
We show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks.
arXiv Detail & Related papers (2021-07-14T21:08:30Z)
- Benchmarking Graph Neural Networks [75.42159546060509]
Graph neural networks (GNNs) have become the standard toolkit for analyzing and learning from data on graphs.
For any successful field to become mainstream and reliable, benchmarks must be developed to quantify progress.
The GitHub repository has reached 1,800 stars and 339 forks, demonstrating the utility of the proposed open-source framework.
arXiv Detail & Related papers (2020-03-02T15:58:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.