Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid
Progress
- URL: http://arxiv.org/abs/2402.19472v1
- Date: Thu, 29 Feb 2024 18:58:26 GMT
- Title: Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid
Progress
- Authors: Ameya Prabhu, Vishaal Udandarao, Philip Torr, Matthias Bethge, Adel
Bibi, Samuel Albanie
- Abstract summary: With repeated testing, the risk of overfitting grows as algorithms over-exploit benchmark idiosyncrasies.
In our work, we seek to mitigate this challenge by compiling ever-expanding large-scale benchmarks called Lifelong Benchmarks.
While reducing overfitting, lifelong benchmarks introduce a key challenge: the high cost of evaluating a growing number of models.
- Score: 42.61046639944395
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Standardized benchmarks drive progress in machine learning. However, with
repeated testing, the risk of overfitting grows as algorithms over-exploit
benchmark idiosyncrasies. In our work, we seek to mitigate this challenge by
compiling ever-expanding large-scale benchmarks called Lifelong Benchmarks. As
exemplars of our approach, we create Lifelong-CIFAR10 and Lifelong-ImageNet,
containing (for now) 1.69M and 1.98M test samples, respectively. While reducing
overfitting, lifelong benchmarks introduce a key challenge: the high cost of
evaluating a growing number of models across an ever-expanding sample set. To
address this challenge, we also introduce an efficient evaluation framework:
Sort & Search (S&S), which reuses previously evaluated models by leveraging
dynamic programming algorithms to selectively rank and sub-select test samples,
enabling cost-effective lifelong benchmarking. Extensive empirical evaluations
across 31,000 models demonstrate that S&S achieves highly-efficient approximate
accuracy measurement, reducing compute cost from 180 GPU days to 5 GPU hours
(1000x reduction) on a single A100 GPU, with low approximation error. As such,
lifelong benchmarks offer a robust, practical solution to the "benchmark
exhaustion" problem.
Related papers
- HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly [34.205934899868346]
We present HELMET, a comprehensive benchmark encompassing seven diverse, application-centric categories.
We find that synthetic tasks like NIAH are not good predictors of downstream performance.
While most LCLMs achieve perfect NIAH scores, open-source models significantly lag behind closed ones when the task requires full-context reasoning.
arXiv Detail & Related papers (2024-10-03T17:20:11Z)
- From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline [47.19203597218352]
BenchBuilder is an automated pipeline that curates high-quality, open-ended prompts from large, crowd-sourced datasets.
We release Arena-Hard-Auto, a benchmark consisting of 500 challenging prompts curated by BenchBuilder.
Our work sets a new framework for the scalable curation of automated benchmarks from extensive data.
arXiv Detail & Related papers (2024-06-17T17:26:10Z)
- It's all about PR -- Smart Benchmarking AI Accelerators using Performance Representatives [40.197673152937256]
Training statistical performance models often requires vast amounts of data, which demands a significant time investment and can be difficult when hardware availability is limited.
We propose a novel performance modeling methodology that significantly reduces the number of training samples while maintaining good accuracy.
We achieve a Mean Absolute Percentage Error (MAPE) as low as 0.02% for single-layer estimations and 0.68% for whole-model estimations with fewer than 10,000 training samples.
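For reference, the mean absolute percentage error quoted above is the standard metric, with y_i the measured performance and \hat{y}_i the model's estimate:

    \mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|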
arXiv Detail & Related papers (2024-06-12T15:34:28Z)
- MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures [57.886592207948844]
We propose MixEval, a new paradigm for establishing efficient, gold-standard evaluation by strategically mixing off-the-shelf benchmarks.
It bridges (1) comprehensive and well-distributed real-world user queries and (2) efficient and fairly graded ground-truth-based benchmarks by matching queries mined from the web with similar queries from existing benchmarks.
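A hedged sketch of that matching step, assuming embedding-based cosine-similarity retrieval (the summary does not specify how similarity is computed); `embed` is a placeholder for any text encoder:

    import numpy as np

    def embed(texts):
        """Placeholder encoder; swap in any sentence-embedding model."""
        rng = np.random.default_rng(0)
        return rng.normal(size=(len(texts), 64))

    def match_queries(web_queries, bench_queries, top_k=1):
        """Map each web-mined query to its most similar benchmark query so that
        graded, ground-truth-based answers can be reused for real-world queries."""
        q, b = embed(web_queries), embed(bench_queries)
        q /= np.linalg.norm(q, axis=1, keepdims=True)
        b /= np.linalg.norm(b, axis=1, keepdims=True)
        return np.argsort(-(q @ b.T), axis=1)[:, :top_k]  # indices into bench_queries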
arXiv Detail & Related papers (2024-06-03T05:47:05Z)
- Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models [81.27391252152199]
Large language models (LLMs) have achieved impressive performance across various natural language benchmarks.
We propose to automate dataset updating and provide systematic analysis regarding its effectiveness.
There are two updating strategies: 1) a mimicking strategy that generates similar samples based on the original data, and 2) an extending strategy that further expands existing samples.
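A rough sketch of how the two strategies could be expressed as generation templates; the prompts and the `generate` callable are illustrative placeholders, not the paper's actual pipeline:

    # Hypothetical templates for the two updating strategies; the real prompts
    # and generation backend are not specified in the summary above.
    MIMIC_PROMPT = ("Write a new test question probing the same skill as the "
                    "example, with different surface content.\n"
                    "Example: {sample}\nNew question:")
    EXTEND_PROMPT = ("Extend the example into a harder variant that still has "
                     "a single verifiable answer.\n"
                     "Example: {sample}\nExtended question:")

    def update_dataset(samples, generate, strategy="mimic"):
        """generate: any text-generation callable (e.g. an LLM API wrapper)."""
        template = MIMIC_PROMPT if strategy == "mimic" else EXTEND_PROMPT
        return [generate(template.format(sample=s)) for s in samples]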
arXiv Detail & Related papers (2024-02-19T07:15:59Z)
- Efficient Benchmarking of Language Models [22.696230279151166]
We present the problem of Efficient Benchmarking, namely, intelligently reducing the costs of LM evaluation without compromising reliability.
Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability trade-off.
We propose an evaluation algorithm that, when applied to the HELM benchmark, leads to dramatic cost savings with minimal loss of benchmark reliability.
arXiv Detail & Related papers (2023-08-22T17:59:30Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Real-Time Visual Feedback to Guide Benchmark Creation: A Human-and-Metric-in-the-Loop Workflow [22.540665278228975]
We propose VAIDA, a novel benchmark creation paradigm for NLP.
VAIDA focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies.
We find that VAIDA decreases the effort, frustration, and mental and temporal demands of crowdworkers and analysts.
arXiv Detail & Related papers (2023-02-09T04:43:10Z)
- Do Question Answering Modeling Improvements Hold Across Benchmarks? [84.48867898593052]
We measure concurrence between 32 QA benchmarks on a set of 20 diverse modeling approaches.
Despite years of intense community focus on a small number of benchmarks, the modeling improvements studied hold broadly.
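Concurrence between two benchmarks is typically quantified as the correlation between the scores the same set of models obtains on each; the sketch below uses Kendall's tau as one such measure (whether this matches the paper's exact definition is an assumption):

    from itertools import combinations
    from scipy.stats import kendalltau

    def concurrence(scores):
        """scores: dict mapping benchmark name -> list of per-model accuracies,
        with the same model ordering for every benchmark."""
        out = {}
        for a, b in combinations(scores, 2):
            tau, _ = kendalltau(scores[a], scores[b])  # rank correlation in [-1, 1]
            out[(a, b)] = tau
        return out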
arXiv Detail & Related papers (2021-02-01T18:55:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.