From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
- URL: http://arxiv.org/abs/2406.11939v2
- Date: Mon, 14 Oct 2024 18:11:58 GMT
- Title: From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
- Authors: Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica
- Abstract summary: BenchBuilder is an automated pipeline that curates high-quality, open-ended prompts from large, crowd-sourced datasets.
We release Arena-Hard-Auto, a benchmark consisting of 500 challenging prompts curated by BenchBuilder.
Our work sets a new framework for the scalable curation of automated benchmarks from extensive data.
- Score: 47.19203597218352
- Abstract: The rapid evolution of Large Language Models (LLMs) has outpaced the development of model evaluation, highlighting the need for continuous curation of new, challenging benchmarks. However, manual curation of high-quality, human-aligned benchmarks is expensive and time-consuming. To address this, we introduce BenchBuilder, an automated pipeline that leverages LLMs to curate high-quality, open-ended prompts from large, crowd-sourced datasets, enabling continuous benchmark updates without a human in the loop. We apply BenchBuilder to datasets such as Chatbot Arena and WildChat-1M, extracting challenging prompts and utilizing LLM-as-a-Judge for automatic model evaluation. To validate benchmark quality, we propose new metrics to measure a benchmark's alignment with human preferences and its ability to separate models. We release Arena-Hard-Auto, a benchmark consisting of 500 challenging prompts curated by BenchBuilder. Arena-Hard-Auto provides 3x higher separation of model performances compared to MT-Bench and achieves 98.6% correlation with human preference rankings, all at a cost of $20. Our work sets a new framework for the scalable curation of automated benchmarks from extensive data.
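The curation step described above can be pictured as an LLM judge scoring each crowd-sourced prompt against a set of quality criteria and keeping only the prompts that satisfy most of them. The sketch below is a minimal, assumed illustration of that idea, not the authors' released code: the criteria list, judge model, threshold, and OpenAI-style client are illustrative choices.

```python
# Hypothetical sketch of a BenchBuilder-style prompt filter (not the paper's code).
# Assumes an OpenAI-compatible client; criteria, model, and threshold are illustrative.
from openai import OpenAI

client = OpenAI()

CRITERIA = [
    "specificity", "domain knowledge", "complexity",
    "problem-solving", "creativity", "technical accuracy", "real-world application",
]

def quality_score(prompt: str, model: str = "gpt-4o-mini") -> int:
    """Ask an LLM judge how many quality criteria the prompt satisfies (0..len(CRITERIA))."""
    judge_prompt = (
        "Count how many of the following qualities the user prompt exhibits: "
        + ", ".join(CRITERIA)
        + ".\nReply with a single integer only.\n\nUser prompt:\n"
        + prompt
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    try:
        return int(reply.choices[0].message.content.strip())
    except ValueError:
        return 0  # treat unparsable judge output as lowest quality

def filter_prompts(prompts: list[str], threshold: int = 6) -> list[str]:
    """Keep prompts that the judge scores at or above the threshold."""
    return [p for p in prompts if quality_score(p) >= threshold]
```

In this reading, the retained prompts would then feed the LLM-as-a-Judge model evaluation mentioned in the abstract.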
Related papers
- BENCHAGENTS: Automated Benchmark Creation with Agent Interaction [16.4783894348333]
We introduce BENCHAGENTS, a framework that methodically leverages large language models (LLMs) to automate benchmark creation for complex capabilities.
We use BENCHAGENTS to create benchmarks to evaluate capabilities related to planning and constraint satisfaction during text generation.
We then use these benchmarks to study seven state-of-the-art models and extract new insights on common failure modes and model differences.
arXiv Detail & Related papers (2024-10-29T22:56:18Z)
- AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models [84.65095045762524]
We present three desiderata for a good benchmark for language models.
A novel benchmark reveals new trends in model rankings not shown by previous benchmarks.
We use AutoBencher to create datasets for math, multilingual, and knowledge-intensive question answering.
arXiv Detail & Related papers (2024-07-11T10:03:47Z)
- LiveBench: A Challenging, Contamination-Free LLM Benchmark [101.21578097087699]
We release LiveBench, the first benchmark that contains frequently-updated questions from recent information sources.
We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 110B in size.
Questions will be added and updated on a monthly basis, and we will release new tasks and harder versions of tasks over time.
arXiv Detail & Related papers (2024-06-27T16:47:42Z)
- WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild [57.272096543738336]
We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) on challenging tasks from real users in the wild.
WildBench consists of 1,024 tasks carefully selected from over one million human-chatbot conversation logs.
We have developed two metrics, WB-Reward and WB-Score, which are computable using advanced LLMs.
arXiv Detail & Related papers (2024-06-07T09:15:44Z)
- MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures [57.886592207948844]
We propose MixEval, a new paradigm for establishing efficient, gold-standard evaluation by strategically mixing off-the-shelf benchmarks.
It bridges (1) comprehensive and well-distributed real-world user queries and (2) efficient and fairly-graded ground-truth-based benchmarks, by matching queries mined from the web with similar queries from existing benchmarks.
arXiv Detail & Related papers (2024-06-03T05:47:05Z)
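The query matching MixEval describes can be approximated, as a rough sketch, with nearest-neighbor search over sentence embeddings; the encoder name and similarity threshold below are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch of embedding-based query matching (assumption: MixEval-style
# matching can be approximated with cosine similarity over sentence embeddings).
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed small general-purpose encoder

def match_queries(web_queries: list[str], benchmark_queries: list[str], threshold: float = 0.7):
    """For each web-mined query, return the most similar benchmark query above a threshold."""
    web_emb = model.encode(web_queries, normalize_embeddings=True)
    bench_emb = model.encode(benchmark_queries, normalize_embeddings=True)
    sims = web_emb @ bench_emb.T  # cosine similarity, since embeddings are normalized
    matches = []
    for i, row in enumerate(sims):
        j = int(np.argmax(row))
        if row[j] >= threshold:
            matches.append((web_queries[i], benchmark_queries[j], float(row[j])))
    return matches
```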
- Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress [42.61046639944395]
With repeated testing, the risk of overfitting grows as algorithms over-exploit benchmark idiosyncrasies.
In our work, we seek to mitigate this challenge by compiling ever-expanding large-scale benchmarks called Lifelong Benchmarks.
While reducing overfitting, lifelong benchmarks introduce a key challenge: the high cost of evaluating a growing number of models.
arXiv Detail & Related papers (2024-02-29T18:58:26Z)
- Do Question Answering Modeling Improvements Hold Across Benchmarks? [84.48867898593052]
We measure concurrence between 32 QA benchmarks on a set of 20 diverse modeling approaches.
Despite years of intense community focus on a small number of benchmarks, the modeling improvements studied hold broadly.
arXiv Detail & Related papers (2021-02-01T18:55:38Z)
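Both the 98.6% human-preference correlation reported for Arena-Hard-Auto and the concurrence study above come down to comparing the model rankings that two evaluations induce. The following is a minimal sketch using Spearman rank correlation; the statistic and the scores are placeholders, since neither paper's exact computation is reproduced here.

```python
# Sketch: compare two benchmarks by correlating the model rankings they induce.
# The scores below are made up; real use would plug in measured benchmark scores.
from scipy.stats import spearmanr

benchmark_a = {"model_1": 82.0, "model_2": 74.5, "model_3": 61.0, "model_4": 55.2}
benchmark_b = {"model_1": 1190.0, "model_2": 1145.0, "model_3": 1102.0, "model_4": 1080.0}

models = sorted(benchmark_a)  # fixed model order shared by both score lists
rho, p_value = spearmanr(
    [benchmark_a[m] for m in models],
    [benchmark_b[m] for m in models],
)
print(f"rank correlation between benchmarks: {rho:.3f} (p={p_value:.3g})")
```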