Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models
- URL: http://arxiv.org/abs/2504.01001v1
- Date: Tue, 01 Apr 2025 17:40:08 GMT
- Title: Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models
- Authors: José Pombal, Nuno M. Guerreiro, Ricardo Rei, André F. T. Martins,
- Abstract summary: We present Zero-shot Benchmarking (ZSB), a framework for creating high-quality benchmarks for any task.<n>ZSB is simple and flexible: it requires only the creation of a prompt for data generation and one for evaluation.<n>It is scalable to tasks and languages where collecting real-world data is costly or impractical.
- Score: 24.481028155002523
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As language models improve and become capable of performing more complex tasks across modalities, evaluating them automatically becomes increasingly challenging. Developing strong and robust task-specific automatic metrics gets harder, and human-annotated test sets -- which are expensive to create -- saturate more quickly. A compelling alternative is to design reliable strategies to automate the creation of test data and evaluation, but previous attempts either rely on pre-existing data, or focus solely on individual tasks. We present Zero-shot Benchmarking (ZSB), a framework for creating high-quality benchmarks for any task by leveraging language models for both synthetic test data creation and evaluation. ZSB is simple and flexible: it requires only the creation of a prompt for data generation and one for evaluation; it is scalable to tasks and languages where collecting real-world data is costly or impractical; it is model-agnostic, allowing the creation of increasingly challenging benchmarks as models improve. To assess the effectiveness of our framework, we create benchmarks for five text-only tasks and a multi-modal one: general capabilities in four languages (English, Chinese, French, and Korean), translation, and general vision-language capabilities in English. We then rank a broad range of open and closed systems on our benchmarks. ZSB rankings consistently correlate strongly with human rankings, outperforming widely-adopted standard benchmarks. Through ablations, we find that strong benchmarks can be created with open models, and that judge model size and dataset variety are crucial drivers of performance. We release all our benchmarks, and code to reproduce our experiments and to produce new benchmarks.
Related papers
- MMTEB: Massive Multilingual Text Embedding Benchmark [85.18187649328792]
We introduce the Massive Multilingual Text Embedding Benchmark (MMTEB)<n>MMTEB covers over 500 quality-controlled evaluation tasks across 250+ languages.<n>We develop several highly multilingual benchmarks, which we use to evaluate a representative set of models.
arXiv Detail & Related papers (2025-02-19T10:13:43Z) - Multilingual European Language Models: Benchmarking Approaches and Challenges [2.413212225810367]
generative large language models (LLMs) can solve different tasks through chat interaction.<n>This paper analyses the benefits and limitations of current evaluation datasets, focusing on multilingual European benchmarks.<n>We discuss potential solutions to enhance translation quality and cultural biases, including human-in-the-loop verification and iterative translation ranking.
arXiv Detail & Related papers (2025-02-18T14:32:17Z) - BENCHAGENTS: Automated Benchmark Creation with Agent Interaction [16.4783894348333]
We introduce BENCHAGENTS, a framework that methodically leverages large language models (LLMs) to automate benchmark creation for complex capabilities.
We use BENCHAGENTS to create benchmarks to evaluate capabilities related to planning and constraint satisfaction during text generation.
We then use these benchmarks to study seven state-of-the-art models and extract new insights on common failure modes and model differences.
arXiv Detail & Related papers (2024-10-29T22:56:18Z) - DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM
Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs)
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z) - XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented
Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z) - This is the way: designing and compiling LEPISZCZE, a comprehensive NLP
benchmark for Polish [5.8090623549313944]
We introduce LEPISZCZE, a new, comprehensive benchmark for Polish NLP.
We use five datasets from the Polish benchmark and add eight novel datasets.
We provide insights and experiences learned while creating the benchmark for Polish as the blueprint to design similar benchmarks for other low-resourced languages.
arXiv Detail & Related papers (2022-11-23T16:51:09Z) - GEMv2: Multilingual NLG Benchmarking in a Single Line of Code [161.1761414080574]
Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers.
GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
arXiv Detail & Related papers (2022-06-22T17:52:30Z) - KOBEST: Korean Balanced Evaluation of Significant Tasks [3.664687661363732]
A well-formulated benchmark plays a critical role in spurring advancements in the natural language processing (NLP) field.
We propose a new benchmark named Korean balanced evaluation of significant tasks (KoBEST), which consists of five Korean-language downstream tasks.
arXiv Detail & Related papers (2022-04-09T20:13:51Z) - CUGE: A Chinese Language Understanding and Generation Evaluation
Benchmark [144.05723617401674]
General-purpose language intelligence evaluation has been a longstanding goal for natural language processing.
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
We propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark with the following features.
arXiv Detail & Related papers (2021-12-27T11:08:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.