Towards More Robust NLP System Evaluation: Handling Missing Scores in
Benchmarks
- URL: http://arxiv.org/abs/2305.10284v1
- Date: Wed, 17 May 2023 15:20:31 GMT
- Title: Towards More Robust NLP System Evaluation: Handling Missing Scores in
Benchmarks
- Authors: Anas Himmi and Ekhine Irurozki and Nathan Noiry and Stephan Clemencon
and Pierre Colombo
- Abstract summary: This paper formalizes an existing problem in NLP research: benchmarking when some systems scores are missing on the task.
We introduce an extended benchmark, which contains over 131 million scores, an order of magnitude larger than existing benchmarks.
- Score: 9.404931130084803
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The evaluation of natural language processing (NLP) systems is crucial for
advancing the field, but current benchmarking approaches often assume that all
systems have scores available for all tasks, which is not always practical. In
reality, several factors such as the cost of running baseline, private systems,
computational limitations, or incomplete data may prevent some systems from
being evaluated on entire tasks. This paper formalize an existing problem in
NLP research: benchmarking when some systems scores are missing on the task,
and proposes a novel approach to address it. Our method utilizes a compatible
partial ranking approach to impute missing data, which is then aggregated using
the Borda count method. It includes two refinements designed specifically for
scenarios where either task-level or instance-level scores are available. We
also introduce an extended benchmark, which contains over 131 million scores,
an order of magnitude larger than existing benchmarks. We validate our methods
and demonstrate their effectiveness in addressing the challenge of missing
system evaluation on an entire task. This work highlights the need for more
comprehensive benchmarking approaches that can handle real-world scenarios
where not all systems are evaluated on the entire task.
Related papers
- SureMap: Simultaneous Mean Estimation for Single-Task and Multi-Task Disaggregated Evaluation [75.56845750400116]
Disaggregated evaluation -- estimation of performance of a machine learning model on different subpopulations -- is a core task when assessing performance and group-fairness of AI systems.
We develop SureMap that has high estimation accuracy for both multi-task and single-task disaggregated evaluations of blackbox models.
Our method combines maximum a posteriori (MAP) estimation using a well-chosen prior together with cross-validation-free tuning via Stein's unbiased risk estimate (SURE)
arXiv Detail & Related papers (2024-11-14T17:53:35Z) - On Speeding Up Language Model Evaluation [48.51924035873411]
Development of prompt-based methods with Large Language Models (LLMs) requires making numerous decisions.
We propose a novel method to address this challenge.
We show that it can identify the top-performing method using only 5-15% of the typically needed resources.
arXiv Detail & Related papers (2024-07-08T17:48:42Z) - CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [56.691926887209895]
We present textbfname (textbfInformation textbfRetrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities.
name comprises textbften meticulously curated code datasets, spanning textbfeight distinctive retrieval tasks across textbfseven diverse domains.
We evaluate nine widely used retrieval models using name, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
arXiv Detail & Related papers (2024-07-03T07:58:20Z) - DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z) - Vote'n'Rank: Revision of Benchmarking with Social Choice Theory [7.224599819499157]
This paper proposes Vote'n'Rank, a framework for ranking systems in multi-task benchmarks under the principles of the social choice theory.
We demonstrate that our approach can be efficiently utilised to draw new insights on benchmarking in several ML sub-fields.
arXiv Detail & Related papers (2022-10-11T20:19:11Z) - Are We There Yet? A Decision Framework for Replacing Term Based
Retrieval with Dense Retrieval Systems [35.77217529138364]
Several dense retrieval (DR) models have demonstrated competitive performance to term-based retrieval.
DR projects queries and documents into a dense vector space and retrieves results via (approximate) nearest neighbor search.
It is impossible to predict whether DR will become ubiquitous in the future, but one way this is possible is through repeated applications of decision processes.
arXiv Detail & Related papers (2022-06-26T23:16:05Z) - What are the best systems? New perspectives on NLP Benchmarking [10.27421161397197]
We propose a new procedure to rank systems based on their performance across different tasks.
Motivated by the social choice theory, the final system ordering is obtained through aggregating the rankings induced by each task.
We show that our method yields different conclusions on state-of-the-art systems than the mean-aggregation procedure.
arXiv Detail & Related papers (2022-02-08T11:44:20Z) - The Benchmark Lottery [114.43978017484893]
"A benchmark lottery" describes the overall fragility of the machine learning benchmarking process.
We show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks.
arXiv Detail & Related papers (2021-07-14T21:08:30Z) - ESBM: An Entity Summarization BenchMark [20.293900908253544]
We create an Entity Summarization BenchMark (ESBM) which overcomes the limitations of existing benchmarks and meets standard desiderata for a benchmark.
Considering all of these systems are unsupervised, we also implement and evaluate a supervised learning based system for reference.
arXiv Detail & Related papers (2020-03-08T07:12:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.