Vote'n'Rank: Revision of Benchmarking with Social Choice Theory
- URL: http://arxiv.org/abs/2210.05769v2
- Date: Thu, 13 Oct 2022 09:12:16 GMT
- Title: Vote'n'Rank: Revision of Benchmarking with Social Choice Theory
- Authors: Mark Rofin, Vladislav Mikhailov, Mikhail Florinskiy, Andrey
Kravchenko, Elena Tutubalina, Tatiana Shavrina, Daniel Karabekyan, Ekaterina
Artemova
- Abstract summary: This paper proposes Vote'n'Rank, a framework for ranking systems in multi-task benchmarks under the principles of social choice theory.
We demonstrate that our approach can be efficiently utilised to draw new insights into benchmarking in several ML sub-fields.
- Score: 7.224599819499157
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of state-of-the-art systems in different applied areas of
machine learning (ML) is driven by benchmarks, which have shaped the paradigm
of evaluating generalisation capabilities from multiple perspectives. Although
the paradigm is shifting towards more fine-grained evaluation across diverse
tasks, the delicate question of how to aggregate performances has received
particular interest in the community. In general, benchmarks follow unspoken
utilitarian principles, where systems are ranked based on their mean score
over task-specific metrics. Such an aggregation procedure has been viewed as a
sub-optimal evaluation protocol, which may have created the illusion of
progress. This paper proposes Vote'n'Rank, a framework for ranking systems in
multi-task benchmarks under the principles of social choice theory. We
demonstrate that our approach can be efficiently utilised to draw new insights
into benchmarking in several ML sub-fields and to identify the best-performing
systems in research and development case studies. Vote'n'Rank's procedures are
more robust than mean aggregation, can handle missing performance scores, and
can determine the conditions under which a system becomes the winner.
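To make the contrast concrete, below is a minimal sketch (not the paper's implementation) of how a positional social-choice rule such as the Borda count can reorder systems relative to mean-score aggregation on a multi-task benchmark. The score table, the choice of rule, and the convention for missing scores are illustrative assumptions only.

```python
# Illustrative sketch only (not the Vote'n'Rank implementation): contrasting
# mean-score aggregation with a Borda-count aggregation on a toy benchmark.
# The score table, rule choice, and missing-score convention are assumptions.

from collections import defaultdict

# Hypothetical scores: system -> task -> metric value (None marks a missing score).
scores = {
    "sys_A": {"task1": 0.91, "task2": 0.62, "task3": 0.70},
    "sys_B": {"task1": 0.89, "task2": 0.71, "task3": None},
    "sys_C": {"task1": 0.90, "task2": 0.65, "task3": 0.74},
}
tasks = ["task1", "task2", "task3"]

def mean_ranking(scores):
    """Rank systems by their mean score over the tasks they were evaluated on."""
    means = {
        sys: sum(v for v in per_task.values() if v is not None)
             / max(1, sum(v is not None for v in per_task.values()))
        for sys, per_task in scores.items()
    }
    return sorted(means, key=means.get, reverse=True)

def borda_ranking(scores, tasks):
    """Rank systems by summed Borda points: per task, the best of the k systems
    evaluated on it gets k-1 points and the worst gets 0; a missing score simply
    drops the system from that task's ballot (one possible convention)."""
    points = defaultdict(float)
    for task in tasks:
        ballot = [(sys, scores[sys][task]) for sys in scores if scores[sys][task] is not None]
        ballot.sort(key=lambda pair: pair[1])        # worst -> best
        for rank, (sys, _) in enumerate(ballot):     # 0, 1, ..., k-1 points
            points[sys] += rank
    return sorted(points, key=points.get, reverse=True)

print("mean :", mean_ranking(scores))          # e.g. ['sys_B', 'sys_C', 'sys_A']
print("borda:", borda_ranking(scores, tasks))  # e.g. ['sys_C', ...] -- a different winner
```

The framework described in the paper defines several such aggregation procedures with their own handling of ties and missing scores; this toy example only shows why the choice of rule can change which system is declared the winner.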
Related papers
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia.
In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models.
This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
- Full Stage Learning to Rank: A Unified Framework for Multi-Stage Systems [40.199257203898846]
We propose an improved ranking principle for multi-stage systems, namely the Generalized Probability Ranking Principle (GPRP).
GPRP emphasizes both the selection bias in each stage of the system pipeline as well as the underlying interest of users.
Our core idea is to first estimate the selection bias in the subsequent stages and then learn a ranking model that best complies with the downstream modules' selection bias.
arXiv Detail & Related papers (2024-05-08T06:35:04Z)
- When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards [9.751405901938895]
We show that under existing leaderboards, the relative performance of LLMs is highly sensitive to minute details.
We show that for popular multiple-choice question benchmarks (e.g., MMLU), minor perturbations to the benchmark, such as changing the order of choices or the method of answer selection, result in changes in rankings up to 8 positions.
arXiv Detail & Related papers (2024-02-01T19:12:25Z)
- Hierarchical Evaluation Framework: Best Practices for Human Evaluation [17.91641890651225]
The absence of widely accepted human evaluation metrics in NLP hampers fair comparisons among different systems and the establishment of universal assessment standards.
We develop our own hierarchical evaluation framework to provide a more comprehensive representation of the NLP system's performance.
In future work, we will investigate the potential time-saving benefits of our proposed framework for evaluators assessing NLP systems.
arXiv Detail & Related papers (2023-10-03T09:46:02Z)
- Bipartite Ranking Fairness through a Model Agnostic Ordering Adjustment [54.179859639868646]
We propose a model agnostic post-processing framework xOrder for achieving fairness in bipartite ranking.
xOrder is compatible with various classification models and ranking fairness metrics, including supervised and unsupervised fairness metrics.
We evaluate our proposed algorithm on four benchmark data sets and two real-world patient electronic health record repositories.
arXiv Detail & Related papers (2023-07-27T07:42:44Z)
- KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation [69.57018875757622]
We propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility.
Using KPEval, we re-evaluate 23 keyphrase systems and discover that established model comparison results have blind spots.
arXiv Detail & Related papers (2023-03-27T17:45:38Z)
- What are the best systems? New perspectives on NLP Benchmarking [10.27421161397197]
We propose a new procedure to rank systems based on their performance across different tasks.
Motivated by social choice theory, the final system ordering is obtained by aggregating the rankings induced by each task.
We show that our method yields different conclusions on state-of-the-art systems than the mean-aggregation procedure.
arXiv Detail & Related papers (2022-02-08T11:44:20Z)
- The Benchmark Lottery [114.43978017484893]
"A benchmark lottery" describes the overall fragility of the machine learning benchmarking process.
We show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks.
arXiv Detail & Related papers (2021-07-14T21:08:30Z)
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns on whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing ranking fairness and algorithm utility in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)
- PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kind of metric is more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.