What are the best systems? New perspectives on NLP Benchmarking
- URL: http://arxiv.org/abs/2202.03799v2
- Date: Thu, 10 Feb 2022 11:22:35 GMT
- Title: What are the best systems? New perspectives on NLP Benchmarking
- Authors: Pierre Colombo and Nathan Noiry and Ekhine Irurozki and Stephan
Clemencon
- Abstract summary: We propose a new procedure to rank systems based on their performance across different tasks.
Motivated by social choice theory, the final system ordering is obtained by aggregating the rankings induced by each task.
We show that our method yields different conclusions about state-of-the-art systems than the mean-aggregation procedure.
- Score: 10.27421161397197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Machine Learning, a benchmark refers to an ensemble of datasets associated
with one or multiple metrics together with a way to aggregate different systems'
performances. They are instrumental in (i) assessing the progress of new
methods along different axes and (ii) selecting the best systems for practical
use. This is particularly the case for NLP with the development of large
pre-trained models (e.g. GPT, BERT) that are expected to generalize well on a
variety of tasks. While the community has mainly focused on developing new datasets
and metrics, there has been little interest in the aggregation procedure, which
is often reduced to a simple average over various performance measures.
However, this procedure can be problematic when the metrics are on different
scales, which may lead to spurious conclusions. This paper proposes a new
procedure to rank systems based on their performance across different tasks.
Motivated by social choice theory, the final system ordering is obtained
through aggregating the rankings induced by each task and is theoretically
grounded. We conduct extensive numerical experiments (on over 270k scores) to
assess the soundness of our approach both on synthetic and real scores (e.g.
GLUE, EXTREM, SEVAL, TAC, FLICKR). In particular, we show that our method
yields different conclusions on state-of-the-art systems than the
mean-aggregation procedure while being both more reliable and robust.
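To make the scale issue concrete, here is a minimal sketch comparing mean aggregation with a Borda-count-style rank aggregation on synthetic scores. The system names, tasks and numbers are invented, and the Borda rule is used only as one classical social-choice aggregator; it is not necessarily the exact procedure proposed in the paper.

```python
# Minimal sketch: mean aggregation vs. a Borda-count-style rank aggregation
# on synthetic scores. System names, tasks and numbers are made up.
import numpy as np

systems = ["sys_A", "sys_B", "sys_C", "sys_D"]
# Rows = systems, columns = tasks. Task 0 lives on a much larger scale
# (e.g. a 0-100 metric) than tasks 1 and 2 (e.g. correlations in [0, 1]).
scores = np.array([
    [62.0, 0.71, 0.55],
    [70.0, 0.64, 0.52],
    [58.0, 0.78, 0.61],
    [66.0, 0.69, 0.58],
])

# Mean aggregation: the ordering is dominated by the large-scale task.
mean_order = np.argsort(-scores.mean(axis=1))

# Borda-count aggregation: each task contributes only the ranking it induces.
# For every task, a system earns as many points as the number of systems it beats.
per_task_ranks = scores.argsort(axis=0).argsort(axis=0)  # 0 = worst on that task
borda_points = per_task_ranks.sum(axis=1)
borda_order = np.argsort(-borda_points)

print("mean aggregation :", [systems[i] for i in mean_order])
print("Borda aggregation:", [systems[i] for i in borda_order])
```

On these synthetic numbers, the system preferred by the mean (it wins the single large-scale task) comes last under the rank-based aggregation, which is precisely the kind of scale-driven disagreement the paper investigates.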
Related papers
- MISS: Multiclass Interpretable Scoring Systems [13.902264070785986]
We present a machine-learning approach for constructing Multiclass Interpretable Scoring Systems (MISS).
MISS is a fully data-driven methodology for single, sparse, and user-friendly scoring systems for multiclass classification problems.
Results indicate that our approach is competitive with other machine learning models in terms of classification performance metrics and provides well-calibrated class probabilities.
arXiv Detail & Related papers (2024-01-10T10:57:12Z) - Towards More Robust NLP System Evaluation: Handling Missing Scores in
Benchmarks [9.404931130084803]
This paper formalizes an existing problem in NLP research: benchmarking when some systems' scores are missing for a given task.
We introduce an extended benchmark, which contains over 131 million scores, an order of magnitude larger than existing benchmarks.
arXiv Detail & Related papers (2023-05-17T15:20:31Z) - Better Understanding Differences in Attribution Methods via Systematic Evaluations [57.35035463793008]
Post-hoc attribution methods have been proposed to identify image regions most influential to the models' decisions.
We propose three novel evaluation schemes to more reliably measure the faithfulness of those methods.
We use these evaluation schemes to study strengths and shortcomings of some widely used attribution methods over a wide range of models.
arXiv Detail & Related papers (2023-03-21T14:24:58Z) - Accounting for multiplicity in machine learning benchmark performance [0.0]
Using the highest-ranked performance as an estimate of state-of-the-art (SOTA) performance yields a biased, overly optimistic estimate (a toy simulation after this list illustrates the bias).
In this article, we provide a probability distribution for the case of multiple classifiers so that known analysis methods can be applied and a better SOTA estimate can be provided.
arXiv Detail & Related papers (2023-03-10T10:32:18Z) - Vote'n'Rank: Revision of Benchmarking with Social Choice Theory [7.224599819499157]
This paper proposes Vote'n'Rank, a framework for ranking systems in multi-task benchmarks under the principles of social choice theory.
We demonstrate that our approach can be efficiently utilised to draw new insights on benchmarking in several ML sub-fields.
arXiv Detail & Related papers (2022-10-11T20:19:11Z) - Towards Better Understanding Attribution Methods [77.1487219861185]
Post-hoc attribution methods have been proposed to identify image regions most influential to the models' decisions.
We propose three novel evaluation schemes to more reliably measure the faithfulness of those methods.
We also propose a post-processing smoothing step that significantly improves the performance of some attribution methods.
arXiv Detail & Related papers (2022-05-20T20:50:17Z) - An Extensible Benchmark Suite for Learning to Simulate Physical Systems [60.249111272844374]
We introduce a set of benchmark problems to take a step towards unified benchmarks and evaluation protocols.
We propose four representative physical systems, together with a collection of both widely used classical time-based methods and representative data-driven methods.
arXiv Detail & Related papers (2021-08-09T17:39:09Z) - Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We achieve new state-of-the-art results in both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z) - Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking
Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns about whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z) - PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative
Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kind of metric is more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.