Self-Improving-Leaderboard(SIL): A Call for Real-World Centric Natural
Language Processing Leaderboards
- URL: http://arxiv.org/abs/2303.10888v1
- Date: Mon, 20 Mar 2023 06:13:03 GMT
- Title: Self-Improving-Leaderboard(SIL): A Call for Real-World Centric Natural
Language Processing Leaderboards
- Authors: Chanjun Park, Hyeonseok Moon, Seolhwa Lee, Jaehyung Seo, Sugyeong Eo
and Heuiseok Lim
- Abstract summary: We argue that evaluation on a given test dataset is just one of many indicators of a model's performance.
We propose a new paradigm of leaderboard systems that addresses these issues of the current leaderboard system.
- Score: 5.919860270977038
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Leaderboard systems allow researchers to objectively evaluate Natural
Language Processing (NLP) models and are typically used to identify models that
exhibit superior performance on a given task in a predetermined setting.
However, we argue that evaluation on a given test dataset is just one of many
indicators of a model's performance. In this paper, we claim that leaderboard
competitions should also aim to identify models that exhibit the best
performance in a real-world setting. We highlight three issues with current
leaderboard systems: (1) the use of a single, static test set, (2) discrepancy
between testing and real-world application, and (3) the tendency for
leaderboard-centric competition to be biased towards the test set. As a
solution, we propose a new paradigm of leaderboard systems that addresses these
issues of the current leaderboard system. Through this study, we hope to induce a
paradigm shift towards more real-world-centric leaderboard competitions.
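As a rough illustration only (the paper does not provide an implementation, and every class and function name below is hypothetical), the proposed paradigm can be pictured as a loop that periodically folds real-world samples into the test set and re-scores every submitted model, so that rankings track the deployment distribution rather than a single static split:

```python
# Minimal, hypothetical sketch of a self-improving leaderboard loop.
# None of these names come from the paper; they only illustrate the idea of
# periodically refreshing the test set with real-world samples and re-ranking.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SelfImprovingLeaderboard:
    # model name -> prediction function (input text in, predicted label out)
    models: Dict[str, Callable[[str], str]]
    # current test set: (input, gold label) pairs, refreshed over time
    test_set: List[Tuple[str, str]] = field(default_factory=list)

    def refresh(self, real_world_samples: List[Tuple[str, str]], keep: int = 1000) -> None:
        """Fold newly collected real-world samples into the test set,
        keeping only the most recent `keep` examples so the set stays dynamic."""
        self.test_set = (self.test_set + real_world_samples)[-keep:]

    def rank(self) -> List[Tuple[str, float]]:
        """Re-score every model on the current test set and sort by accuracy."""
        scores = {
            name: sum(predict(x) == y for x, y in self.test_set) / max(len(self.test_set), 1)
            for name, predict in self.models.items()
        }
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Under such a loop, a model's rank can change as the test distribution drifts toward real-world usage, which is exactly the behaviour a single static test set cannot capture.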
Related papers
- Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards [67.65408769829524]
Scientific leaderboards are standardized ranking systems that facilitate evaluating and comparing competitive methods.
The exponential increase in publications has made it infeasible to construct and maintain these leaderboards manually.
Automatic leaderboard construction has emerged as a solution to reduce manual labor.
arXiv Detail & Related papers (2024-09-19T11:12:27Z) - Software Mention Recognition with a Three-Stage Framework Based on BERTology Models at SOMD 2024 [0.0]
This paper describes our systems for Sub-task I of the Software Mention Detection in Scholarly Publications shared task.
Our best performing system addresses the named entity recognition problem through a three-stage framework.
Our XLM-R-based framework achieves a weighted F1-score of 67.80%, earning our team 3rd place in Sub-task I of the Software Mention Recognition task.
arXiv Detail & Related papers (2024-04-23T17:06:24Z) - When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards [9.751405901938895]
We show that under existing leaderboards, the relative performance of LLMs is highly sensitive to minute details.
We show that for popular multiple-choice question benchmarks (e.g., MMLU), minor perturbations, such as changing the order of choices or the method of answer selection, shift model rankings by up to 8 positions (a minimal illustration of choice-order perturbation appears in the sketch after this list).
arXiv Detail & Related papers (2024-02-01T19:12:25Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language
Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - Resources for Brewing BEIR: Reproducible Reference Models and an
Official Leaderboard [47.73060223236792]
BEIR is a benchmark for the evaluation of information retrieval models across 18 different domain/task combinations.
Our work addresses two shortcomings that prevent the benchmark from achieving its full potential.
arXiv Detail & Related papers (2023-06-13T00:26:18Z) - ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented
Visual Models [102.63817106363597]
We build ELEVATER, the first benchmark to compare and evaluate pre-trained language-augmented visual models.
It consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge.
We will release our toolkit and evaluation platforms for the research community.
arXiv Detail & Related papers (2022-04-19T10:23:42Z) - How Robust are Model Rankings: A Leaderboard Customization Approach for
Equitable Evaluation [0.0]
Models that top leaderboards often perform unsatisfactorily when deployed in real-world applications.
We introduce a task-agnostic method to probe leaderboards by weighting samples based on their difficulty level.
We find that leaderboards can be adversarially attacked and that top-performing models may not always be the best models.
arXiv Detail & Related papers (2021-06-10T06:47:35Z) - Dynaboard: An Evaluation-As-A-Service Platform for Holistic
Next-Generation Benchmarking [41.99715850562528]
We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison.
Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset.
arXiv Detail & Related papers (2021-05-21T01:17:52Z) - EXPLAINABOARD: An Explainable Leaderboard for NLP [69.59340280972167]
ExplainaBoard is a new conceptualization and implementation of NLP evaluation.
It allows researchers to (i) diagnose strengths and weaknesses of a single system and (ii) interpret relationships between multiple systems.
arXiv Detail & Related papers (2021-04-13T17:45:50Z) - GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation [83.10599735938618]
Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository.
This work introduces GENIE, a human evaluation leaderboard that brings the ease of leaderboards to text generation tasks.
arXiv Detail & Related papers (2021-01-17T00:40:47Z)
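As noted in the "When Benchmarks are Targets" entry above, choice-order sensitivity can be probed with a simple perturbation check. The sketch below is a hedged illustration: it assumes a generic answer(question, choices) callable for the model and is not tied to any particular benchmark harness or to that paper's exact protocol.

```python
# Hypothetical sketch: measure how often a model's answer changes when the
# order of multiple-choice options is shuffled (choice-order sensitivity).
import random
from typing import Callable, List


def choice_order_sensitivity(
    answer: Callable[[str, List[str]], str],  # assumed model interface: returns the chosen option text
    question: str,
    choices: List[str],
    n_shuffles: int = 10,
    seed: int = 0,
) -> float:
    """Fraction of shuffles on which the selected option differs from the
    answer given for the original choice order."""
    rng = random.Random(seed)
    baseline = answer(question, choices)
    flips = 0
    for _ in range(n_shuffles):
        shuffled = choices[:]
        rng.shuffle(shuffled)
        if answer(question, shuffled) != baseline:
            flips += 1
    return flips / n_shuffles
```

A high flip rate on a check like this is one concrete symptom of the ranking instability that the entry above reports.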