EXPLAINABOARD: An Explainable Leaderboard for NLP
- URL: http://arxiv.org/abs/2104.06387v1
- Date: Tue, 13 Apr 2021 17:45:50 GMT
- Title: EXPLAINABOARD: An Explainable Leaderboard for NLP
- Authors: Pengfei Liu, Jinlan Fu, Yang Xiao, Weizhe Yuan, Shuaicheng Chang,
Junqi Dai, Yixin Liu, Zihuiwen Ye, Graham Neubig
- Abstract summary: ExplainaBoard is a new conceptualization and implementation of NLP evaluation.
It allows researchers to (i) diagnose strengths and weaknesses of a single system and (ii) interpret relationships between multiple systems.
- Score: 69.59340280972167
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid development of NLP research, leaderboards have emerged as one
tool to track the performance of various systems on various NLP tasks. They are
effective in this goal to some extent, but generally present a rather
simplistic one-dimensional view of the submitted systems, communicated only
through holistic accuracy numbers. In this paper, we present a new
conceptualization and implementation of NLP evaluation: the ExplainaBoard,
which in addition to inheriting the functionality of the standard leaderboard,
also allows researchers to (i) diagnose strengths and weaknesses of a single
system (e.g. what is the best-performing system bad at?) (ii) interpret
relationships between multiple systems (e.g. where does system A outperform
system B? What if we combine systems A, B, C?) and (iii) examine prediction
results closely (e.g. what are common errors made by multiple systems, or in
what contexts do particular errors occur?). ExplainaBoard has been deployed at
\url{http://explainaboard.nlpedia.ai/}, and we have additionally released our
interpretable evaluation code at \url{https://github.com/neulab/ExplainaBoard}
and output files from more than 300 systems, 40 datasets, and 9 tasks to
motivate future "output-driven" research.
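The kind of fine-grained diagnosis described above can be approximated by bucketing test examples along an interpretable attribute and reporting per-bucket accuracy instead of a single holistic number. The sketch below is a minimal, self-contained illustration of that idea rather than the released ExplainaBoard API; the attribute (whitespace token count), the bucket edges, and the toy data are illustrative assumptions.

```python
from collections import defaultdict

def bucketed_accuracy(examples, predictions, attribute_fn, bucket_edges):
    """Group test examples into buckets of an interpretable attribute
    (e.g. sentence length) and report accuracy per bucket, so the weak
    spots of a single system become visible."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex, pred in zip(examples, predictions):
        value = attribute_fn(ex)
        # Assign the example to the first bucket whose upper edge covers it.
        bucket = next((edge for edge in bucket_edges if value <= edge), bucket_edges[-1])
        total[bucket] += 1
        correct[bucket] += int(pred == ex["label"])
    return {b: correct[b] / total[b] for b in sorted(total)}

# Illustrative usage on toy sentiment predictions (attribute = sentence length).
examples = [
    {"text": "good movie", "label": 1},
    {"text": "a thoroughly dull and overlong film", "label": 0},
    {"text": "fine", "label": 1},
]
predictions = [1, 1, 1]
print(bucketed_accuracy(
    examples, predictions,
    attribute_fn=lambda ex: len(ex["text"].split()),
    bucket_edges=[3, 6, 100],   # hypothetical length buckets (<=3, <=6, <=100 tokens)
))
# Only buckets that contain examples appear, e.g. {3: 1.0, 6: 0.0}:
# the toy system is perfect on short inputs but fails on the longer one.
```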
Related papers
- Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards [67.65408769829524]
Scientific leaderboards are standardized ranking systems that facilitate evaluating and comparing competitive methods.
The exponential increase in publications has made it infeasible to construct and maintain these leaderboards manually.
Automatic leaderboard construction has emerged as a solution to reduce this manual labor.
arXiv Detail & Related papers (2024-09-19T11:12:27Z)
- Joint Speech Activity and Overlap Detection with Multi-Exit Architecture [5.4878772986187565]
Overlapped speech detection (OSD) is critical for speech applications in scenarios of multi-party conversation.
This study investigates the joint VAD and OSD task from a new perspective.
In particular, we propose to extend traditional classification network with multi-exit architecture.
arXiv Detail & Related papers (2022-09-24T02:34:11Z)
- PGX: A Multi-level GNN Explanation Framework Based on Separate Knowledge Distillation Processes [0.2005299372367689]
We propose a multi-level GNN explanation framework based on an observation that GNN is a multimodal learning process of multiple components in graph data.
The complexity of the original problem is relaxed by breaking it into multiple sub-parts represented as a hierarchical structure.
We also aim for personalized explanations as the framework can generate different results based on user preferences.
arXiv Detail & Related papers (2022-08-05T10:14:48Z)
- A novel evaluation methodology for supervised Feature Ranking algorithms [0.0]
This paper proposes a new evaluation methodology for Feature Rankers.
By making use of synthetic datasets, feature importance scores can be known beforehand, allowing more systematic evaluation.
To facilitate large-scale experimentation using the new methodology, a benchmarking framework was built in Python, called fseval.
arXiv Detail & Related papers (2022-07-09T12:00:36Z)
- BLISS: Robust Sequence-to-Sequence Learning via Self-Supervised Input Representation [92.75908003533736]
We propose a framework-level robust sequence-to-sequence learning approach, named BLISS, via self-supervised input representation.
We conduct comprehensive experiments to validate the effectiveness of BLISS on various tasks, including machine translation, grammatical error correction, and text summarization.
arXiv Detail & Related papers (2022-04-16T16:19:47Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- What are the best systems? New perspectives on NLP Benchmarking [10.27421161397197]
We propose a new procedure to rank systems based on their performance across different tasks.
Motivated by social choice theory, the final system ordering is obtained by aggregating the rankings induced by each task.
We show that our method yields different conclusions on state-of-the-art systems than the mean-aggregation procedure (see the rank-aggregation sketch after this list).
arXiv Detail & Related papers (2022-02-08T11:44:20Z)
- Knowledge Graph Question Answering Leaderboard: A Community Resource to Prevent a Replication Crisis [61.740077541531726]
We provide a new central and open leaderboard for any KGQA benchmark dataset as a focal point for the community.
Our analysis highlights existing problems during the evaluation of KGQA systems.
arXiv Detail & Related papers (2022-01-20T13:46:01Z)
- SpanNer: Named Entity Re-/Recognition as Span Prediction [62.66148736099347]
A span prediction model is used for named entity recognition.
We experimentally implement 154 systems on 11 datasets, covering three languages.
Our model has been deployed into the ExplainaBoard platform.
arXiv Detail & Related papers (2021-06-01T17:11:42Z)
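As a concrete illustration of the rank-aggregation idea from the "What are the best systems?" entry above, the following is a minimal sketch of one simple social-choice-style scheme, a Borda count over per-task rankings. The system names and scores are hypothetical, and the cited paper's actual aggregation procedure may differ from this simplified variant.

```python
from collections import defaultdict

def borda_aggregate(task_scores):
    """Aggregate per-task rankings with a Borda count: on each task a system
    earns one point per system it beats, and the final ordering sorts
    systems by total points across tasks."""
    points = defaultdict(int)
    for task, scores in task_scores.items():
        ranked = sorted(scores, key=scores.get)    # worst -> best on this task
        for position, system in enumerate(ranked):
            points[system] += position             # beats `position` systems
    return sorted(points, key=points.get, reverse=True)

# Hypothetical per-task metric values for three systems on three tasks.
task_scores = {
    "ner":           {"sysA": 91.0, "sysB": 89.5, "sysC": 90.2},
    "summarization": {"sysA": 35.1, "sysB": 36.8, "sysC": 36.0},
    "nli":           {"sysA": 88.0, "sysB": 90.1, "sysC": 87.5},
}
print(borda_aggregate(task_scores))
# A mean of raw metrics would mix incomparable scales; the Borda ordering
# uses only within-task ranks and yields ['sysB', 'sysA', 'sysC'] here.
```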
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.