EXPLAINABOARD: An Explainable Leaderboard for NLP
- URL: http://arxiv.org/abs/2104.06387v1
- Date: Tue, 13 Apr 2021 17:45:50 GMT
- Title: EXPLAINABOARD: An Explainable Leaderboard for NLP
- Authors: Pengfei Liu, Jinlan Fu, Yang Xiao, Weizhe Yuan, Shuaicheng Chang,
Junqi Dai, Yixin Liu, Zihuiwen Ye, Graham Neubig
- Abstract summary: ExplainaBoard is a new conceptualization and implementation of NLP evaluation.
It allows researchers to (i) diagnose strengths and weaknesses of a single system and (ii) interpret relationships between multiple systems.
- Score: 69.59340280972167
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid development of NLP research, leaderboards have emerged as one
tool to track the performance of various systems on various NLP tasks. They are
effective in this goal to some extent, but generally present a rather
simplistic one-dimensional view of the submitted systems, communicated only
through holistic accuracy numbers. In this paper, we present a new
conceptualization and implementation of NLP evaluation: the ExplainaBoard,
which in addition to inheriting the functionality of the standard leaderboard,
also allows researchers to (i) diagnose strengths and weaknesses of a single
system (e.g. what is the best-performing system bad at?) (ii) interpret
relationships between multiple systems (e.g. where does system A outperform
system B? What if we combine systems A, B, C?) and (iii) examine prediction
results closely (e.g. what are common errors made by multiple systems, or in
what contexts do particular errors occur?). ExplainaBoard has been deployed at
\url{http://explainaboard.nlpedia.ai/}, and we have additionally released our
interpretable evaluation code at \url{https://github.com/neulab/ExplainaBoard}
and output files from more than 300 systems, 40 datasets, and 9 tasks to
motivate future "output-driven" research.
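The kind of fine-grained diagnosis described above can be approximated by bucketing test examples along an interpretable attribute and reporting per-bucket accuracy instead of a single holistic number. The sketch below is a minimal, self-contained illustration of that idea rather than the released ExplainaBoard API; the attribute (whitespace token count), the bucket edges, and the toy data are illustrative assumptions.

```python
from collections import defaultdict

def bucketed_accuracy(examples, predictions, attribute_fn, bucket_edges):
    """Group test examples into buckets of an interpretable attribute
    (e.g. sentence length) and report accuracy per bucket, so the weak
    spots of a single system become visible."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex, pred in zip(examples, predictions):
        value = attribute_fn(ex)
        # Assign the example to the first bucket whose upper edge covers it.
        bucket = next((edge for edge in bucket_edges if value <= edge), bucket_edges[-1])
        total[bucket] += 1
        correct[bucket] += int(pred == ex["label"])
    return {b: correct[b] / total[b] for b in sorted(total)}

# Illustrative usage on toy sentiment predictions (attribute = sentence length).
examples = [
    {"text": "good movie", "label": 1},
    {"text": "a thoroughly dull and overlong film", "label": 0},
    {"text": "fine", "label": 1},
]
predictions = [1, 1, 1]
print(bucketed_accuracy(
    examples, predictions,
    attribute_fn=lambda ex: len(ex["text"].split()),
    bucket_edges=[3, 6, 100],   # hypothetical length buckets (<=3, <=6, <=100 tokens)
))
# Only buckets that contain examples appear, e.g. {3: 1.0, 6: 0.0}:
# the toy system is perfect on short inputs but fails on the longer one.
```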
Related papers
- Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards [67.65408769829524]
Scientific leaderboards are standardized ranking systems that facilitate evaluating and comparing competitive methods.
The exponential increase in publications has made it infeasible to construct and maintain these leaderboards manually.
Automatic leaderboard construction has emerged as a solution to reduce this manual labor.
arXiv Detail & Related papers (2024-09-19T11:12:27Z)
- Joint Speech Activity and Overlap Detection with Multi-Exit Architecture [5.4878772986187565]
Overlapped speech detection (OSD) is critical for speech applications in scenarios of multi-party conversation.
This study investigates the joint VAD and OSD task from a new perspective.
In particular, we propose to extend traditional classification network with multi-exit architecture.
arXiv Detail & Related papers (2022-09-24T02:34:11Z)
- PGX: A Multi-level GNN Explanation Framework Based on Separate Knowledge Distillation Processes [0.2005299372367689]
We propose a multi-level GNN explanation framework based on an observation that GNN is a multimodal learning process of multiple components in graph data.
The complexity of the original problem is relaxed by breaking it into multiple sub-parts represented as a hierarchical structure.
We also aim for personalized explanations as the framework can generate different results based on user preferences.
arXiv Detail & Related papers (2022-08-05T10:14:48Z)
- A novel evaluation methodology for supervised Feature Ranking algorithms [0.0]
This paper proposes a new evaluation methodology for Feature Rankers.
By making use of synthetic datasets, feature importance scores can be known beforehand, allowing more systematic evaluation.
To facilitate large-scale experimentation using the new methodology, a benchmarking framework was built in Python, called fseval.
arXiv Detail & Related papers (2022-07-09T12:00:36Z)
- BLISS: Robust Sequence-to-Sequence Learning via Self-Supervised Input Representation [92.75908003533736]
We propose a framework-level robust sequence-to-sequence learning approach, named BLISS, via self-supervised input representation.
We conduct comprehensive experiments to validate the effectiveness of BLISS on various tasks, including machine translation, grammatical error correction, and text summarization.
arXiv Detail & Related papers (2022-04-16T16:19:47Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- What are the best systems? New perspectives on NLP Benchmarking [10.27421161397197]
We propose a new procedure to rank systems based on their performance across different tasks.
Motivated by social choice theory, the final system ordering is obtained by aggregating the rankings induced by each task.
We show that our method yields different conclusions on state-of-the-art systems than the mean-aggregation procedure (see the rank-aggregation sketch after this list).
arXiv Detail & Related papers (2022-02-08T11:44:20Z)
- Knowledge Graph Question Answering Leaderboard: A Community Resource to Prevent a Replication Crisis [61.740077541531726]
We provide a new central and open leaderboard for any KGQA benchmark dataset as a focal point for the community.
Our analysis highlights existing problems during the evaluation of KGQA systems.
arXiv Detail & Related papers (2022-01-20T13:46:01Z)
- SpanNer: Named Entity Re-/Recognition as Span Prediction [62.66148736099347]
A span prediction model is used for named entity recognition.
We experimentally implement 154 systems on 11 datasets, covering three languages.
Our model has been deployed into the ExplainaBoard platform.
arXiv Detail & Related papers (2021-06-01T17:11:42Z)
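As a concrete illustration of the rank-aggregation idea from the "What are the best systems?" entry above, the following is a minimal sketch of one simple social-choice-style scheme, a Borda count over per-task rankings. The system names and scores are hypothetical, and the cited paper's actual aggregation procedure may differ from this simplified variant.

```python
from collections import defaultdict

def borda_aggregate(task_scores):
    """Aggregate per-task rankings with a Borda count: on each task a system
    earns one point per system it beats, and the final ordering sorts
    systems by total points across tasks."""
    points = defaultdict(int)
    for task, scores in task_scores.items():
        ranked = sorted(scores, key=scores.get)    # worst -> best on this task
        for position, system in enumerate(ranked):
            points[system] += position             # beats `position` systems
    return sorted(points, key=points.get, reverse=True)

# Hypothetical per-task metric values for three systems on three tasks.
task_scores = {
    "ner":           {"sysA": 91.0, "sysB": 89.5, "sysC": 90.2},
    "summarization": {"sysA": 35.1, "sysB": 36.8, "sysC": 36.0},
    "nli":           {"sysA": 88.0, "sysB": 90.1, "sysC": 87.5},
}
print(borda_aggregate(task_scores))
# A mean of raw metrics would mix incomparable scales; the Borda ordering
# uses only within-task ranks and yields ['sysB', 'sysA', 'sysC'] here.
```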
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.