A Survey of Parameters Associated with the Quality of Benchmarks in NLP
- URL: http://arxiv.org/abs/2210.07566v1
- Date: Fri, 14 Oct 2022 06:44:14 GMT
- Title: A Survey of Parameters Associated with the Quality of Benchmarks in NLP
- Authors: Swaroop Mishra, Anjana Arunkumar, Chris Bryan and Chitta Baral
- Abstract summary: Recent studies have shown that models triumph over several popular benchmarks just by overfitting on spurious biases, without truly learning the desired task.
A potential solution to these issues -- a metric quantifying quality -- remains underexplored.
We take the first step towards a metric by identifying certain language properties that can represent various possible interactions leading to biases in a benchmark.
- Score: 24.6240575061124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several benchmarks have been built with heavy investment in resources to
track our progress in NLP. Thousands of papers published in response to those
benchmarks have competed to top leaderboards, with models often surpassing
human performance. However, recent studies have shown that models triumph over
several popular benchmarks just by overfitting on spurious biases, without
truly learning the desired task. Despite this finding, benchmarking, while
trying to tackle bias, still relies on workarounds that do not fully utilize
the resources invested in benchmark creation (since low-quality data are
discarded) and that cover only limited types of bias. A potential solution to these
issues -- a metric quantifying quality -- remains underexplored. Inspired by
successful quality indices in several domains such as power, food, and water,
we take the first step towards a metric by identifying certain language
properties that can represent various possible interactions leading to biases
in a benchmark. We look for bias related parameters which can potentially help
pave our way towards the metric. We survey existing works and identify
parameters capturing various properties of bias, their origins, types and
impact on performance, generalization, and robustness. Our analysis spans
datasets and a hierarchy of tasks ranging from NLI to Summarization, ensuring
that our parameters are generic and not overfitted to a specific task
or dataset. We also develop certain parameters in this process.
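
As a purely illustrative example of the kind of bias-related parameter discussed above, the sketch below scores token-label associations with pointwise mutual information (PMI), a common proxy for spurious lexical cues in NLI-style data. The toy examples and the function name `token_label_pmi` are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from math import log2

# Toy (text, label) pairs standing in for an NLI-style benchmark split.
# Purely illustrative data; the survey does not prescribe this exact test.
examples = [
    ("a man is not sleeping", "contradiction"),
    ("nobody is outside", "contradiction"),
    ("a woman is never happy", "contradiction"),
    ("a dog runs in the park", "entailment"),
    ("children are playing outside", "entailment"),
    ("people are eating lunch", "neutral"),
]

def token_label_pmi(pairs):
    """Estimate PMI(token; label) over (text, label) pairs.

    Tokens with high PMI toward a single label (e.g. negation words
    co-occurring with 'contradiction') are candidate spurious cues that
    a model could exploit without learning the underlying task.
    """
    n = len(pairs)
    token_counts, label_counts, joint_counts = Counter(), Counter(), Counter()
    for text, label in pairs:
        label_counts[label] += 1
        for tok in set(text.lower().split()):
            token_counts[tok] += 1
            joint_counts[(tok, label)] += 1
    return {
        (tok, label): log2((joint / n) /
                           ((token_counts[tok] / n) * (label_counts[label] / n)))
        for (tok, label), joint in joint_counts.items()
    }

# Report the strongest token-label associations as potential bias parameters.
scores = token_label_pmi(examples)
for (tok, label), score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{tok!r} <-> {label}: PMI = {score:.2f}")
```

On real data, tokens that score highly against a single label (such as negation words against the contradiction class) would be flagged for closer inspection as potential annotation artifacts.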
Related papers
- BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices [28.70453947993952]
We develop an assessment framework considering 46 best practices across an AI benchmark's lifecycle and evaluate 24 AI benchmarks against it.
We find that there exist large quality differences and that commonly used benchmarks suffer from significant issues.
arXiv Detail & Related papers (2024-11-20T02:38:24Z)
- Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards [5.632231145349045]
This paper investigates the transparency in the creation of benchmarks and the use of leaderboards for measuring progress in NLP.
Existing relation extraction benchmarks often suffer from insufficient documentation and lack crucial details.
While our discussion centers on the transparency of RE benchmarks and leaderboards, the observations we discuss are broadly applicable to other NLP tasks as well.
arXiv Detail & Related papers (2024-11-07T22:36:19Z)
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and automatically generates visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance [4.291589126905706]
In the AutoML domain, test accuracy is heralded as the quintessential metric for evaluating model efficacy.
However, the reliability of test accuracy as the primary performance metric has been called into question.
The distribution of hard samples between training and test sets affects the difficulty levels of those sets.
We propose a benchmarking procedure for comparing hard sample identification methods.
arXiv Detail & Related papers (2024-09-22T11:38:14Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- SuperTweetEval: A Challenging, Unified and Heterogeneous Benchmark for Social Media NLP Research [33.698581876383074]
We introduce a unified benchmark for NLP evaluation in social media, SuperTweetEval.
We benchmarked the performance of a wide range of models on SuperTweetEval and our results suggest that, despite the recent advances in language modelling, social media remains challenging.
arXiv Detail & Related papers (2023-10-23T09:48:25Z)
- Towards QD-suite: developing a set of benchmarks for Quality-Diversity algorithms [0.0]
Existing benchmarks are not standardized, and there is currently no MNIST equivalent for Quality-Diversity (QD).
We argue that the identification of challenges faced by QD methods and the development of targeted, challenging, scalable benchmarks is an important step.
arXiv Detail & Related papers (2022-05-06T13:33:50Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflated evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries (an illustrative sketch of this kind of confidence-weighted prototype update appears after this list).
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
- Stance Detection Benchmark: How Robust Is Your Stance Detection? [65.91772010586605]
Stance Detection (StD) aims to detect an author's stance towards a certain topic or claim.
We introduce a StD benchmark that learns from ten StD datasets of various domains in a multi-dataset learning setting.
Within this benchmark setup, we are able to present new state-of-the-art results on five of the datasets.
arXiv Detail & Related papers (2020-01-06T13:37:51Z)
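
The Meta-Learned Confidence entry above describes a transductive prototype update weighted by query confidence. The sketch below, referenced from that entry, shows one way such a confidence-weighted refinement could look; it substitutes a fixed softmax temperature for the paper's meta-learned per-query confidence, so the weighting scheme, function name, and synthetic episode are assumptions rather than the authors' method.

```python
import numpy as np

def refine_prototypes(support, support_labels, query, num_classes, temp=1.0):
    """Transductive prototype refinement (illustrative sketch).

    1. Build class prototypes from the labeled support embeddings.
    2. Score each unlabeled query by a softmax over negative Euclidean
       distances to the prototypes (its "confidence").
    3. Recompute each prototype as a confidence-weighted mean of
       support and query embeddings.

    The fixed temperature `temp` stands in for the meta-learned
    per-query confidence of the original paper.
    """
    # Step 1: initial prototypes = per-class mean of support embeddings.
    protos = np.stack([support[support_labels == c].mean(axis=0)
                       for c in range(num_classes)])

    # Step 2: soft assignment of queries to classes.
    dists = ((query[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    logits = -dists / temp
    conf = np.exp(logits - logits.max(axis=1, keepdims=True))
    conf /= conf.sum(axis=1, keepdims=True)          # (n_query, num_classes)

    # Step 3: confidence-weighted prototype update.
    refined = []
    for c in range(num_classes):
        sup_c = support[support_labels == c]
        w = conf[:, c:c + 1]                         # weights for class c
        num = sup_c.sum(axis=0) + (w * query).sum(axis=0)
        den = len(sup_c) + w.sum()
        refined.append(num / den)
    return np.stack(refined)

# Tiny synthetic 2-way episode: 2 support and 3 query points per class.
rng = np.random.default_rng(0)
support = np.vstack([rng.normal(0, 0.1, (2, 4)), rng.normal(1, 0.1, (2, 4))])
labels = np.array([0, 0, 1, 1])
query = np.vstack([rng.normal(0, 0.1, (3, 4)), rng.normal(1, 0.1, (3, 4))])
print(refine_prototypes(support, labels, query, num_classes=2))
```

Queries assigned confidently to a class pull that class's prototype toward themselves, which is the intuition behind using unlabeled queries transductively.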
This list is automatically generated from the titles and abstracts of the papers in this site.