A Survey of Parameters Associated with the Quality of Benchmarks in NLP
- URL: http://arxiv.org/abs/2210.07566v1
- Date: Fri, 14 Oct 2022 06:44:14 GMT
- Title: A Survey of Parameters Associated with the Quality of Benchmarks in NLP
- Authors: Swaroop Mishra, Anjana Arunkumar, Chris Bryan and Chitta Baral
- Abstract summary: Recent studies have shown that models triumph over several popular benchmarks just by overfitting on spurious biases, without truly learning the desired task.
A potential solution to these issues -- a metric quantifying quality -- remains underexplored.
We take the first step towards a metric by identifying certain language properties that can represent various possible interactions leading to biases in a benchmark.
- Score: 24.6240575061124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several benchmarks have been built with heavy investment in resources to
track our progress in NLP. Thousands of papers published in response to those
benchmarks have competed to top leaderboards, with models often surpassing
human performance. However, recent studies have shown that models triumph over
several popular benchmarks just by overfitting on spurious biases, without
truly learning the desired task. Despite this finding, efforts to tackle bias in
benchmarking still rely on workarounds that discard low-quality data, thereby
underutilizing the resources invested in benchmark creation, and that cover only
a limited set of biases. A potential solution to these
issues -- a metric quantifying quality -- remains underexplored. Inspired by
successful quality indices in several domains such as power, food, and water,
we take the first step towards a metric by identifying certain language
properties that can represent various possible interactions leading to biases
in a benchmark. We look for bias related parameters which can potentially help
pave our way towards the metric. We survey existing works and identify
parameters capturing various properties of bias, their origins, types and
impact on performance, generalization, and robustness. Our analysis spans
datasets and a hierarchy of tasks ranging from NLI to Summarization, ensuring
that our parameters are generic and not overfitted to a specific task or
dataset. We also develop certain parameters in this process.
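The abstract stops short of a concrete formula, but the intended shape of such a quality metric can be sketched. The snippet below is a hypothetical illustration only, not the survey's proposal: the parameter names (lexical_overlap, label_imbalance, annotator_artifacts, length_bias) and the weighted-average aggregation are assumptions made for the example.

```python
# Minimal sketch: aggregating bias-related parameters into a benchmark quality score.
# Parameter names, weights, and the weighted average are illustrative assumptions.
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class BiasParameters:
    """Per-benchmark scores in [0, 1]; higher means the property is less biased."""
    lexical_overlap: float      # word-overlap artifacts between inputs (e.g., NLI premise/hypothesis)
    label_imbalance: float      # skew in the label distribution
    annotator_artifacts: float  # give-away phrasing introduced during annotation
    length_bias: float          # correlation between input length and gold label

def quality_score(p: BiasParameters, weights: Optional[Dict[str, float]] = None) -> float:
    """Aggregate bias-related parameters into a single quality score in [0, 1]."""
    values = vars(p)  # dataclass fields as a dict
    weights = weights or {name: 1.0 for name in values}
    total = sum(weights[name] for name in values)
    return sum(weights[name] * values[name] for name in values) / total

# Example: strong lexical-overlap artifacts pull the overall quality down.
print(quality_score(BiasParameters(0.3, 0.8, 0.7, 0.9)))  # 0.675 with equal weights
```

In practice, the parameters would be instantiated from the language properties the survey identifies, and the simple weighted average could be replaced by any aggregation that reflects their relative impact on performance, generalization, and robustness.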
Related papers
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance [4.291589126905706]
In the AutoML domain, test accuracy is heralded as the quintessential metric for evaluating model efficacy.
However, the reliability of test accuracy as the primary performance metric has been called into question.
The distribution of hard samples between training and test sets affects the difficulty levels of those sets.
We propose a benchmarking procedure for comparing hard sample identification methods.
arXiv Detail & Related papers (2024-09-22T11:38:14Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- SuperTweetEval: A Challenging, Unified and Heterogeneous Benchmark for Social Media NLP Research [33.698581876383074]
We introduce a unified benchmark for NLP evaluation in social media, SuperTweetEval.
We benchmarked the performance of a wide range of models on SuperTweetEval and our results suggest that, despite the recent advances in language modelling, social media remains challenging.
arXiv Detail & Related papers (2023-10-23T09:48:25Z)
- Benchmark tasks for Quality-Diversity applied to Uncertain domains [1.5469452301122175]
We introduce a set of 8 easy-to-implement and lightweight tasks, split into 3 main categories.
We identify the key uncertainty properties to easily define UQD benchmark tasks.
All our tasks build on the Redundant Arm: a common QD environment that is lightweight and easily replicable.
arXiv Detail & Related papers (2023-04-24T21:23:26Z)
- Towards QD-suite: developing a set of benchmarks for Quality-Diversity algorithms [0.0]
Existing benchmarks are not standardized, and there is currently no MNIST equivalent for Quality-Diversity (QD).
We argue that the identification of challenges faced by QD methods and the development of targeted, challenging, scalable benchmarks is an important step.
arXiv Detail & Related papers (2022-05-06T13:33:50Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- A critical analysis of metrics used for measuring progress in artificial intelligence [9.387811897655016]
We analyse the current landscape of performance metrics based on data covering 3867 machine learning model performance results.
Results suggest that the large majority of metrics currently used have properties that may result in an inadequate reflection of a model's performance.
We describe ambiguities in reported metrics, which may lead to difficulties in interpreting and comparing model performances.
arXiv Detail & Related papers (2020-08-06T11:14:37Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
- Stance Detection Benchmark: How Robust Is Your Stance Detection? [65.91772010586605]
Stance Detection (StD) aims to detect an author's stance towards a certain topic or claim.
We introduce a StD benchmark that learns from ten StD datasets of various domains in a multi-dataset learning setting.
Within this benchmark setup, we are able to present new state-of-the-art results on five of the datasets.
arXiv Detail & Related papers (2020-01-06T13:37:51Z)