ILDAE: Instance-Level Difficulty Analysis of Evaluation Data
- URL: http://arxiv.org/abs/2203.03073v2
- Date: Wed, 9 Mar 2022 01:55:24 GMT
- Title: ILDAE: Instance-Level Difficulty Analysis of Evaluation Data
- Authors: Neeraj Varshney, Swaroop Mishra, and Chitta Baral
- Abstract summary: We conduct Instance-Level Difficulty Analysis of Evaluation data (ILDAE) in a large-scale setup of 23 datasets.
We demonstrate its five novel applications: 1) conducting efficient-yet-accurate evaluations with fewer instances, saving computational cost and time, 2) improving the quality of existing evaluation datasets by repairing erroneous and trivial instances, 3) selecting the best model based on application requirements, 4) analyzing dataset characteristics to guide future data creation, and 5) estimating Out-of-Domain performance reliably.
- Score: 22.043291547405545
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Knowledge of questions' difficulty level helps a teacher in several ways,
such as estimating students' potential quickly by asking carefully selected
questions and improving the quality of an examination by modifying trivial and
hard questions. Can we extract such benefits of instance difficulty in NLP? To
this end, we conduct Instance-Level Difficulty Analysis of Evaluation data
(ILDAE) in a large-scale setup of 23 datasets and demonstrate its five novel
applications: 1) conducting efficient-yet-accurate evaluations with fewer
instances, saving computational cost and time, 2) improving the quality of
existing evaluation datasets by repairing erroneous and trivial instances, 3)
selecting the best model based on application requirements, 4) analyzing
dataset characteristics to guide future data creation, and 5) estimating
Out-of-Domain performance reliably. Comprehensive experiments for these
applications yield several interesting findings, such as: evaluation using
just 5% of instances (selected via ILDAE) achieves as high as 0.93 Kendall
correlation with evaluation using the complete dataset, and computing weighted
accuracy using difficulty scores leads to 5.2% higher correlation with
Out-of-Domain performance. We release the difficulty scores and hope our
analyses and findings will bring more attention to this important yet
understudied field of leveraging instance difficulty in evaluations.
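The two quantitative findings above can be illustrated with a short sketch (the data, `kendall_tau`, and `weighted_accuracy` below are hypothetical helpers for illustration, not the authors' released code): rank models by accuracy on a small difficulty-selected subset and check how well that ranking agrees with the full-dataset ranking via Kendall correlation, and compute an accuracy where each instance is weighted by its difficulty score.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation between two score lists (no tie handling)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        a = scores_a[i] - scores_a[j]
        b = scores_b[i] - scores_b[j]
        if a * b > 0:
            concordant += 1
        elif a * b < 0:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0

def weighted_accuracy(correct, difficulty):
    """Accuracy where each instance is weighted by its difficulty score."""
    return sum(c * w for c, w in zip(correct, difficulty)) / sum(difficulty)

# Hypothetical per-model accuracies on the full set vs. a small subset.
full_set = [0.81, 0.74, 0.69, 0.88, 0.77]
subset   = [0.80, 0.73, 0.70, 0.90, 0.76]
print(kendall_tau(full_set, subset))  # 1.0: subset preserves the model ranking

# Hypothetical per-instance correctness (1/0) and difficulty scores.
correct    = [1, 0, 1, 1, 0, 1]
difficulty = [0.9, 0.8, 0.2, 0.5, 0.7, 0.1]
print(round(weighted_accuracy(correct, difficulty), 3))  # 0.531
```

A Kendall correlation of 1.0 means the subset ranks the models identically to the full set; the paper's 0.93 figure indicates near-identical rankings from just 5% of instances.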
Related papers
- How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics [49.9329723199239]
We propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples.
We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics.
When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset.
arXiv Detail & Related papers (2024-10-04T13:39:21Z)
- Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance [4.291589126905706]
In the AutoML domain, test accuracy is heralded as the quintessential metric for evaluating model efficacy.
However, the reliability of test accuracy as the primary performance metric has been called into question.
The distribution of hard samples between training and test sets affects the difficulty levels of those sets.
We propose a benchmarking procedure for comparing hard sample identification methods.
arXiv Detail & Related papers (2024-09-22T11:38:14Z)
- Is Difficulty Calibration All We Need? Towards More Practical Membership Inference Attacks [16.064233621959538]
We propose a query-efficient and computation-efficient MIA that directly Re-leverAges the original membershiP scores to mItigate the errors in Difficulty calibration (RAPID).
arXiv Detail & Related papers (2024-08-31T11:59:42Z)
- Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review.
A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods.
We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z)
- Sources of Gain: Decomposing Performance in Conditional Average Dose Response Estimation [0.9332308328407303]
Estimating conditional average dose responses (CADR) is an important but challenging problem.
Our paper analyses this practice and shows that using popular benchmark datasets without further analysis is insufficient to judge model performance.
We propose a novel decomposition scheme that allows the evaluation of the impact of five distinct components contributing to CADR estimator performance.
arXiv Detail & Related papers (2024-06-12T13:39:32Z)
- Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps.
We show that ReasonEval achieves state-of-the-art performance on human-labeled datasets.
We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
- DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [83.30006900263744]
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights.
We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs.
Our DACO-RL algorithm is judged by human annotators to produce more helpful answers than the SFT model in 57.72% of cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z)
- Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations [111.88727295707454]
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP.
We propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts.
We conduct experiments on pre-trained language models for analysis and evaluation of OOD robustness.
arXiv Detail & Related papers (2023-06-07T17:47:03Z)
- Temporal Output Discrepancy for Loss Estimation-based Active Learning [65.93767110342502]
We present a novel deep active learning approach that queries the oracle for data annotation when an unlabeled sample is expected to incur a high loss.
Our approach outperforms state-of-the-art active learning methods on image classification and semantic segmentation tasks.
arXiv Detail & Related papers (2022-12-20T19:29:37Z)
- Re-TACRED: Addressing Shortcomings of the TACRED Dataset [5.820381428297218]
TACRED is one of the largest and most widely used sentence-level relation extraction datasets.
Proposed models that are evaluated using this dataset consistently set new state-of-the-art performance.
However, they still exhibit large error rates despite leveraging external knowledge and unsupervised pretraining on large text corpora.
arXiv Detail & Related papers (2021-04-16T22:55:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.