Statistical Dataset Evaluation: Reliability, Difficulty, and Validity
- URL: http://arxiv.org/abs/2212.09272v1
- Date: Mon, 19 Dec 2022 06:55:42 GMT
- Title: Statistical Dataset Evaluation: Reliability, Difficulty, and Validity
- Authors: Chengwen Wang, Qingxiu Dong, Xiaochen Wang, Haitao Wang and Zhifang Sui
- Abstract summary: We propose a model-agnostic dataset evaluation framework for automatic dataset quality evaluation.
We seek the statistical properties of the datasets and address three fundamental dimensions: reliability, difficulty, and validity.
- Score: 18.36931975072938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Datasets serve as crucial training resources and model performance trackers.
However, existing datasets have exposed a plethora of problems, inducing biased
models and unreliable evaluation results. In this paper, we propose a
model-agnostic dataset evaluation framework for automatic dataset quality
evaluation. We seek the statistical properties of the datasets and address
three fundamental dimensions: reliability, difficulty, and validity, following
classical test theory. Taking the Named Entity Recognition (NER) datasets as a
case study, we introduce nine statistical metrics for the dataset
evaluation framework. Experimental results and human evaluation validate that
our evaluation framework effectively assesses various aspects of the dataset
quality. Furthermore, we study how the dataset scores on our statistical
metrics affect the model performance, and appeal for dataset quality evaluation
or targeted dataset improvement before training or testing models.
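The abstract does not spell out the nine metrics, so the following is a minimal sketch, assuming a CoNLL-style NER corpus, of what simple model-agnostic statistical checks in this spirit can look like. The metric names, formulas, and file paths are illustrative assumptions, not the paper's definitions.

```python
# Illustrative sketch only: simple statistical checks over a CoNLL-style NER
# dataset. These are assumed example metrics, NOT the nine metrics proposed
# in the paper.
from collections import Counter
import math

def read_conll(path):
    """Yield sentences as lists of (token, tag) pairs from a CoNLL file."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if sentence:
                    yield sentence
                    sentence = []
            else:
                cols = line.split()
                sentence.append((cols[0], cols[-1]))
    if sentence:
        yield sentence

def label_entropy(sentences):
    """Entropy of the tag distribution (a rough 'difficulty'-style signal)."""
    counts = Counter(tag for sent in sentences for _, tag in sent)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entity_density(sentences):
    """Fraction of tokens inside an entity (any tag other than 'O')."""
    tags = [tag for sent in sentences for _, tag in sent]
    return sum(tag != "O" for tag in tags) / max(len(tags), 1)

def train_test_overlap(train, test):
    """Share of test sentences that also appear verbatim in the training
    split (a simple 'reliability'-style leakage check)."""
    train_set = {tuple(tok for tok, _ in sent) for sent in train}
    hits = sum(tuple(tok for tok, _ in sent) in train_set for sent in test)
    return hits / max(len(test), 1)

if __name__ == "__main__":
    train = list(read_conll("train.conll"))  # hypothetical file paths
    test = list(read_conll("test.conll"))
    print("label entropy (train):", round(label_entropy(train), 3))
    print("entity density (train):", round(entity_density(train), 3))
    print("train/test sentence overlap:", round(train_test_overlap(train, test), 3))
```

Because such checks only read the annotated corpus, they can be run before any model training, which is the workflow the paper advocates.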
Related papers
- Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs [11.24476329991465]
Training large language models (LLMs) for external tool usage is a rapidly expanding field.
The absence of systematic data quality checks poses complications for properly training and testing models.
We propose two approaches for assessing the reliability of data for training LLMs to use external tools.
arXiv Detail & Related papers (2024-09-24T17:20:02Z)
- On Evaluation of Vision Datasets and Models using Human Competency Frameworks [20.802372291783488]
Item Response Theory (IRT) is a framework that infers interpretable latent parameters for an ensemble of models and each dataset item (a minimal IRT sketch follows the related-papers list below).
We assess model calibration, select informative data subsets, and demonstrate the usefulness of IRT's latent parameters for analyzing and comparing models and datasets in computer vision.
arXiv Detail & Related papers (2024-09-06T06:20:11Z)
- Proper Dataset Valuation by Pointwise Mutual Information [26.693741797887643]
We propose an information-theoretic framework for evaluating data curation methods.
We compare informativeness using the Shannon mutual information between the evaluated data and the test data.
Experiments on real-world data demonstrate that our mutual information-based evaluation assigns appropriately lower scores to lower-quality data curation strategies.
arXiv Detail & Related papers (2024-05-28T15:04:17Z)
- TRIAGE: Characterizing and auditing training data for improved regression [80.11415390605215]
We introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors.
TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score.
We show that TRIAGE's characterization is consistent and highlight its utility for improving performance via data sculpting/filtering in multiple regression settings.
arXiv Detail & Related papers (2023-10-29T10:31:59Z)
- On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets [71.54954966652286]
We evaluate Vision-Language Instruction-Tuning (VLIT) datasets.
We build a new dataset, REVO-LION, by collecting samples with higher sample quality (SQ) from each dataset.
Remarkably, even with only half of the complete data, the model trained on REVO-LION achieves performance comparable to simply combining all VLIT datasets.
arXiv Detail & Related papers (2023-10-10T13:01:38Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
- Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Data Quality Evaluation using Probability Models [0.0]
It is shown that, for the data examined, data quality can be accurately predicted from simple good/bad pre-labelled learning examples.
arXiv Detail & Related papers (2020-09-14T18:12:19Z)
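Two of the entries above (the human-competency and adaptive-testing papers) rely on Item Response Theory. As a rough illustration of the idea, and not code from either paper, the sketch below uses an assumed two-parameter logistic (2PL) formulation, where a latent model ability and an item difficulty combine into a probability of a correct response.

```python
import math

def p_correct(ability, difficulty, discrimination=1.0):
    """2PL Item Response Theory: probability that a model with the given
    latent ability answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# Example: the same model on an easy item vs. a hard item.
print(p_correct(ability=1.5, difficulty=-0.5))  # ~0.88
print(p_correct(ability=1.5, difficulty=2.0))   # ~0.38
```

In practice the abilities, difficulties, and discriminations are latent parameters fitted jointly from a matrix of model-by-item responses; the example above only shows how the fitted parameters are interpreted.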
This list is automatically generated from the titles and abstracts of the papers on this site.