A Novel Metric for Measuring Data Quality in Classification Applications
(extended version)
- URL: http://arxiv.org/abs/2312.08066v1
- Date: Wed, 13 Dec 2023 11:20:09 GMT
- Title: A Novel Metric for Measuring Data Quality in Classification Applications
(extended version)
- Authors: Jouseau Roxane, Salva Sébastien, Samir Chafik
- Abstract summary: We introduce and explain a novel metric to measure data quality.
This metric is based on the correlated evolution between the classification performance and the deterioration of data.
We provide an interpretation of each criterion and examples of assessment levels.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data quality is a key element for building and optimizing good learning
models. Despite many attempts to characterize data quality, there is still a
need for rigorous formalization and an efficient measure of the quality from
available observations. Indeed, without a clear understanding of the training
and testing processes, it is hard to evaluate the intrinsic performance of a
model. Moreover, tools for measuring data quality specific to machine
learning are still lacking. In this paper, we introduce and explain a novel
metric to measure data quality. This metric is based on the correlated
evolution between the classification performance and the deterioration of data.
The proposed method has the major advantage of being model-independent.
Furthermore, we provide an interpretation of each criterion and examples of
assessment levels. We confirm the utility of the proposed metric with intensive
numerical experiments and detail some illustrative cases with controlled and
interpretable qualities.
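As a rough illustration of the idea (not the authors' implementation), the following sketch deteriorates a dataset at increasing rates, re-evaluates a classifier at each level, and summarizes the correlated evolution between deterioration and performance with a rank correlation; the label-noise corruption, logistic-regression classifier, accuracy score, and Spearman correlation are all illustrative assumptions.
```python
# Minimal sketch: correlate classifier performance with controlled data deterioration.
# Assumed choices (not from the paper): label-noise corruption, logistic regression,
# accuracy as the performance score, Spearman correlation as the summary statistic.
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def corrupt_labels(labels, rate, rng):
    """Flip a fraction `rate` of the training labels (binary case)."""
    noisy = labels.copy()
    idx = rng.choice(len(noisy), size=int(rate * len(noisy)), replace=False)
    noisy[idx] = 1 - noisy[idx]
    return noisy

rates = np.linspace(0.0, 0.4, 9)
scores = []
for rate in rates:
    y_noisy = corrupt_labels(y_train, rate, rng)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_noisy)
    scores.append(accuracy_score(y_test, clf.predict(X_test)))

# A strong negative correlation means performance degrades in step with the
# injected deterioration; a weak one suggests the data were already of low quality.
corr, _ = spearmanr(rates, scores)
print(f"Spearman correlation between deterioration and accuracy: {corr:.3f}")
```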
Related papers
- Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs [11.24476329991465]
Training large language models (LLMs) for external tool usage is a rapidly expanding field.
The absence of systematic data quality checks poses complications for properly training and testing models.
We propose two approaches for assessing the reliability of data for training LLMs to use external tools.
arXiv Detail & Related papers (2024-09-24T17:20:02Z)
- QuRating: Selecting High-Quality Data for Training Language Models [64.83332850645074]
We introduce QuRating, a method for selecting pre-training data that can capture human intuitions about data quality.
In this paper, we investigate four qualities - writing style, required expertise, facts & trivia, and educational value.
We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B-token training corpus with quality ratings for each of the four criteria.
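A minimal sketch of the pairwise-to-scalar idea behind such a rater (the features, architecture, and data below are placeholders, not QuRating's setup): a scalar scorer is trained with a Bradley-Terry-style loss so that the item judged higher quality in each pair receives the higher rating.
```python
# Sketch: learn scalar quality ratings from pairwise judgments with a
# Bradley-Terry-style loss. Inputs are placeholder feature vectors.
import torch
import torch.nn as nn

torch.manual_seed(0)
scorer = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-3)

# Toy pairwise data: for each pair (a, b), label 1 means "a preferred over b".
feats_a = torch.randn(256, 16)
feats_b = torch.randn(256, 16)
prefs = torch.randint(0, 2, (256,)).float()

for _ in range(200):
    s_a = scorer(feats_a).squeeze(-1)
    s_b = scorer(feats_b).squeeze(-1)
    # P(a preferred over b) = sigmoid(score_a - score_b); train with cross-entropy.
    loss = nn.functional.binary_cross_entropy_with_logits(s_a - s_b, prefs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained scorer then assigns a scalar quality rating to any single item.
print(scorer(torch.randn(1, 16)).item())
```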
arXiv Detail & Related papers (2024-02-15T06:36:07Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves, for example, the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Assessing Dataset Quality Through Decision Tree Characteristics in Autoencoder-Processed Spaces [0.30458514384586394]
We show the profound impact of dataset quality on model training and performance.
Our findings underscore the importance of appropriate feature selection, adequate data volume, and data quality.
This research offers valuable insights into data assessment practices, contributing to the development of more accurate and robust machine learning models.
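A rough sketch of that recipe, with PCA standing in for the autoencoder and tree depth, leaf count, and cross-validated accuracy serving as illustrative quality indicators (not the paper's exact criteria):
```python
# Sketch: assess a dataset via decision-tree characteristics in a compressed space.
# PCA is used here as a stand-in for the autoencoder step.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
Z = PCA(n_components=5, random_state=0).fit_transform(X)

tree = DecisionTreeClassifier(random_state=0).fit(Z, y)
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0), Z, y, cv=5).mean()

# Shallow trees with few leaves and high CV accuracy suggest the compressed
# representation separates the classes cleanly; the opposite hints at noisy data.
print(f"depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, cv_acc={cv_acc:.3f}")
```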
arXiv Detail & Related papers (2023-06-27T11:33:31Z)
- Quality In / Quality Out: Assessing Data Quality in an Anomaly Detection Benchmark [0.13764085113103217]
We show that relatively minor modifications to the same benchmark dataset (UGR'16, a flow-based real-traffic dataset for anomaly detection) have a significantly larger impact on model performance than the specific machine learning technique considered.
Our findings illustrate the need to devote more attention to (automatic) data quality assessment and optimization techniques in the context of autonomous networks.
arXiv Detail & Related papers (2023-05-31T12:03:12Z)
- Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
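One minimal way to probe for such an externality (the data sources, sub-groups, and model below are illustrative assumptions) is to compare per-group accuracy with and without the extra source:
```python
# Sketch: check whether adding a training source lowers accuracy for a sub-group.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Source A and a held-out test set from one data-generating process.
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_a, X_test, y_a, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
# Source B: a hypothetical extra source with much noisier labels.
X_b, y_b = make_classification(n_samples=1500, n_features=10, flip_y=0.4, random_state=0)

groups = (X_test[:, 0] > 0).astype(int)  # illustrative sub-groups of the test set

def group_accuracy(model):
    return {g: accuracy_score(y_test[groups == g], model.predict(X_test[groups == g]))
            for g in (0, 1)}

base = LogisticRegression(max_iter=1000).fit(X_a, y_a)
augmented = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_a, X_b]), np.concatenate([y_a, y_b]))

# If accuracy drops for a group after adding source B, the extra data imposes
# an externality on that group even though the overall training set grew.
print("without source B:", group_accuracy(base))
print("with source B:   ", group_accuracy(augmented))
```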
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
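A simplified sketch of that idea, with a random forest standing in for the Bayesian neural network and a hypothetical split of the test set into a small labelled slice and a larger unlabelled remainder:
```python
# Sketch: estimate a model's accuracy on mostly-unlabelled test data by training
# a surrogate on the few labelled test points and using it to impute the rest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model_under_test = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Assume only a small slice of the test labels is available.
n_labelled = 100
surrogate = RandomForestClassifier(random_state=0).fit(X_test[:n_labelled], y_test[:n_labelled])
y_imputed = surrogate.predict(X_test[n_labelled:])

est = accuracy_score(y_imputed, model_under_test.predict(X_test[n_labelled:]))
true = accuracy_score(y_test[n_labelled:], model_under_test.predict(X_test[n_labelled:]))
print(f"estimated accuracy={est:.3f}, true accuracy={true:.3f}")
```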
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
- How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models [95.8037674226622]
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion.
Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
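A simplified, nearest-neighbour flavour of sample-level precision and recall (a stand-in for the paper's metric, not its definition): precision checks whether synthetic samples fall within the support of the real data, and recall checks the converse.
```python
# Sketch: nearest-neighbour precision/recall between real and synthetic samples.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 8))
synthetic = rng.normal(loc=0.3, size=(1000, 8))  # hypothetical generator output

def support_radius(points, k=5):
    """Distance to the k-th nearest neighbour, defining a local support radius."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(points)
    dists, _ = nn.kneighbors(points)
    return nn, dists[:, -1]

def coverage(queries, ref_nn, ref_radius):
    """Fraction of query points lying within some reference point's radius."""
    dists, idx = ref_nn.kneighbors(queries, n_neighbors=1)
    return float(np.mean(dists[:, 0] <= ref_radius[idx[:, 0]]))

real_nn, real_r = support_radius(real)
synth_nn, synth_r = support_radius(synthetic)

precision = coverage(synthetic, real_nn, real_r)  # fidelity: synthetic inside real support
recall = coverage(real, synth_nn, synth_r)        # diversity: real covered by synthetic
print(f"precision={precision:.2f}, recall={recall:.2f}")
```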
arXiv Detail & Related papers (2021-02-17T18:25:30Z)
- How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as $\rho$-gap.
We show how the $\rho$-gap can be applied to a feedback linearizing control law.
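The $\rho$-gap itself is defined analytically in the paper; the sketch below only illustrates the data-density side of the relationship, using a k-nearest-neighbour distance around hypothetical query states as a simple density proxy.
```python
# Sketch: a k-NN distance as a proxy for training-data density around query states.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
states = rng.uniform(-1, 1, size=(500, 2))   # hypothetical training states
queries = np.array([[0.9, 0.9], [0.0, 0.0]]) # states the controller may visit

nn = NearestNeighbors(n_neighbors=10).fit(states)
dists, _ = nn.kneighbors(queries)

# Larger mean k-NN distance -> sparser data around that state, which the paper's
# analysis links to a larger gap in achievable control performance.
print(dists.mean(axis=1))
```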
arXiv Detail & Related papers (2020-05-25T12:13:49Z)
- What is the Value of Data? On Mathematical Methods for Data Quality Estimation [35.75162309592681]
We propose a formal definition for the quality of a given dataset.
We assess a dataset's quality by a quantity we call the expected diameter.
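A loose empirical proxy for a diameter-style quantity (the paper's definition is formal and stated over a hypothesis class; this sketch only approximates the spirit): fit models on resampled versions of the dataset and measure the expected disagreement between random pairs of them.
```python
# Sketch: expected disagreement between models fit on resampled data,
# as a rough proxy for an "expected diameter"-style quality measure.
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_eval = X[:200]  # reference points on which disagreement is measured

models = []
for seed in range(10):
    X_b, y_b = resample(X, y, random_state=seed)
    models.append(LogisticRegression(max_iter=1000).fit(X_b, y_b))

preds = np.array([m.predict(X_eval) for m in models])
pairwise = [np.mean(preds[i] != preds[j])
            for i, j in combinations(range(len(models)), 2)]

# Smaller expected disagreement -> the dataset constrains the learned behaviour
# more tightly, i.e., higher quality under this diameter-style reading.
print(f"expected disagreement (diameter proxy): {np.mean(pairwise):.3f}")
```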
arXiv Detail & Related papers (2020-01-09T18:56:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.