Statistical Learning to Operationalize a Domain Agnostic Data Quality Scoring
- URL: http://arxiv.org/abs/2108.08905v1
- Date: Mon, 16 Aug 2021 12:20:57 GMT
- Title: Statistical Learning to Operationalize a Domain Agnostic Data Quality Scoring
- Authors: Sezal Chug, Priya Kaushal, Ponnurangam Kumaraguru, Tavpritesh Sethi
- Abstract summary: The research study provides an automated platform which takes an incoming dataset and metadata to provide the DQ score, report and label.
The results of this study would be useful to data scientists, as this quality label would instill confidence before they deploy the data in their respective practical applications.
- Score: 8.864453148536061
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data is expanding at an unimaginable rate, and with this growth comes responsibility for its quality. Data quality refers to the relevance of the information present and supports operations such as decision making and planning within an organization. Data quality is mostly measured on an ad-hoc basis, so few of the concepts developed so far see practical application. The current empirical study was undertaken to build a concrete automated data quality platform that assesses the quality of an incoming dataset and generates a quality label, score, and comprehensive report. We utilize various datasets from healthdata.gov, opendata.nhs, and the Demographics and Health Surveys (DHS) Program to observe the variations in the quality score and formulate a label using Principal Component Analysis (PCA). The results of the study revealed a metric that encompasses nine quality ingredients, namely provenance, dataset characteristics, uniformity, metadata coupling, percentage of missing cells and duplicate rows, skewness of the data, the ratio of inconsistencies in categorical columns, and the correlation between these attributes. The study also provides an illustrative case study and validates the metric following Mutation Testing approaches. The research provides an automated platform that takes an incoming dataset and its metadata and produces the DQ score, report, and label. The results of this study would be useful to data scientists, as this quality label would instill confidence before they deploy the data in their respective practical applications.
Related papers
- A Guide to Misinformation Detection Datasets [5.673951146506489]
This guide aims to provide a roadmap for obtaining higher quality data and conducting more effective evaluations.
All datasets and other artifacts are available at https://misinfo-datasets.complexdatalab.com/.
arXiv Detail & Related papers (2024-11-07T18:47:39Z)
- Attribute-Based Semantic Type Detection and Data Quality Assessment [0.5735035463793008]
This research introduces an innovative methodology centered around Attribute-Based Semantic Type Detection and Data Quality Assessment.
By leveraging semantic information within attribute labels, combined with rule-based analysis and comprehensive Formats and Abbreviations dictionaries, our approach introduces a practical semantic type classification system.
A comparative analysis with Sherlock, a state-of-the-art Semantic Type Detection system, shows the advantages of our approach.
arXiv Detail & Related papers (2024-10-04T09:22:44Z)
- ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering [54.80411755871931]
Question Answering (QA) effectively evaluates language models' reasoning and knowledge depth.
Chemical QA plays a crucial role in both education and research by effectively translating complex chemical information into a readily understandable format.
This dataset reflects typical real-world challenges, including an imbalanced data distribution and a substantial amount of unlabeled data that can be potentially useful.
We introduce a QAMatch model, specifically designed to effectively answer chemical questions by fully leveraging our collected data.
arXiv Detail & Related papers (2024-07-24T01:46:55Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- A Novel Metric for Measuring Data Quality in Classification Applications (extended version) [0.0]
We introduce and explain a novel metric to measure data quality.
This metric is based on the correlated evolution between the classification performance and the deterioration of data.
We provide an interpretation of each criterion and examples of assessment levels.
arXiv Detail & Related papers (2023-12-13T11:20:09Z)
- Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection [8.12993269922936]
We argue that data collection for AI should be performed in a responsible manner.
We propose a Responsible AI (RAI) methodology designed to guide the data collection with a set of metrics.
arXiv Detail & Related papers (2023-08-22T18:01:27Z)
- QI2 -- an Interactive Tool for Data Quality Assurance [63.379471124899915]
The planned AI Act from the European Commission defines challenging legal requirements for data quality.
We introduce a novel approach that supports the data quality assurance process of multiple data quality aspects.
arXiv Detail & Related papers (2023-07-07T07:06:38Z)
- Investigating Data Variance in Evaluations of Automatic Machine Translation Metrics [58.50754318846996]
In this paper, we show that the performances of metrics are sensitive to data.
The ranking of metrics varies when the evaluation is conducted on different datasets.
arXiv Detail & Related papers (2022-03-29T18:58:28Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
- What is the Value of Data? On Mathematical Methods for Data Quality Estimation [35.75162309592681]
We propose a formal definition for the quality of a given dataset.
We assess a dataset's quality by a quantity we call the expected diameter.
arXiv Detail & Related papers (2020-01-09T18:56:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.