Detecting Quality Problems in Data Models by Clustering Heterogeneous
Data Values
- URL: http://arxiv.org/abs/2111.06661v1
- Date: Fri, 12 Nov 2021 11:05:18 GMT
- Authors: Viola Wenz, Arno Kesper, Gabriele Taentzer
- Abstract summary: We propose a bottom-up approach to detecting quality problems in data models that manifest in heterogeneous data values.
All values of a selected data field are clustered by syntactic similarity.
This helps domain experts understand how the data model is used in practice and derive potential quality problems of the data model.
- Score: 1.143020642249583
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data is of high quality if it is fit for its intended use. The quality of
data is influenced by the underlying data model and its quality. One major
quality problem is data heterogeneity, which impairs quality aspects such as
understandability and interoperability. This heterogeneity may be caused by
quality problems in the data model. Data heterogeneity can occur in particular
when the information given is not structured enough and is captured only in
data values, often due to missing or unsuitable structure in the underlying
data model. We propose a bottom-up approach to detecting quality problems in
data models that manifest in heterogeneous data values. It supports an
explorative analysis of the existing data and can be configured by domain
experts according to their domain knowledge. All values of a selected data
field are clustered by syntactic similarity, providing an overview of the
syntactic diversity of the data values. This helps domain experts understand
how the data model is used in practice and derive potential quality problems
of the data model. We outline a proof-of-concept implementation and evaluate
our approach using cultural heritage data.
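The clustering step can be illustrated with a minimal sketch. Note that the abstraction function and grouping below are our own simplifying assumptions for illustration, not the paper's actual similarity measure or clustering algorithm: each value is abstracted into a syntax pattern (letters become `a`, digits become `9`), and values sharing a pattern form a cluster.

```python
from collections import defaultdict

def syntax_pattern(value: str) -> str:
    """Abstract a data value into a syntactic pattern:
    letters -> 'a', digits -> '9', other characters kept as-is."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("a")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)

def cluster_by_syntax(values):
    """Group all values of a data field by their syntax pattern,
    giving an overview of the field's syntactic diversity."""
    clusters = defaultdict(list)
    for v in values:
        clusters[syntax_pattern(v)].append(v)
    return dict(clusters)

# Example: heterogeneous date values, as might occur in a
# cultural heritage data field (hypothetical sample data)
dates = ["1901", "1902", "ca. 1900", "1899-1901", "unknown"]
clusters = cluster_by_syntax(dates)
```

Each cluster then exposes one syntactic convention in use; many small, divergent clusters for one field would hint at missing structure in the underlying data model.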
Related papers
- Attribute-Based Semantic Type Detection and Data Quality Assessment [0.5735035463793008]
This research introduces an innovative methodology centered around Attribute-Based Semantic Type Detection and Data Quality Assessment.
By leveraging semantic information within attribute labels, combined with rule-based analysis and comprehensive Formats and Abbreviations dictionaries, our approach introduces a practical semantic type classification system.
A comparative analysis with Sherlock, a state-of-the-art Semantic Type Detection system, shows the advantages of our approach.
arXiv Detail & Related papers (2024-10-04T09:22:44Z) - AI-Driven Frameworks for Enhancing Data Quality in Big Data Ecosystems: Error Detection, Correction, and Metadata Integration [0.0]
This thesis proposes a novel set of interconnected frameworks aimed at enhancing big data quality comprehensively.
Firstly, we introduce new quality metrics and a weighted scoring system for precise data quality assessment.
Thirdly, we present a generic framework for detecting various quality anomalies using AI models.
arXiv Detail & Related papers (2024-05-06T21:36:45Z) - Enhancing Data Quality in Federated Fine-Tuning of Foundation Models [54.757324343062734]
We propose a data quality control pipeline for federated fine-tuning of foundation models.
This pipeline computes scores reflecting the quality of training data and determines a global threshold for a unified standard.
Our experiments show that the proposed quality control pipeline improves the effectiveness and reliability of model training, leading to better performance.
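The score-then-threshold idea can be sketched as follows. The scoring heuristic and function names here are our assumptions for illustration only; the actual pipeline would use a learned or task-specific scorer:

```python
def quality_score(sample: str) -> float:
    """Toy quality heuristic: fraction of alphanumeric/space
    characters. Stands in for whatever scorer the pipeline uses."""
    if not sample:
        return 0.0
    return sum(ch.isalnum() or ch.isspace() for ch in sample) / len(sample)

def global_threshold(scores, quantile=0.25):
    """Choose one threshold across all scores so that every
    participant applies the same unified standard."""
    ordered = sorted(scores)
    return ordered[int(len(ordered) * quantile)]

def filter_data(samples):
    """Keep only samples whose quality score meets the
    globally determined threshold."""
    scores = [quality_score(s) for s in samples]
    t = global_threshold(scores)
    return [s for s, sc in zip(samples, scores) if sc >= t]
```

The design point is that the threshold is computed once over all scores rather than per client, so filtering applies one standard everywhere.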
arXiv Detail & Related papers (2024-03-07T14:28:04Z) - Striving for data-model efficiency: Identifying data externalities on
group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z) - Rethinking Data Heterogeneity in Federated Learning: Introducing a New
Notion and Standard Benchmarks [65.34113135080105]
We show that data heterogeneity in current setups is not necessarily a problem and can in fact be beneficial for the FL participants.
Our observations are intuitive.
Our code is available at https://github.com/MMorafah/FL-SC-NIID.
arXiv Detail & Related papers (2022-09-30T17:15:19Z) - Data-SUITE: Data-centric identification of in-distribution incongruous
examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z) - Exploring the Efficacy of Automatically Generated Counterfactuals for
Sentiment Analysis [17.811597734603144]
We propose an approach to automatically generating counterfactual data for data augmentation and explanation.
A comprehensive evaluation on several datasets, using a variety of state-of-the-art benchmarks, demonstrates that our approach achieves significant improvements in model performance.
arXiv Detail & Related papers (2021-06-29T10:27:01Z) - On the Efficacy of Adversarial Data Collection for Question Answering:
Results from a Large-Scale Randomized Study [65.17429512679695]
In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions.
Despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models.
arXiv Detail & Related papers (2021-06-02T00:48:33Z) - Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z) - Variational Selective Autoencoder: Learning from Partially-Observed
Heterogeneous Data [45.23338389559936]
We propose the variational selective autoencoder (VSAE) to learn representations from partially-observed heterogeneous data.
VSAE learns the latent dependencies in heterogeneous data by modeling the joint distribution of observed data, unobserved data, and the imputation mask.
It results in a unified model for various downstream tasks including data generation and imputation.
arXiv Detail & Related papers (2021-02-25T04:39:13Z) - Data Quality Evaluation using Probability Models [0.0]
It is shown that, for the data examined, data quality can be accurately predicted from simple good/bad pre-labelled learning examples.
arXiv Detail & Related papers (2020-09-14T18:12:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.