Attribute-Based Semantic Type Detection and Data Quality Assessment
- URL: http://arxiv.org/abs/2410.14692v1
- Date: Fri, 04 Oct 2024 09:22:44 GMT
- Title: Attribute-Based Semantic Type Detection and Data Quality Assessment
- Authors: Marcelo Valentim Silva, Hannes Herrmann, Valerie Maxville
- Abstract summary: This research introduces an innovative methodology centered around Attribute-Based Semantic Type Detection and Data Quality Assessment.
By leveraging semantic information within attribute labels, combined with rule-based analysis and comprehensive Formats and Abbreviations dictionaries, our approach introduces a practical semantic type classification system.
A comparative analysis with Sherlock, a state-of-the-art Semantic Type Detection system, shows the advantages of our approach.
- Score: 0.5735035463793008
- License:
- Abstract: The reliance on data-driven decision-making across sectors highlights the critical need for high-quality data; despite advancements, data quality issues persist, significantly impacting business strategies and scientific research. Current data quality methods fail to leverage the semantic richness embedded in words inside attribute labels (or column names/headers in tables) across diverse datasets and domains, leaving a crucial gap in comprehensive data quality evaluation. This research addresses this gap by introducing an innovative methodology centered around Attribute-Based Semantic Type Detection and Data Quality Assessment. By leveraging semantic information within attribute labels, combined with rule-based analysis and comprehensive Formats and Abbreviations dictionaries, our approach introduces a practical semantic type classification system comprising approximately 23 types, including numerical non-negative, categorical, ID, names, strings, geographical, temporal, and complex formats like URLs, IP addresses, email, and binary values plus several numerical bounded types, such as age and percentage. A comparative analysis with Sherlock, a state-of-the-art Semantic Type Detection system, shows the advantages of our approach in terms of classification robustness and applicability to data quality assessment tasks. Our research focuses on well-known data quality issues and their corresponding data quality dimension violations, grounding our methodology in a robust academic framework. Detailed analysis of fifty distinct datasets from the UCI Machine Learning Repository showcases our method's proficiency in identifying potential data quality issues. Compared to established tools like YData Profiling, our method exhibits superior accuracy, detecting 81 missing values across 922 attributes where YData identified only one.
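The core idea of attribute-based detection can be illustrated with a minimal sketch. The keyword dictionary and type labels below are hypothetical stand-ins; the paper's actual Formats and Abbreviations dictionaries cover roughly 23 semantic types and are far more comprehensive.

```python
import re

# Hypothetical keyword-to-type mapping; the paper's real dictionaries
# are much larger and also cover formats like binary values and IDs
# expressed via regular expressions over the data itself.
KEYWORD_TYPES = {
    "id": "ID", "identifier": "ID",
    "age": "numerical-bounded (age)",
    "percent": "numerical-bounded (percentage)",
    "percentage": "numerical-bounded (percentage)",
    "date": "temporal", "year": "temporal", "time": "temporal",
    "email": "email",
    "url": "URL",
    "ip": "IP address",
    "name": "name",
    "city": "geographical", "country": "geographical",
    "latitude": "geographical", "longitude": "geographical",
}

def detect_semantic_type(attribute_label: str) -> str:
    """Classify a column header by matching its tokens against keywords."""
    # Split the label on non-alphanumeric characters so that headers
    # like "customer_id" or "Birth Date" yield individual tokens.
    tokens = re.split(r"[^a-z0-9]+", attribute_label.lower())
    for token in tokens:
        if token in KEYWORD_TYPES:
            return KEYWORD_TYPES[token]
    return "string"  # default when no keyword matches
```

For example, `detect_semantic_type("customer_id")` returns `"ID"` and `detect_semantic_type("birth_date")` returns `"temporal"`. Once a column's semantic type is known, type-specific quality rules (e.g., an age must be a non-negative bounded integer) can flag violations that generic profilers miss.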
Related papers
- Unsupervised Anomaly Detection for Auditing Data and Impact of Categorical Encodings [20.37092575427039]
Vehicle Claims dataset consists of fraudulent insurance claims for automotive repairs.
We tackle the common problem of missing benchmark datasets for anomaly detection.
The dataset is evaluated on shallow and deep learning methods.
arXiv Detail & Related papers (2022-10-25T14:33:17Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
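Random oversampling, the simplest of these strategies, can be sketched as follows. This is a minimal illustration, not the method of the cited paper; production work would typically use a library such as imbalanced-learn, which also provides synthetic approaches like SMOTE.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate minority-class examples at random until every class
    reaches the majority class count."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())  # size of the majority class
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # Indices of all examples belonging to this class.
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Append random duplicates until the class matches the majority.
        for _ in range(target - count):
            i = rng.choice(idx)
            X_out.append(X[i])
            y_out.append(y[i])
    return X_out, y_out
```

Undersampling is the mirror image: instead of duplicating minority examples, majority-class examples are randomly dropped until the classes balance, trading data loss for a smaller training set.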
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Statistical Learning to Operationalize a Domain Agnostic Data Quality Scoring [8.864453148536061]
The study provides an automated platform that takes an incoming dataset and its metadata and produces a data quality (DQ) score, report, and label.
These results would be useful to data scientists, as the quality label instills confidence in the data before it is deployed for a practical application.
arXiv Detail & Related papers (2021-08-16T12:20:57Z)
- Self-Trained One-class Classification for Unsupervised Anomaly Detection [56.35424872736276]
Anomaly detection (AD) has various applications across domains, from manufacturing to healthcare.
In this work, we focus on unsupervised AD problems whose entire training data are unlabeled and may contain both normal and anomalous samples.
To tackle this problem, we build a robust one-class classification framework via data refinement.
We show that our method outperforms a state-of-the-art one-class classification method by 6.3 AUC and 12.5 average precision.
arXiv Detail & Related papers (2021-06-11T01:36:08Z)
- Weak Adaptation Learning -- Addressing Cross-domain Data Insufficiency with Weak Annotator [2.8672054847109134]
In some target problem domains, there are not many data samples available, which could hinder the learning process.
We propose a weak adaptation learning (WAL) approach that leverages unlabeled data from a similar source domain.
Our experiments demonstrate the effectiveness of our approach in learning an accurate classifier with limited labeled data in the target domain.
arXiv Detail & Related papers (2021-02-15T06:19:25Z)
- Through the Data Management Lens: Experimental Analysis and Evaluation of Fair Classification [75.49600684537117]
Data management research is showing an increasing presence and interest in topics related to data and algorithmic fairness.
We contribute a broad analysis of 13 fair classification approaches and additional variants, over their correctness, fairness, efficiency, scalability, and stability.
Our analysis highlights novel insights on the impact of different metrics and high-level approach characteristics on different aspects of performance.
arXiv Detail & Related papers (2021-01-18T22:55:40Z)
- Adversarial Knowledge Transfer from Unlabeled Data [62.97253639100014]
We present a novel Adversarial Knowledge Transfer framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier.
An important novel aspect of our method is that the unlabeled source data can be of different classes from those of the labeled target data, and there is no need to define a separate pretext task.
arXiv Detail & Related papers (2020-08-13T08:04:27Z)
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
- What is the Value of Data? On Mathematical Methods for Data Quality Estimation [35.75162309592681]
We propose a formal definition for the quality of a given dataset.
We assess a dataset's quality by a quantity we call the expected diameter.
arXiv Detail & Related papers (2020-01-09T18:56:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences arising from its use.