Towards augmented data quality management: Automation of Data Quality Rule Definition in Data Warehouses
- URL: http://arxiv.org/abs/2406.10940v1
- Date: Sun, 16 Jun 2024 13:43:04 GMT
- Title: Towards augmented data quality management: Automation of Data Quality Rule Definition in Data Warehouses
- Authors: Heidi Carolina Tamm, Anastasija Nikiforova
- Abstract summary: This study explores the potential for automating data quality management within data warehouses, a data repository commonly used by large organizations.
The review encompassed 151 tools from various sources, revealing that most current tools focus on data cleansing and fixing in domain-specific databases rather than data warehouses.
Only a limited number of tools, specifically ten, demonstrated the capability to detect DQ rules, let alone to implement them in data warehouses.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the contemporary data-driven landscape, ensuring data quality (DQ) is crucial for deriving actionable insights from vast data repositories. The objective of this study is to explore the potential for automating data quality management within data warehouses, a data repository commonly used by large organizations. By conducting a systematic review of existing DQ tools available on the market and in the academic literature, the study assesses their capability to automatically detect and enforce data quality rules. The review encompassed 151 tools from various sources, revealing that most current tools focus on data cleansing and fixing in domain-specific databases rather than data warehouses. Only a limited number of tools, specifically ten, demonstrated the capability to detect DQ rules, let alone to implement them in data warehouses. The findings underscore a significant gap in the market and in academic research regarding AI-augmented DQ rule detection in data warehouses. This paper advocates for further development in this area to enhance the efficiency of DQ management processes, reduce human workload, and lower costs. The study highlights the necessity of advanced tools for automated DQ rule detection, paving the way for improved practices in data quality management tailored to data warehouse environments. The study can guide organizations in selecting the data quality tool that best meets their requirements.
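The automated DQ rule detection discussed in the abstract can be illustrated with a minimal sketch: profile a sample of warehouse rows to propose simple candidate rules (not-null, uniqueness, numeric range), then check incoming rows against them. The column names and rule set below are hypothetical, chosen only for illustration; they are not taken from the paper or from any of the reviewed tools.

```python
# Minimal sketch of automated DQ rule detection: infer candidate rules
# from a sample of rows, then validate new rows against them.
# The columns ("customer_id", "age") are hypothetical examples.

def infer_dq_rules(rows):
    """Propose candidate data quality rules from a sample of records."""
    rules = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        non_null = [v for v in values if v is not None]
        rule = {"not_null": len(non_null) == len(values)}
        # Propose uniqueness if no duplicates were observed in the sample.
        rule["unique"] = len(set(non_null)) == len(non_null)
        # Propose a numeric range if every observed value is numeric.
        if non_null and all(isinstance(v, (int, float)) for v in non_null):
            rule["range"] = (min(non_null), max(non_null))
        rules[col] = rule
    return rules

def check_row(row, rules):
    """Return the list of rule violations for a single incoming row."""
    violations = []
    for col, rule in rules.items():
        v = row.get(col)
        if rule["not_null"] and v is None:
            violations.append(f"{col}: null not allowed")
        if "range" in rule and v is not None:
            lo, hi = rule["range"]
            if not (lo <= v <= hi):
                violations.append(f"{col}: {v} outside [{lo}, {hi}]")
    return violations

sample = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": 51},
    {"customer_id": 3, "age": 28},
]
rules = infer_dq_rules(sample)
print(check_row({"customer_id": 2, "age": 120}, rules))
```

In a real warehouse setting, rules inferred from a sample would of course need human review before enforcement, which is exactly the workload the paper argues should be reduced.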
Related papers
- A Systematic Review of NeurIPS Dataset Management Practices [7.974245534539289]
We present a systematic review of datasets published at the NeurIPS track, focusing on four key aspects: provenance, distribution, ethical disclosure, and licensing.
Our findings reveal that dataset provenance is often unclear due to ambiguous filtering and curation processes.
These inconsistencies underscore the urgent need for standardized data infrastructures for the publication and management of datasets.
arXiv Detail & Related papers (2024-10-31T23:55:41Z)
- Data Quality in Edge Machine Learning: A State-of-the-Art Survey [2.8449839307925955]
Data-driven Artificial Intelligence (AI) systems trained using Machine Learning (ML) are shaping an ever-increasing portion of our lives.
On the one hand, the outsized influence of these systems imposes a high standard of quality, particularly in the data used to train them.
On the other hand, establishing and maintaining standards of Data Quality (DQ) becomes more challenging due to the proliferation of Edge Computing and Internet of Things devices.
arXiv Detail & Related papers (2024-06-01T23:07:05Z)
- Automatic Question-Answer Generation for Long-Tail Knowledge [65.11554185687258]
We propose an automatic approach to generate specialized QA datasets for tail entities.
We conduct extensive experiments by employing pretrained LLMs on our newly generated long-tail QA datasets.
arXiv Detail & Related papers (2024-03-03T03:06:31Z)
- A Systematic Review of Available Datasets in Additive Manufacturing [56.684125592242445]
In-situ monitoring incorporating visual and other sensor technologies allows the collection of extensive datasets during the Additive Manufacturing process.
These datasets have potential for determining the quality of the manufactured output and the detection of defects through the use of Machine Learning.
This systematic review investigates the availability of open image-based datasets originating from AM processes that align with a number of pre-defined selection criteria.
arXiv Detail & Related papers (2024-01-27T16:13:32Z)
- Data Management For Training Large Language Models: A Survey [64.18200694790787]
Data plays a fundamental role in training Large Language Models (LLMs).
This survey aims to provide a comprehensive overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs.
arXiv Detail & Related papers (2023-12-04T07:42:16Z)
- Analyzing Dataset Annotation Quality Management in the Wild [63.07224587146207]
Even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, biases, or artifacts.
While practices and guidelines regarding dataset creation projects exist, large-scale analysis has yet to be performed on how quality management is conducted.
arXiv Detail & Related papers (2023-07-16T21:22:40Z)
- QI2 -- an Interactive Tool for Data Quality Assurance [63.379471124899915]
The planned AI Act from the European Commission defines challenging legal requirements for data quality.
We introduce a novel approach that supports the data quality assurance process of multiple data quality aspects.
arXiv Detail & Related papers (2023-07-07T07:06:38Z)
- How Much More Data Do I Need? Estimating Requirements for Downstream Tasks [99.44608160188905]
Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance?
Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget.
Using our guidelines, practitioners can accurately estimate data requirements of machine learning systems to gain savings in both development time and data acquisition costs.
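One common way to frame this estimation question is learning-curve extrapolation: fit a power law error(n) ~ a * n^(-b) to observed (dataset size, error) pairs, then invert the curve for a target error. The power-law form below is a standard modeling assumption used purely for illustration, not necessarily the regression family used in the paper.

```python
# Learning-curve extrapolation sketch: fit error(n) = a * n**(-b) by
# least squares in log-log space, then solve for the dataset size n
# that reaches a target error. The data points are synthetic.
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(n)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - slope * mx)
    return a, -slope  # so that error(n) = a * n**(-b)

def required_size(a, b, target_error):
    """Invert the fitted curve: the n such that a * n**(-b) = target."""
    return (a / target_error) ** (1.0 / b)

# Synthetic learning curve: error halves each time the data quadruples.
sizes = [1000, 4000, 16000]
errors = [0.20, 0.10, 0.05]
a, b = fit_power_law(sizes, errors)
print(round(required_size(a, b, 0.025)))
```

The payoff matches the abstract's point: knowing the required size up front avoids the cost of over- or under-collecting data.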
arXiv Detail & Related papers (2022-07-04T21:16:05Z)
- Data Quality Toolkit: Automatic assessment of data quality and remediation for machine learning datasets [11.417891017429882]
The Data Quality Toolkit for machine learning is a library of key quality metrics and relevant remediation techniques.
It can reduce the turn-around times of data preparation pipelines and streamline the data quality assessment process.
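As a rough illustration of the metric-plus-remediation pattern such a toolkit follows, the sketch below computes a simple completeness metric and applies mean imputation as a remediation. The function names are illustrative assumptions, not the toolkit's actual API.

```python
# Illustrative metric + remediation pair for a single numeric column:
# measure completeness, then impute missing values with the column mean.
# This mirrors the toolkit's workflow in spirit only; names are made up.

def completeness_score(column):
    """Fraction of non-missing values in a column (1.0 = fully complete)."""
    present = [v for v in column if v is not None]
    return len(present) / len(column)

def impute_mean(column):
    """Remediation: replace missing numeric values with the column mean."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]

ages = [25, None, 31, 40, None]
print(completeness_score(ages))              # before remediation
print(completeness_score(impute_mean(ages))) # after remediation
```

Running the metric before and after remediation is the kind of assessment loop that shortens data preparation turn-around times.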
arXiv Detail & Related papers (2021-08-12T19:22:27Z)
- Quality Prediction of Open Educational Resources: A Metadata-based Approach [0.0]
Metadata play a key role in offering high quality services such as recommendation and search.
We propose an OER metadata scoring model, and build a metadata-based prediction model to anticipate the quality of OERs.
Based on our data and model, we were able to detect high-quality OERs with the F1 score of 94.6%.
arXiv Detail & Related papers (2020-05-21T09:53:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.