Data Quality Awareness: A Journey from Traditional Data Management to Data Science Systems
- URL: http://arxiv.org/abs/2411.03007v1
- Date: Tue, 05 Nov 2024 11:12:25 GMT
- Title: Data Quality Awareness: A Journey from Traditional Data Management to Data Science Systems
- Authors: Sijie Dong, Soror Sahri, Themis Palpanas
- Abstract summary: We present a review of the evolution of data quality awareness from traditional data management systems to modern data-driven AI systems.
As data science systems support a wide range of activities, our focus in this paper lies specifically in the analytics aspect driven by machine learning.
- Score: 9.490118207943196
- Abstract: Artificial intelligence (AI) has transformed various fields, significantly impacting our daily lives. A major factor in AI success is high-quality data. In this paper, we present a comprehensive review of the evolution of data quality (DQ) awareness from traditional data management systems to modern data-driven AI systems, which are integral to data science. We synthesize the existing literature, highlighting the quality challenges and techniques that have evolved from traditional data management to data science including big data and ML fields. As data science systems support a wide range of activities, our focus in this paper lies specifically in the analytics aspect driven by machine learning. We use the cause-effect connection between the quality challenges of ML and those of big data to allow a more thorough understanding of emerging DQ challenges and the related quality awareness techniques in data science systems. To the best of our knowledge, our paper is the first to provide a review of DQ awareness spanning traditional and emergent data science systems. We hope that readers will find this journey through the evolution of data quality awareness insightful and valuable.
Related papers
- Data Quality in Edge Machine Learning: A State-of-the-Art Survey [2.8449839307925955]
Data-driven Artificial Intelligence (AI) systems trained using Machine Learning (ML) are shaping an ever-increasing portion of our lives.
On the one hand, the outsized influence of these systems imposes a high standard of quality, particularly in the data used to train them.
On the other hand, establishing and maintaining standards of Data Quality (DQ) becomes more challenging due to the proliferation of Edge Computing and Internet of Things devices.
arXiv Detail & Related papers (2024-06-01T23:07:05Z) - AI-Driven Frameworks for Enhancing Data Quality in Big Data Ecosystems: Error_Detection, Correction, and Metadata Integration [0.0]
This thesis proposes a novel set of interconnected frameworks aimed at enhancing big data quality comprehensively.
Firstly, we introduce new quality metrics and a weighted scoring system for precise data quality assessment.
Thirdly, we present a generic framework for detecting various quality anomalies using AI models.
arXiv Detail & Related papers (2024-05-06T21:36:45Z) - Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z) - On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z) - QI2 -- an Interactive Tool for Data Quality Assurance [63.379471124899915]
The planned AI Act from the European commission defines challenging legal requirements for data quality.
We introduce a novel approach that supports the data quality assurance process of multiple data quality aspects.
arXiv Detail & Related papers (2023-07-07T07:06:38Z) - Data-centric Artificial Intelligence: A Survey [47.24049907785989]
Recently, the role of data in AI has been significantly magnified, giving rise to the emerging concept of data-centric AI.
In this survey, we discuss the necessity of data-centric AI, followed by a holistic view of three general data-centric goals.
We believe this is the first comprehensive survey that provides a global view of a spectrum of tasks across various stages of the data lifecycle.
arXiv Detail & Related papers (2023-03-17T17:44:56Z) - Data-centric AI: Perspectives and Challenges [51.70828802140165]
Data-centric AI (DCAI) advocates a fundamental shift from model advancements to ensuring data quality and reliability.
We bring together three general missions: training data development, inference data development, and data maintenance.
arXiv Detail & Related papers (2023-01-12T05:28:59Z) - Advanced Data Augmentation Approaches: A Comprehensive Survey and Future Directions [57.30984060215482]
We provide a background of data augmentation, a novel and comprehensive taxonomy of reviewed data augmentation techniques, and the strengths and weaknesses (wherever possible) of each technique.
We also provide comprehensive results on the effect of data augmentation for three popular computer vision tasks: image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-01-07T11:37:32Z) - Data Smells in Public Datasets [7.1460275491017144]
We introduce a novel catalogue of data smells that can be used to indicate early signs of problems in machine learning systems.
To understand the prevalence of data quality issues in datasets, we analyse 25 public datasets and identify 14 data smells.
arXiv Detail & Related papers (2022-03-15T15:44:20Z) - Data Science: Challenges and Directions [42.98602883069444]
We review hundreds of publications that include "data science" in their titles.
We find that the majority of the discussions essentially concern statistics, data mining, machine learning, big data, or broadly data analytics.
We focus on the research and innovation challenges inspired by the nature of data science problems as complex systems.
arXiv Detail & Related papers (2020-06-28T01:49:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.