AI-Driven Frameworks for Enhancing Data Quality in Big Data Ecosystems: Error_Detection, Correction, and Metadata Integration
- URL: http://arxiv.org/abs/2405.03870v1
- Date: Mon, 6 May 2024 21:36:45 GMT
- Title: AI-Driven Frameworks for Enhancing Data Quality in Big Data Ecosystems: Error_Detection, Correction, and Metadata Integration
- Authors: Widad Elouataoui
- Abstract summary: This thesis proposes a novel set of interconnected frameworks aimed at enhancing big data quality comprehensively.
Firstly, we introduce new quality metrics and a weighted scoring system for precise data quality assessment.
Secondly, we present a generic framework for detecting various quality anomalies using AI models.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The widespread adoption of big data has ushered in a new era of data-driven decision-making, transforming numerous industries and sectors. However, the efficacy of these decisions hinges on the quality of the underlying data. Poor data quality can result in inaccurate analyses and deceptive conclusions. Managing the vast volume, velocity, and variety of data sources presents significant challenges, heightening the importance of addressing big data quality issues. While there has been increased attention from both academia and industry, current approaches often lack comprehensiveness and universality. They tend to focus on limited metrics, neglecting other dimensions of data quality. Moreover, existing methods are often context-specific, limiting their applicability across different domains. There is a clear need for intelligent, automated approaches leveraging artificial intelligence (AI) for advanced data quality corrections. To bridge these gaps, this Ph.D. thesis proposes a novel set of interconnected frameworks aimed at enhancing big data quality comprehensively. Firstly, we introduce new quality metrics and a weighted scoring system for precise data quality assessment. Secondly, we present a generic framework for detecting various quality anomalies using AI models. Thirdly, we propose an innovative framework for correcting detected anomalies through predictive modeling. Additionally, we address metadata quality enhancement within big data ecosystems. These frameworks are rigorously tested on diverse datasets, demonstrating their efficacy in improving big data quality. Finally, the thesis concludes with insights and suggestions for future research directions.
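The abstract mentions a weighted scoring system over quality metrics but does not give its formula. As a minimal sketch only, assuming each quality dimension is measured on a [0, 1] scale and the overall score is a normalized weighted average, the idea could look like the following. The function name, dimension names, and weights are hypothetical and are not taken from the thesis.

```python
from typing import Dict


def weighted_quality_score(metrics: Dict[str, float],
                           weights: Dict[str, float]) -> float:
    """Combine per-dimension quality metrics (each in [0, 1]) into one score.

    Assumption: a normalized weighted average. The thesis may define its
    scoring system differently.
    """
    if set(metrics) != set(weights):
        raise ValueError("metrics and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted sum of the metric values, normalized by the total weight.
    weighted_sum = sum(metrics[name] * weights[name] for name in metrics)
    return weighted_sum / total_weight


# Hypothetical dimensions and weights, for illustration only.
metrics = {"completeness": 0.9, "accuracy": 0.8, "consistency": 1.0}
weights = {"completeness": 2.0, "accuracy": 1.0, "consistency": 1.0}
score = weighted_quality_score(metrics, weights)
print(f"overall quality score: {score:.3f}")
```

Normalizing by the total weight keeps the result in [0, 1] regardless of how the weights are scaled, which makes scores comparable across datasets that use different weightings.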
Related papers
- Are AI Detectors Good Enough? A Survey on Quality of Datasets With Machine-Generated Texts [0.0]
A huge number of detectors and collections with AI fragments have emerged, and several detection methods even showed recognition quality up to 99.9%.
Are detectors actually highly trustworthy or do their high benchmark scores come from the poor quality of evaluation datasets?
We present a systematic review of datasets from competitions dedicated to AI-generated content detection and propose methods for evaluating the quality of datasets containing AI-generated fragments.
arXiv Detail & Related papers (2024-10-18T17:59:57Z)
- A Theoretical Framework for AI-driven data quality monitoring in high-volume data environments [1.2753215270475886]
This paper presents a theoretical framework for an AI-driven data quality monitoring system designed to address the challenges of maintaining data quality in high-volume environments.
We examine the limitations of traditional methods in managing the scale, velocity, and variety of big data and propose a conceptual approach leveraging advanced machine learning techniques.
Key components include an intelligent data ingestion layer, adaptive preprocessing mechanisms, context-aware feature extraction, and AI-based quality assessment modules.
arXiv Detail & Related papers (2024-10-11T07:06:36Z)
- A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection [52.228708947607636]
This paper introduces a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework for new methods.
The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics.
We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
arXiv Detail & Related papers (2024-06-05T13:40:07Z)
- Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
- Data Readiness for AI: A 360-Degree Survey [0.9343816282846432]
Poor quality data produces inaccurate and ineffective AI models.
Numerous R&D efforts have been spent on improving data quality.
We propose a taxonomy of data readiness for AI (DRAI) metrics for structured and unstructured datasets.
arXiv Detail & Related papers (2024-04-08T15:19:57Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- QI2 -- an Interactive Tool for Data Quality Assurance [63.379471124899915]
The planned AI Act from the European Commission defines challenging legal requirements for data quality.
We introduce a novel approach that supports the data quality assurance process of multiple data quality aspects.
arXiv Detail & Related papers (2023-07-07T07:06:38Z)
- Quality In / Quality Out: Assessing Data quality in an Anomaly Detection Benchmark [0.13764085113103217]
We show that relatively minor modifications to the same benchmark dataset (UGR'16, a flow-based real-traffic dataset for anomaly detection) have significantly more impact on model performance than the specific machine learning technique considered.
Our findings illustrate the need to devote more attention to (automatic) data quality assessment and optimization techniques in the context of autonomous networks.
arXiv Detail & Related papers (2023-05-31T12:03:12Z)
- Advanced Data Augmentation Approaches: A Comprehensive Survey and Future directions [57.30984060215482]
We provide a background of data augmentation, a novel and comprehensive taxonomy of reviewed data augmentation techniques, and the strengths and weaknesses (wherever possible) of each technique.
We also provide comprehensive results of the data augmentation effect on three popular computer vision tasks: image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-01-07T11:37:32Z)
- Enabling Synthetic Data adoption in regulated domains [1.9512796489908306]
The switch from a Model-Centric to a Data-Centric mindset is putting emphasis on data and its quality rather than algorithms.
In particular, the sensitive nature of the information in highly regulated scenarios needs to be accounted for.
A clever way to bypass such a conundrum relies on Synthetic Data: data obtained from a generative process, learning the real data properties.
arXiv Detail & Related papers (2022-04-13T10:53:54Z)
- Anomaly Detection Based on Selection and Weighting in Latent Space [73.01328671569759]
We propose a novel selection-and-weighting-based anomaly detection framework called SWAD.
Experiments on both benchmark and real-world datasets have shown the effectiveness and superiority of SWAD.
arXiv Detail & Related papers (2021-03-08T10:56:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.