Towards Explainable Automated Data Quality Enhancement without Domain Knowledge
- URL: http://arxiv.org/abs/2409.10139v1
- Date: Mon, 16 Sep 2024 10:08:05 GMT
- Title: Towards Explainable Automated Data Quality Enhancement without Domain Knowledge
- Authors: Djibril Sarr
- Abstract summary: We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset.
Our primary objective is to address three fundamental types of defects: absence, redundancy, and incoherence.
We adopt a hybrid approach that integrates statistical methods with machine learning algorithms.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the era of big data, ensuring the quality of datasets has become increasingly crucial across various domains. We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset, regardless of its specific content, focusing on both textual and numerical data. Our primary objective is to address three fundamental types of defects: absence, redundancy, and incoherence. At the heart of our approach lies a rigorous demand for both explainability and interpretability, ensuring that the rationale behind the identification and correction of data anomalies is transparent and understandable. To achieve this, we adopt a hybrid approach that integrates statistical methods with machine learning algorithms. By leveraging statistical techniques alongside machine learning, we strike a balance between accuracy and explainability, enabling users to trust and comprehend the assessment process. Acknowledging the challenges associated with automating the data quality assessment process, particularly in terms of time efficiency and accuracy, we adopt a pragmatic strategy, employing resource-intensive algorithms only when necessary and favoring simpler, more efficient solutions whenever possible. Through a practical analysis conducted on a publicly available dataset, we illustrate the challenges that arise when trying to enhance data quality while preserving explainability. We demonstrate the effectiveness of our approach in detecting and rectifying missing values, duplicates, and typographical errors, as well as the challenges that remain to be addressed to achieve similar accuracy on statistical outliers and logic errors under the constraints set in our work.
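To make the three defect categories concrete, the sketch below shows what minimal, explainable checks of this kind can look like in Python. It is an illustrative approximation, not the paper's implementation: the column names, the string-similarity threshold, and the IQR outlier rule are assumptions made for the example.

```python
# Minimal sketch of absence / redundancy / incoherence checks on a tabular
# dataset. NOT the paper's implementation: column names, the similarity
# threshold, and the IQR outlier rule are illustrative assumptions.
import pandas as pd
from difflib import SequenceMatcher

def report_missing(df: pd.DataFrame) -> pd.Series:
    """Absence: fraction of missing values per column."""
    return df.isna().mean().sort_values(ascending=False)

def report_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Redundancy: rows that exactly duplicate an earlier row."""
    return df[df.duplicated(keep="first")]

def near_duplicate_strings(values, threshold: float = 0.9):
    """Incoherence/redundancy: pairs of distinct strings that are almost
    identical, a cheap proxy for typographical errors."""
    uniques = sorted(set(map(str, values)))
    pairs = []
    for i, a in enumerate(uniques):
        for b in uniques[i + 1:]:
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.append((a, b))
    return pairs

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Incoherence: numerical values outside the Tukey fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]

if __name__ == "__main__":
    df = pd.DataFrame({
        "city": ["Paris", "Paris", "Pariss", "Lyon", None],
        "amount": [10.0, 10.0, 11.5, 9.8, 1000.0],
    })
    print(report_missing(df))
    print(report_duplicates(df))
    print(near_duplicate_strings(df["city"].dropna()))
    print(iqr_outliers(df["amount"]))
```

Each check returns a directly inspectable artifact (a missing-value rate, a duplicate row, a suspicious string pair, a flagged value), which is the kind of transparency the abstract argues for before resorting to heavier machine learning models.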
Related papers
- Optimisation Strategies for Ensuring Fairness in Machine Learning: With and Without Demographics [4.662958544712181]
This paper introduces two formal frameworks to tackle open questions in machine learning fairness.
In one framework, operator-valued optimisation and min-max objectives are employed to address unfairness in time-series problems.
In the second framework, the challenge of lacking sensitive attributes, such as gender and race, in commonly used datasets is addressed.
arXiv Detail & Related papers (2024-11-13T22:29:23Z) - AdapFair: Ensuring Continuous Fairness for Machine Learning Operations [7.909259406397651]
We present a debiasing framework designed to find an optimal fair transformation of input data.
We leverage the normalizing flows to enable efficient, information-preserving data transformation.
We introduce an efficient optimization algorithm with closed-formed gradient computations.
arXiv Detail & Related papers (2024-09-23T15:01:47Z) - Data Valuation with Gradient Similarity [1.997283751398032]
Data Valuation algorithms quantify the value of each sample in a dataset based on its contribution or importance to a given predictive task.
We present a simple alternative to existing methods, termed Data Valuation with Gradient Similarity (DVGS)
Our approach has the ability to rapidly and accurately identify low-quality data, which can reduce the need for expert knowledge and manual intervention in data cleaning tasks.
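For intuition, a minimal sketch of the gradient-similarity idea follows; logistic regression and cosine similarity against a mean validation gradient are illustrative assumptions, not the paper's exact DVGS algorithm.

```python
# Sketch of gradient-similarity data valuation (in the spirit of DVGS, not
# the paper's exact procedure): each training sample is scored by the cosine
# similarity between its loss gradient and the mean gradient of a clean
# validation set. Low scores flag potentially low-quality samples.
import numpy as np

def logistic_grad(w, x, y):
    """Gradient of the logistic loss for a single (x, y) pair, y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-x @ w))
    return (p - y) * x

def gradient_similarity_scores(w, X_train, y_train, X_val, y_val):
    g_val = np.mean([logistic_grad(w, x, y) for x, y in zip(X_val, y_val)], axis=0)
    scores = []
    for x, y in zip(X_train, y_train):
        g = logistic_grad(w, x, y)
        scores.append(g @ g_val / (np.linalg.norm(g) * np.linalg.norm(g_val) + 1e-12))
    return np.array(scores)
```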
arXiv Detail & Related papers (2024-05-13T22:10:00Z) - Uncertainty for Active Learning on Graphs [70.44714133412592]
Uncertainty Sampling is an Active Learning strategy that aims to improve the data efficiency of machine learning models.
We benchmark Uncertainty Sampling beyond predictive uncertainty and highlight a significant performance gap to other Active Learning strategies.
We develop ground-truth Bayesian uncertainty estimates in terms of the data generating process and prove their effectiveness in guiding Uncertainty Sampling toward optimal queries.
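For reference, the baseline entropy-based query rule that Uncertainty Sampling typically uses can be sketched as follows; this illustrates the standard strategy the paper benchmarks, not its Bayesian ground-truth uncertainty estimates.

```python
# Minimal entropy-based uncertainty sampling: query the unlabeled point whose
# predicted class distribution has maximum entropy.
import numpy as np

def query_most_uncertain(probs: np.ndarray) -> int:
    """probs: (n_unlabeled, n_classes) predicted class probabilities."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return int(np.argmax(entropy))
```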
arXiv Detail & Related papers (2024-05-02T16:50:47Z) - Achievable Fairness on Your Data With Utility Guarantees [16.78730663293352]
In machine learning fairness, training models that minimize disparity across different sensitive groups often leads to diminished accuracy.
We present a computationally efficient approach to approximate the fairness-accuracy trade-off curve tailored to individual datasets.
We introduce a novel methodology for quantifying uncertainty in our estimates, thereby providing practitioners with a robust framework for auditing model fairness.
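As a rough illustration of such a trade-off curve (a toy construction, not the paper's dataset-specific estimation or uncertainty-quantification method), one can sweep a decision threshold and record accuracy against a demographic-parity gap:

```python
# Toy fairness-accuracy trade-off curve: sweep a decision threshold and record
# accuracy vs. the demographic parity gap between two groups (0 and 1).
import numpy as np

def tradeoff_curve(scores, y_true, group, thresholds=np.linspace(0.1, 0.9, 17)):
    curve = []
    for t in thresholds:
        y_pred = (scores >= t).astype(int)
        acc = np.mean(y_pred == y_true)
        gap = abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())
        curve.append((t, acc, gap))  # (threshold, accuracy, disparity)
    return curve
```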
arXiv Detail & Related papers (2024-02-27T00:59:32Z) - Collect, Measure, Repeat: Reliability Factors for Responsible AI Data
Collection [8.12993269922936]
We argue that data collection for AI should be performed in a responsible manner.
We propose a Responsible AI (RAI) methodology designed to guide the data collection with a set of metrics.
arXiv Detail & Related papers (2023-08-22T18:01:27Z) - Exploring the Trade-off between Plausibility, Change Intensity and
Adversarial Power in Counterfactual Explanations using Multi-objective
Optimization [73.89239820192894]
We argue that automated counterfactual generation should regard several aspects of the produced adversarial instances.
We present a novel framework for the generation of counterfactual examples.
arXiv Detail & Related papers (2022-05-20T15:02:53Z) - Low-Regret Active learning [64.36270166907788]
We develop an online learning algorithm for identifying unlabeled data points that are most informative for training.
At the core of our work is an efficient algorithm for sleeping experts that is tailored to achieve low regret on predictable (easy) instances.
arXiv Detail & Related papers (2021-04-06T22:53:45Z) - Accurate and Robust Feature Importance Estimation under Distribution Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z) - Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z) - Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)