Batchwise Probabilistic Incremental Data Cleaning
- URL: http://arxiv.org/abs/2011.04730v1
- Date: Mon, 9 Nov 2020 20:15:02 GMT
- Title: Batchwise Probabilistic Incremental Data Cleaning
- Authors: Paulo H. Oliveira, Daniel S. Kaster, Caetano Traina-Jr., Ihab F. Ilyas
- Abstract summary: This report addresses the problem of performing holistic data cleaning incrementally.
To the best of our knowledge, our contributions constitute the first incremental framework for holistic data cleaning.
Our approach outperforms its competitors with respect to repair quality, execution time, and memory consumption.
- Score: 5.035172070107058
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lack of data and data quality issues are among the main bottlenecks that
prevent further artificial intelligence adoption within many organizations,
pushing data scientists to spend most of their time cleaning data before being
able to answer analytical questions. Hence, there is a need for more effective
and efficient data cleaning solutions, a challenge that, not surprisingly, is
rife with theoretical and engineering problems. This report addresses the problem of
performing holistic data cleaning incrementally, given a fixed rule set and an
evolving categorical relational dataset acquired in sequential batches. To the
best of our knowledge, our contributions constitute the first incremental
framework that cleans data (i) independently of user interventions, (ii)
without requiring knowledge about the incoming dataset, such as the number of
classes per attribute, and (iii) holistically, enabling multiple error types to
be repaired simultaneously, and thus avoiding conflicting repairs. Extensive
experiments show that our approach outperforms its competitors with respect to
repair quality, execution time, and memory consumption.
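The paper's own algorithms are not reproduced here, but the core idea of accumulating statistics over sequential batches and repairing rule violations without user intervention can be sketched as follows. This is a minimal illustration assuming simple functional-dependency-style rules and a majority-vote repair policy; the class and method names are hypothetical, not the framework's API.

```python
# Minimal sketch of batch-wise, rule-driven cleaning for categorical data.
# The rule format and repair policy are illustrative assumptions, not the
# paper's actual framework.
from collections import Counter, defaultdict

class IncrementalCleaner:
    """Accumulates co-occurrence statistics across batches and repairs
    values that violate simple functional-dependency-style rules
    (the left-hand attribute value should determine the right-hand one)."""

    def __init__(self, rules):
        self.rules = rules  # e.g. [("zip", "city")]
        # stats[(lhs, rhs)][lhs_value] -> Counter of observed rhs values
        self.stats = defaultdict(lambda: defaultdict(Counter))

    def observe(self, batch):
        # Incremental step: only the new batch is scanned; earlier
        # batches survive only as aggregated counts.
        for row in batch:
            for lhs, rhs in self.rules:
                self.stats[(lhs, rhs)][row[lhs]][row[rhs]] += 1

    def repair(self, batch):
        repaired = []
        for row in batch:
            row = dict(row)
            for lhs, rhs in self.rules:
                counts = self.stats[(lhs, rhs)][row[lhs]]
                if counts:
                    mode, freq = counts.most_common(1)[0]
                    # Repair only when the current value is a clear minority.
                    if row[rhs] != mode and freq > counts[row[rhs]]:
                        row[rhs] = mode
            repaired.append(row)
        return repaired

cleaner = IncrementalCleaner(rules=[("zip", "city")])
batches = [
    [{"zip": "10001", "city": "New York"}, {"zip": "10001", "city": "New York"}],
    [{"zip": "10001", "city": "Nwe York"}],  # later batch; typo gets repaired
]
for batch in batches:
    cleaner.observe(batch)
    print(cleaner.repair(batch))
```

Because only aggregated counts are kept, each new batch is processed without rescanning earlier data, which is what makes the incremental setting attractive.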
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which degrade training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
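As a rough illustration of the general idea (not the paper's actual selection criterion), one can rank image-caption pairs by their CLIP alignment score and keep the best-aligned fraction; the model name, file names, and keep ratio below are assumptions.

```python
# Illustrative only: rank image-caption pairs by CLIP image-text alignment
# and keep the best-aligned fraction.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_top_pairs(images, captions, keep_ratio=0.8):
    """Return indices of the keep_ratio best-aligned (image, caption) pairs."""
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Diagonal of the logit matrix = similarity of each pair with itself.
    scores = out.logits_per_image.diag()
    k = max(1, int(len(images) * keep_ratio))
    return torch.topk(scores, k).indices.tolist()

images = [Image.open(p) for p in ["cat.jpg", "dog.jpg"]]  # hypothetical files
print(select_top_pairs(images, ["a photo of a cat", "a photo of a dog"]))
```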
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Towards Explainable Automated Data Quality Enhancement without Domain Knowledge [0.0]
We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset.
Our primary objective is to address three fundamental types of defects: absence, redundancy, and incoherence.
We adopt a hybrid approach that integrates statistical methods with machine learning algorithms.
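A minimal sketch of what checks for these three defect types might look like, using plain pandas; the paper's hybrid statistical/ML pipeline is considerably more sophisticated, and the bounds used here are illustrative assumptions.

```python
# Assumed illustration of the three defect types named above
# (absence, redundancy, incoherence) using plain pandas checks.
import pandas as pd

df = pd.DataFrame({
    "age":  [34, None, 29, 29, 150],        # None -> absence, 150 -> incoherence
    "name": ["Ann", "Bob", "Eve", "Eve", "Joe"],
})

absence = df.isna().sum()                    # missing values per column
redundancy = df.duplicated().sum()           # exact duplicate rows
# Incoherence: a simple statistical rule; real systems would learn bounds.
incoherent_age = df[(df["age"] < 0) | (df["age"] > 120)]

print(absence, redundancy, incoherent_age, sep="\n")
```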
arXiv Detail & Related papers (2024-09-16T10:08:05Z)
- Dataset Growth [59.68869191071907]
InfoGrowth is an efficient online algorithm for data cleaning and selection.
It can improve data quality/efficiency on both single-modal and multi-modal tasks.
arXiv Detail & Related papers (2024-05-28T16:43:57Z)
- AI-Driven Frameworks for Enhancing Data Quality in Big Data Ecosystems: Error Detection, Correction, and Metadata Integration [0.0]
This thesis proposes a novel set of interconnected frameworks aimed at enhancing big data quality comprehensively.
First, we introduce new quality metrics and a weighted scoring system for precise data quality assessment.
We also present a generic framework for detecting various quality anomalies using AI models.
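A hypothetical sketch of such a weighted quality-scoring step is shown below; the metric names and weights are illustrative assumptions, not the thesis's actual scoring system.

```python
# Hypothetical weighted data-quality score; metrics and weights are assumed.
import pandas as pd

def quality_score(df: pd.DataFrame, weights=None) -> float:
    weights = weights or {"completeness": 0.5, "uniqueness": 0.3, "validity": 0.2}
    metrics = {
        "completeness": 1.0 - df.isna().mean().mean(),   # share of non-null cells
        "uniqueness":   1.0 - df.duplicated().mean(),    # share of non-duplicate rows
        # Validity here: numeric columns within an assumed plausible range.
        "validity": df.select_dtypes("number").apply(
            lambda c: c.between(0, 1e6).mean()).mean(),
    }
    return sum(weights[m] * metrics[m] for m in weights)

df = pd.DataFrame({"age": [30, None, -5], "name": ["a", "a", "b"]})
print(f"quality score: {quality_score(df):.2f}")
```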
arXiv Detail & Related papers (2024-05-06T21:36:45Z)
- Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection [8.12993269922936]
We argue that data collection for AI should be performed in a responsible manner.
We propose a Responsible AI (RAI) methodology designed to guide data collection with a set of metrics.
arXiv Detail & Related papers (2023-08-22T18:01:27Z)
- Make Every Example Count: On the Stability and Utility of Self-Influence for Learning from Noisy NLP Datasets [4.142507103595571]
We study the suitability of task-agnostic self-influence scores of training examples for data cleaning.
We analyze their efficacy in capturing naturally occurring outliers.
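Self-influence in the TracIn sense reduces to the squared gradient norm of an example's own loss, summed over training checkpoints (ignoring per-checkpoint learning-rate weighting). The sketch below computes that proxy; the toy model and data are assumptions, not the paper's setup.

```python
# TracIn-style self-influence proxy: squared gradient norm of each example's
# loss, summed over saved checkpoints. High scores often flag mislabeled or
# outlier examples.
import torch
import torch.nn as nn

def self_influence(model, checkpoints, x, y, loss_fn=nn.functional.cross_entropy):
    score = 0.0
    for state in checkpoints:          # list of state_dicts from training
        model.load_state_dict(state)
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, list(model.parameters()))
        score += sum((g ** 2).sum().item() for g in grads)
    return score

model = nn.Linear(4, 3)                      # toy classifier
ckpts = [model.state_dict()]                 # pretend: one saved checkpoint
x, y = torch.randn(4), torch.tensor(1)
print(self_influence(model, ckpts, x, y))    # rank examples by this score
```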
arXiv Detail & Related papers (2023-02-27T17:00:06Z)
- Optimizing Data Collection for Machine Learning [87.37252958806856]
Modern deep learning systems require huge data sets to achieve impressive performance.
Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay.
We propose a new paradigm for modeling the data collection as a formal optimal data collection problem.
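A toy illustration of that framing (with an assumed learning curve and made-up costs): pick the collection size that minimizes collection cost plus the penalty for missing a performance target.

```python
# Toy cost-minimization view of data collection; the learning curve,
# per-point cost, and penalty are all assumptions.
import numpy as np

def expected_cost(n, target=0.90, cost_per_point=0.01, penalty=5000.0):
    perf = 0.95 - 0.8 * n ** -0.35               # assumed learning curve
    miss = 1.0 if perf < target else 0.0         # simplistic: deterministic curve
    return cost_per_point * n + penalty * miss   # over- vs. under-collection

candidates = np.arange(100, 50001, 100)
best = min(candidates, key=expected_cost)
print(f"collect n = {best} points (expected cost {expected_cost(best):.0f})")
```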
arXiv Detail & Related papers (2022-10-03T21:19:05Z)
- How Much More Data Do I Need? Estimating Requirements for Downstream Tasks [99.44608160188905]
Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance?
Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget.
Using our guidelines, practitioners can accurately estimate data requirements of machine learning systems to gain savings in both development time and data acquisition costs.
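One common way to make such estimates, and a reasonable sketch of the idea, is to fit a saturating power law to pilot measurements and invert it for the target score; the pilot numbers below are made up.

```python
# Learning-curve extrapolation sketch: fit accuracy vs. dataset size,
# then solve for the size that reaches a target accuracy.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a - b * n ** (-c)              # saturating learning curve

sizes = np.array([500, 1000, 2000, 4000, 8000])
accs  = np.array([0.71, 0.76, 0.80, 0.83, 0.85])   # pilot measurements

(a, b, c), _ = curve_fit(power_law, sizes, accs, p0=[0.9, 1.0, 0.5],
                         maxfev=10000)

target = 0.88
# Invert a - b*n^(-c) = target  =>  n = (b / (a - target)) ** (1/c)
needed = (b / (a - target)) ** (1.0 / c) if a > target else float("inf")
print(f"estimated examples needed for {target:.0%} accuracy: {needed:,.0f}")
```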
arXiv Detail & Related papers (2022-07-04T21:16:05Z)
- Autoencoder-based cleaning in probabilistic databases [0.0]
We propose a data-cleaning autoencoder capable of near-automatic data quality improvement.
It learns the structure and dependencies in the data to identify and correct doubtful values.
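A toy sketch of that mechanism: train a small autoencoder on one-hot-encoded categorical tuples with a bottleneck, so that it learns cross-attribute dependencies, then flag rows whose reconstruction disagrees with the stored value. The architecture and data here are illustrative assumptions, not the paper's model.

```python
# Toy autoencoder cleaning: attribute B should equal attribute A; a few
# injected inconsistencies are flagged because the bottleneck forces the
# reconstruction to follow the dominant A -> B dependency.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n = 256
a = torch.randint(0, 3, (n,))
b = a.clone()
b[:5] = (b[:5] + 1) % 3            # inject values inconsistent with A

x = torch.cat([F.one_hot(a, 3), F.one_hot(b, 3)], dim=1).float()

model = nn.Sequential(nn.Linear(6, 2), nn.Tanh(), nn.Linear(2, 6))
opt = torch.optim.Adam(model.parameters(), lr=0.05)
for _ in range(400):               # learn the dependency B == A
    opt.zero_grad()
    out = model(x)
    loss = F.cross_entropy(out[:, :3], a) + F.cross_entropy(out[:, 3:], b)
    loss.backward()
    opt.step()

with torch.no_grad():
    recon_b = model(x)[:, 3:].argmax(dim=1)
doubtful = (recon_b != b).nonzero().flatten()
print("rows with doubtful B values:", doubtful.tolist())  # likely rows 0..4
```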
arXiv Detail & Related papers (2021-06-17T18:46:56Z)
- ORDisCo: Effective and Efficient Usage of Incremental Unlabeled Data for Semi-supervised Continual Learning [52.831894583501395]
Continual learning usually assumes that incoming data are fully labeled, which may not hold in real applications.
We propose deep Online Replay with Discriminator Consistency (ORDisCo) to interdependently learn a classifier with a conditional generative adversarial network (GAN).
We show that ORDisCo achieves significant performance improvements on various benchmark datasets for semi-supervised continual learning (SSCL).
arXiv Detail & Related papers (2021-01-02T09:04:14Z)
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, which may be an imbalanced subset of the original training dataset or a related-domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
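DeGAN itself trains a data-enriching GAN against the frozen classifier; as a deliberately simpler stand-in for the retrieval step, the sketch below keeps the related-domain samples that the trained classifier labels with the lowest predictive entropy. The model and candidate pool are toy assumptions.

```python
# Simpler stand-in for GAN-based retrieval: keep related-domain samples
# on which a trained classifier is most confident (lowest entropy).
import torch
import torch.nn as nn

torch.manual_seed(0)
classifier = nn.Sequential(nn.Linear(16, 10))    # pretend: a trained classifier
pool = torch.randn(1000, 16)                     # related-domain candidates

with torch.no_grad():
    probs = classifier(pool).softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=1)

keep = torch.topk(entropy, k=100, largest=False).indices  # most confident 100
print("retrieved", keep.shape[0], "representative samples")
```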
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.