Autoencoder-based cleaning in probabilistic databases
- URL: http://arxiv.org/abs/2106.09764v1
- Date: Thu, 17 Jun 2021 18:46:56 GMT
- Title: Autoencoder-based cleaning in probabilistic databases
- Authors: R.R. Mauritz, F.P.J. Nijweide, J. Goseling, M. van Keulen
- Abstract summary: We propose a data-cleaning autoencoder capable of near-automatic data quality improvement.
It learns the structure and dependencies in the data to identify and correct doubtful values.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the field of data integration, data quality problems are often encountered
when extracting, combining, and merging data. The probabilistic data
integration approach represents information about such problems as
uncertainties in a probabilistic database. In this paper, we propose a
data-cleaning autoencoder capable of near-automatic data quality improvement.
It learns the structure and dependencies in the data to identify and correct
doubtful values. A theoretical framework is provided, and experiments show that
it can remove significant amounts of noise from categorical and numeric
probabilistic data. Our method does not require clean data. We do, however,
show that manually cleaning a small fraction of the data significantly improves
performance.
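The abstract describes the approach only at a high level, so a concrete illustration may help. Below is a minimal PyTorch sketch of the general idea of a data-cleaning autoencoder over mixed categorical/numeric rows, trained on the (possibly noisy) data itself. This is an assumption-laden sketch, not the authors' implementation: the attribute layout, dimensions, and names are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative layout (an assumption, not taken from the paper):
# one categorical attribute with 4 classes, one-hot encoded,
# plus 2 standardized numeric attributes.
CAT_DIM, NUM_DIM = 4, 2
IN_DIM = CAT_DIM + NUM_DIM

class CleaningAutoencoder(nn.Module):
    """Reconstructs a tabular row; a large reconstruction error on an
    attribute marks the stored value as doubtful."""
    def __init__(self, hidden=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(IN_DIM, hidden), nn.ReLU())
        self.dec_cat = nn.Linear(hidden, CAT_DIM)  # class logits
        self.dec_num = nn.Linear(hidden, NUM_DIM)  # numeric predictions

    def forward(self, x):
        h = self.encoder(x)
        return self.dec_cat(h), self.dec_num(h)

# Train on the noisy data itself -- no clean data required.
cat_idx = torch.randint(0, CAT_DIM, (256,))  # stand-in categorical data
num = torch.randn(256, NUM_DIM)              # stand-in numeric data
x = torch.cat([F.one_hot(cat_idx, CAT_DIM).float(), num], dim=1)

model = CleaningAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    cat_logits, num_pred = model(x)
    loss = F.cross_entropy(cat_logits, cat_idx) + F.mse_loss(num_pred, num)
    loss.backward()
    opt.step()

# Cleaning: flag and replace values the reconstruction strongly disagrees with.
with torch.no_grad():
    cat_logits, num_pred = model(x)
cleaned_cat = cat_logits.argmax(dim=1)  # most plausible class per row
```

In the probabilistic-database setting of the paper, the natural counterpart is presumably to adjust the distribution stored over possible values rather than overwrite them outright; the abstract's theoretical framework addresses that setting.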
Related papers
- Dataset Growth [59.68869191071907]
InfoGrowth is an efficient online algorithm for data cleaning and selection.
It can improve data quality/efficiency on both single-modal and multi-modal tasks.
arXiv Detail & Related papers (2024-05-28T16:43:57Z)
- Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic [99.3682210827572]
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets.
Data curation strategies are typically developed agnostic of the available compute for training.
We introduce neural scaling laws that account for the non-homogeneous nature of web data.
arXiv Detail & Related papers (2024-04-10T17:27:54Z)
- BClean: A Bayesian Data Cleaning System [17.525913626374503]
BClean is a Bayesian data cleaning system that features automatic Bayesian network construction and user interaction.
By evaluating on both real-world and synthetic datasets, we demonstrate that BClean is capable of achieving an F-measure of up to 0.9 in data cleaning.
arXiv Detail & Related papers (2023-11-11T09:22:07Z)
- Knockoffs-SPR: Clean Sample Selection in Learning with Noisy Labels [56.81761908354718]
We propose a novel theoretically guaranteed clean sample selection framework for learning with noisy labels.
Knockoffs-SPR can be regarded as a sample selection module for a standard supervised training pipeline.
We further combine it with a semi-supervised algorithm to exploit the support of noisy data as unlabeled data.
arXiv Detail & Related papers (2023-01-02T07:13:28Z)
- On-the-fly Denoising for Data Augmentation in Natural Language Understanding [101.46848743193358]
We propose an on-the-fly denoising technique for data augmentation that learns from soft augmented labels provided by an organic teacher model trained on the cleaner original data.
Our method can be applied to general augmentation techniques and consistently improve the performance on both text classification and question-answering tasks.
arXiv Detail & Related papers (2022-12-20T18:58:33Z)
- An epistemic approach to model uncertainty in data-graphs [2.1261712640167847]
Graph databases can suffer from errors and discrepancies with respect to the real-world data they intend to represent.
In this work we explore the notion of probabilistic unclean graph databases, previously proposed for relational databases.
We define two computational problems, data cleaning and probabilistic query answering, and study the complexity of each.
arXiv Detail & Related papers (2021-09-29T00:08:27Z)
- Hidden Biases in Unreliable News Detection Datasets [60.71991809782698]
We show that selection bias during data collection leads to undesired artifacts in the datasets.
We observed a significant drop (>10%) in accuracy for all tested models when evaluated on a clean split with no train/test source overlap.
We suggest that future dataset creation include a simple model as a difficulty/bias probe, and that future model development use a clean, non-overlapping site and date split.
arXiv Detail & Related papers (2021-04-20T17:16:41Z)
- Batchwise Probabilistic Incremental Data Cleaning [5.035172070107058]
This report addresses the problem of performing holistic data cleaning incrementally.
To the best of our knowledge, our contributions constitute the first incremental framework for holistic data cleaning.
Our approach outperforms the competitors with respect to repair quality, execution time, and memory consumption.
arXiv Detail & Related papers (2020-11-09T20:15:02Z)
- PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming [65.88506015656951]
We present PClean, a probabilistic programming language for leveraging dataset-specific knowledge to clean and normalize dirty data.
PClean is powered by three modeling and inference contributions: (1) a non-parametric model of relational database instances, customizable via probabilistic programs, (2) a sequential Monte Carlo inference algorithm that exploits the model's structure, and (3) near-optimal SMC proposals and blocked Gibbs rejuvenation moves constructed on a per-dataset basis.
arXiv Detail & Related papers (2020-07-23T08:01:47Z)
- Establishing strong imputation performance of a denoising autoencoder in a wide range of missing data problems [0.0]
We develop a consistent framework for both training and imputation.
We benchmarked the results against state-of-the-art imputation methods.
The developed autoencoder obtained the smallest error for all ranges of initial data corruption.
arXiv Detail & Related papers (2020-04-06T12:00:30Z)
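The last entry above describes denoising-autoencoder imputation without spelling out the inference loop. A common pattern, shown in the sketch below, is fixed-point refinement: pre-fill missing cells, reconstruct, copy the reconstructions back into the missing positions, and repeat. This is an assumption on my part, not necessarily that entry's "consistent framework", and the helper names are hypothetical.

```python
import torch

def impute(reconstruct, x, mask, n_iters=10):
    """Iteratively refine missing entries with a trained autoencoder.

    reconstruct: callable mapping a (batch, dim) tensor to its reconstruction
    x:    (batch, dim) tensor with missing entries pre-filled (e.g. column means)
    mask: (batch, dim) bool tensor, True where the value was actually observed
    """
    x = x.clone()
    for _ in range(n_iters):
        with torch.no_grad():
            x_hat = reconstruct(x)
        x = torch.where(mask, x, x_hat)  # keep observed values, overwrite missing
    return x

# Hypothetical usage with any trained autoencoder-style reconstructor:
# x_filled = impute(model_reconstruct, x_init, observed_mask)
```

Because observed values are clamped at every step, the iteration can only move the missing entries toward values consistent with the learned dependencies, which is what makes such fixed-point loops a standard way to use a denoising autoencoder as an imputer.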
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.