PClean: Bayesian Data Cleaning at Scale with Domain-Specific
Probabilistic Programming
- URL: http://arxiv.org/abs/2007.11838v4
- Date: Tue, 27 Oct 2020 18:41:52 GMT
- Authors: Alexander K. Lew, Monica Agrawal, David Sontag, Vikash K. Mansinghka
- Abstract summary: We present PClean, a probabilistic programming language for leveraging dataset-specific knowledge to clean and normalize dirty data.
PClean is powered by three modeling and inference contributions: (1) a non-parametric model of relational database instances, customizable via probabilistic programs, (2) a sequential Monte Carlo inference algorithm that exploits the model's structure, and (3) near-optimal SMC proposals and blocked Gibbs rejuvenation moves constructed on a per-dataset basis.
- Score: 65.88506015656951
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data cleaning can be naturally framed as probabilistic inference in a
generative model, combining a prior distribution over ground-truth databases
with a likelihood that models the noisy channel by which the data are filtered
and corrupted to yield incomplete, dirty, and denormalized datasets. Based on
this view, we present PClean, a probabilistic programming language for
leveraging dataset-specific knowledge to clean and normalize dirty data. PClean
is powered by three modeling and inference contributions: (1) a non-parametric
model of relational database instances, customizable via probabilistic
programs, (2) a sequential Monte Carlo inference algorithm that exploits the
model's structure, and (3) near-optimal SMC proposals and blocked Gibbs
rejuvenation moves constructed on a per-dataset basis. We show empirically that
short (< 50-line) PClean programs can be faster and more accurate than generic
PPL inference on multiple data-cleaning benchmarks; perform comparably in terms
of accuracy and runtime to state-of-the-art data-cleaning systems (unlike
generic PPL inference given the same runtime); and scale to real-world datasets
with millions of records.
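The generative framing above (a prior over ground-truth values combined with a noisy-channel likelihood) can be illustrated with a minimal sketch. This is not PClean's actual syntax or inference algorithm (PClean uses SMC with per-dataset proposals over relational database instances); it is a toy exact posterior over a handful of hypothetical candidate clean values, with a string-similarity ratio standing in for a real corruption likelihood:

```python
import difflib

def posterior_clean_value(observed, prior):
    """Exact posterior P(clean | observed) over a finite set of candidates.

    `prior` maps candidate clean strings to prior probabilities; a string-
    similarity ratio stands in for a real noisy-channel likelihood.
    """
    scores = {}
    for clean, p in prior.items():
        likelihood = difflib.SequenceMatcher(None, observed, clean).ratio()
        scores[clean] = p * likelihood
    total = sum(scores.values())
    return {clean: s / total for clean, s in scores.items()}

# Hypothetical example: a city field observed with a transposition typo.
prior = {"New York": 0.5, "Newark": 0.3, "New Haven": 0.2}
post = posterior_clean_value("New Yrok", prior)
best = max(post, key=post.get)  # most probable ground-truth value
```

In a real system the prior would be a learned, non-parametric model over database instances and the likelihood a calibrated corruption model; here both are placeholders for illustration only.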
Related papers
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
- BClean: A Bayesian Data Cleaning System [17.525913626374503]
BClean is a Bayesian Cleaning system that features automatic Bayesian network construction and user interaction.
By evaluating on both real-world and synthetic datasets, we demonstrate that BClean is capable of achieving an F-measure of up to 0.9 in data cleaning.
arXiv Detail & Related papers (2023-11-11T09:22:07Z)
- On Calibrating Diffusion Probabilistic Models [78.75538484265292]
Diffusion probabilistic models (DPMs) have achieved promising results in diverse generative tasks.
We propose a simple way for calibrating an arbitrary pretrained DPM, with which the score matching loss can be reduced and the lower bounds of model likelihood can be increased.
Our calibration method is performed only once and the resulting models can be used repeatedly for sampling.
arXiv Detail & Related papers (2023-02-21T14:14:40Z)
- Knockoffs-SPR: Clean Sample Selection in Learning with Noisy Labels [56.81761908354718]
We propose a novel theoretically guaranteed clean sample selection framework for learning with noisy labels.
Knockoffs-SPR can be regarded as a sample selection module for a standard supervised training pipeline.
We further combine it with a semi-supervised algorithm to exploit the support of noisy data as unlabeled data.
arXiv Detail & Related papers (2023-01-02T07:13:28Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, can be learned from aggregated data alone by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- An epistemic approach to model uncertainty in data-graphs [2.1261712640167847]
Graph databases can suffer from errors and discrepancies with respect to the real-world data they intend to represent.
In this work, we explore the notion of probabilistic unclean graph databases, previously proposed for relational databases.
We define two computational problems, data cleaning and probabilistic query answering, and study the complexity of each.
arXiv Detail & Related papers (2021-09-29T00:08:27Z)
- Noise-Resistant Deep Metric Learning with Probabilistic Instance Filtering [59.286567680389766]
Noisy labels are commonly found in real-world data, which cause performance degradation of deep neural networks.
We propose the Probabilistic Ranking-based Instance Selection with Memory (PRISM) approach for deep metric learning (DML).
PRISM calculates the probability of a label being clean, and filters out potentially noisy samples.
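The clean-probability idea can be sketched in a simplified form. PRISM's actual memory-bank similarity and ranking machinery are replaced here by a softmax over distances to hypothetical class centroids, so this illustrates only the filtering principle, not PRISM itself:

```python
import math

def clean_probability(x, label, centroids):
    """Probability that `label` is clean for feature vector `x`: a softmax
    over negative squared distances to per-class centroids (a crude stand-in
    for PRISM's memory-bank similarity)."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    logits = {c: -sqdist(x, mu) for c, mu in centroids.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {c: math.exp(v - m) for c, v in logits.items()}
    return exps[label] / sum(exps.values())

# Hypothetical centroids; a sample near "cat" but labelled "dog" looks noisy.
centroids = {"cat": (0.0, 0.0), "dog": (4.0, 4.0)}
p = clean_probability((0.2, 0.1), "dog", centroids)
keep = p > 0.5  # filter out potentially noisy samples
```

Thresholding the clean probability then implements the "filter out potentially noisy samples" step; the threshold and centroid construction here are assumptions of this sketch.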
arXiv Detail & Related papers (2021-08-03T12:15:25Z)
- Autoencoder-based cleaning in probabilistic databases [0.0]
We propose a data-cleaning autoencoder capable of near-automatic data quality improvement.
It learns the structure and dependencies in the data to identify and correct doubtful values.
arXiv Detail & Related papers (2021-06-17T18:46:56Z)
- tsrobprep -- an R package for robust preprocessing of time series data [0.0]
The open source package tsrobprep introduces efficient methods for handling missing values and outliers.
For data imputation, a probabilistic replacement model is proposed, which may consist of autoregressive components and external inputs.
For outlier detection, a clustering algorithm based on finite mixture modelling is introduced, which uses typical time-series properties as features.
arXiv Detail & Related papers (2021-04-26T15:35:11Z)
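The mixture-model outlier detection described above can be sketched with a tiny EM fit. This is an illustrative stand-in, not the tsrobprep implementation (which is in R and also uses time-series-specific features); component variances are fixed to 1 here purely to keep the sketch short:

```python
import math

def norm_pdf(x, mu):
    # Unit-variance Gaussian density; fixing the variance is a
    # simplification of this sketch, not of the tsrobprep package.
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def mixture_outliers(xs, iters=50, threshold=1e-4):
    """Flag points with low likelihood under a two-component Gaussian
    mixture fitted by EM (estimating means and mixing weight only)."""
    s = sorted(xs)
    mu1, mu2, w = s[len(s) // 4], s[3 * len(s) // 4], 0.5  # quartile init
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        r = []
        for x in xs:
            p1 = w * norm_pdf(x, mu1)
            p2 = (1 - w) * norm_pdf(x, mu2)
            r.append(p1 / (p1 + p2))
        # M-step: update means and mixing weight
        n1 = sum(r)
        if 0 < n1 < len(xs):
            mu1 = sum(ri * x for ri, x in zip(r, xs)) / n1
            mu2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / (len(xs) - n1)
            w = n1 / len(xs)
    density = [w * norm_pdf(x, mu1) + (1 - w) * norm_pdf(x, mu2) for x in xs]
    return [x for x, d in zip(xs, density) if d < threshold]

# Two dense clusters plus one clear outlier.
data = [-0.2, -0.1, 0.0, 0.1, 0.2, 4.8, 4.9, 5.0, 5.1, 5.2, 20.0]
flagged = mixture_outliers(data)
```

Points that no fitted component explains well receive negligible mixture density and are flagged; the threshold and the two-component choice are assumptions of this example.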
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.