PClean: Bayesian Data Cleaning at Scale with Domain-Specific
Probabilistic Programming
- URL: http://arxiv.org/abs/2007.11838v4
- Date: Tue, 27 Oct 2020 18:41:52 GMT
- Authors: Alexander K. Lew, Monica Agrawal, David Sontag, Vikash K. Mansinghka
- Abstract summary: We present PClean, a probabilistic programming language for leveraging dataset-specific knowledge to clean and normalize dirty data.
PClean is powered by three modeling and inference contributions: (1) a non-parametric model of relational database instances, customizable via probabilistic programs, (2) a sequential Monte Carlo inference algorithm that exploits the model's structure, and (3) near-optimal SMC proposals and blocked Gibbs rejuvenation moves constructed on a per-dataset basis.
- Score: 65.88506015656951
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data cleaning can be naturally framed as probabilistic inference in a
generative model, combining a prior distribution over ground-truth databases
with a likelihood that models the noisy channel by which the data are filtered
and corrupted to yield incomplete, dirty, and denormalized datasets. Based on
this view, we present PClean, a probabilistic programming language for
leveraging dataset-specific knowledge to clean and normalize dirty data. PClean
is powered by three modeling and inference contributions: (1) a non-parametric
model of relational database instances, customizable via probabilistic
programs, (2) a sequential Monte Carlo inference algorithm that exploits the
model's structure, and (3) near-optimal SMC proposals and blocked Gibbs
rejuvenation moves constructed on a per-dataset basis. We show empirically that
short (< 50-line) PClean programs can be faster and more accurate than generic
PPL inference on multiple data-cleaning benchmarks; perform comparably in terms
of accuracy and runtime to state-of-the-art data-cleaning systems (unlike
generic PPL inference given the same runtime); and scale to real-world datasets
with millions of records.
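The generative framing above (a prior over ground-truth values combined with a noisy-channel likelihood) can be illustrated with a minimal sketch. This is not PClean's actual syntax or inference algorithm (PClean uses SMC with per-dataset proposals over relational database instances); it is a toy exact posterior over a handful of hypothetical candidate clean values, with a string-similarity ratio standing in for a real corruption likelihood:

```python
import difflib

def posterior_clean_value(observed, prior):
    """Exact posterior P(clean | observed) over a finite set of candidates.

    `prior` maps candidate clean strings to prior probabilities; a string-
    similarity ratio stands in for a real noisy-channel likelihood.
    """
    scores = {}
    for clean, p in prior.items():
        likelihood = difflib.SequenceMatcher(None, observed, clean).ratio()
        scores[clean] = p * likelihood
    total = sum(scores.values())
    return {clean: s / total for clean, s in scores.items()}

# Hypothetical example: a city field observed with a transposition typo.
prior = {"New York": 0.5, "Newark": 0.3, "New Haven": 0.2}
post = posterior_clean_value("New Yrok", prior)
best = max(post, key=post.get)  # most probable ground-truth value
```

In a real system the prior would be a learned, non-parametric model over database instances and the likelihood a calibrated corruption model; here both are placeholders for illustration only.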
Related papers
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
- BClean: A Bayesian Data Cleaning System [17.525913626374503]
BClean is a Bayesian Cleaning system that features automatic Bayesian network construction and user interaction.
By evaluating on both real-world and synthetic datasets, we demonstrate that BClean is capable of achieving an F-measure of up to 0.9 in data cleaning.
arXiv Detail & Related papers (2023-11-11T09:22:07Z)
- On Calibrating Diffusion Probabilistic Models [78.75538484265292]
Diffusion probabilistic models (DPMs) have achieved promising results in diverse generative tasks.
We propose a simple way for calibrating an arbitrary pretrained DPM, with which the score matching loss can be reduced and the lower bounds of model likelihood can be increased.
Our calibration method is performed only once and the resulting models can be used repeatedly for sampling.
arXiv Detail & Related papers (2023-02-21T14:14:40Z)
- Knockoffs-SPR: Clean Sample Selection in Learning with Noisy Labels [56.81761908354718]
We propose a novel theoretically guaranteed clean sample selection framework for learning with noisy labels.
Knockoffs-SPR can be regarded as a sample selection module for a standard supervised training pipeline.
We further combine it with a semi-supervised algorithm to exploit the support of noisy data as unlabeled data.
arXiv Detail & Related papers (2023-01-02T07:13:28Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, can be learned from aggregated data alone by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- An epistemic approach to model uncertainty in data-graphs [2.1261712640167847]
Graph databases can suffer from errors and discrepancies with respect to the real-world data they intend to represent.
In this work, we explore the notion of probabilistic unclean graph databases, previously proposed for relational databases.
We define two computational problems, data cleaning and probabilistic query answering, and study the complexity of each.
arXiv Detail & Related papers (2021-09-29T00:08:27Z)
- Noise-Resistant Deep Metric Learning with Probabilistic Instance Filtering [59.286567680389766]
Noisy labels are commonly found in real-world data, which cause performance degradation of deep neural networks.
We propose the Probabilistic Ranking-based Instance Selection with Memory (PRISM) approach for deep metric learning (DML).
PRISM calculates the probability of a label being clean, and filters out potentially noisy samples.
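The clean-probability idea can be sketched in a simplified form. PRISM's actual memory-bank similarity and ranking machinery are replaced here by a softmax over distances to hypothetical class centroids, so this illustrates only the filtering principle, not PRISM itself:

```python
import math

def clean_probability(x, label, centroids):
    """Probability that `label` is clean for feature vector `x`: a softmax
    over negative squared distances to per-class centroids (a crude stand-in
    for PRISM's memory-bank similarity)."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    logits = {c: -sqdist(x, mu) for c, mu in centroids.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {c: math.exp(v - m) for c, v in logits.items()}
    return exps[label] / sum(exps.values())

# Hypothetical centroids; a sample near "cat" but labelled "dog" looks noisy.
centroids = {"cat": (0.0, 0.0), "dog": (4.0, 4.0)}
p = clean_probability((0.2, 0.1), "dog", centroids)
keep = p > 0.5  # filter out potentially noisy samples
```

Thresholding the clean probability then implements the "filter out potentially noisy samples" step; the threshold and centroid construction here are assumptions of this sketch.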
arXiv Detail & Related papers (2021-08-03T12:15:25Z)
- Autoencoder-based cleaning in probabilistic databases [0.0]
We propose a data-cleaning autoencoder capable of near-automatic data quality improvement.
It learns the structure and dependencies in the data to identify and correct doubtful values.
arXiv Detail & Related papers (2021-06-17T18:46:56Z)
- tsrobprep -- an R package for robust preprocessing of time series data [0.0]
The open source package tsrobprep introduces efficient methods for handling missing values and outliers.
For data imputation, a probabilistic replacement model is proposed, which may consist of autoregressive components and external inputs.
For outlier detection, a clustering algorithm based on finite mixture modelling is introduced, which uses typical time-series properties as features.
arXiv Detail & Related papers (2021-04-26T15:35:11Z)
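The mixture-model outlier detection described above can be sketched with a tiny EM fit. This is an illustrative stand-in, not the tsrobprep implementation (which is in R and also uses time-series-specific features); component variances are fixed to 1 here purely to keep the sketch short:

```python
import math

def norm_pdf(x, mu):
    # Unit-variance Gaussian density; fixing the variance is a
    # simplification of this sketch, not of the tsrobprep package.
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def mixture_outliers(xs, iters=50, threshold=1e-4):
    """Flag points with low likelihood under a two-component Gaussian
    mixture fitted by EM (estimating means and mixing weight only)."""
    s = sorted(xs)
    mu1, mu2, w = s[len(s) // 4], s[3 * len(s) // 4], 0.5  # quartile init
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        r = []
        for x in xs:
            p1 = w * norm_pdf(x, mu1)
            p2 = (1 - w) * norm_pdf(x, mu2)
            r.append(p1 / (p1 + p2))
        # M-step: update means and mixing weight
        n1 = sum(r)
        if 0 < n1 < len(xs):
            mu1 = sum(ri * x for ri, x in zip(r, xs)) / n1
            mu2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / (len(xs) - n1)
            w = n1 / len(xs)
    density = [w * norm_pdf(x, mu1) + (1 - w) * norm_pdf(x, mu2) for x in xs]
    return [x for x, d in zip(xs, density) if d < threshold]

# Two dense clusters plus one clear outlier.
data = [-0.2, -0.1, 0.0, 0.1, 0.2, 4.8, 4.9, 5.0, 5.1, 5.2, 20.0]
flagged = mixture_outliers(data)
```

Points that no fitted component explains well receive negligible mixture density and are flagged; the threshold and the two-component choice are assumptions of this example.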
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.