BClean: A Bayesian Data Cleaning System
- URL: http://arxiv.org/abs/2311.06517v1
- Date: Sat, 11 Nov 2023 09:22:07 GMT
- Title: BClean: A Bayesian Data Cleaning System
- Authors: Jianbin Qin, Sifan Huang, Yaoshu Wang, Jing Zhu, Yifan Zhang, Yukai
Miao, Rui Mao, Makoto Onizuka, Chuan Xiao
- Abstract summary: BClean is a Bayesian Cleaning system that features automatic Bayesian network construction and user interaction.
By evaluating on both real-world and synthetic datasets, we demonstrate that BClean is capable of achieving an F-measure of up to 0.9 in data cleaning.
- Score: 17.525913626374503
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is a considerable body of work on data cleaning which employs various
principles to rectify erroneous data and transform a dirty dataset into a
cleaner one. One of prevalent approaches is probabilistic methods, including
Bayesian methods. However, existing probabilistic methods often assume a
simplistic distribution (e.g., Gaussian distribution), which is frequently
underfitted in practice, or they necessitate experts to provide a complex prior
distribution (e.g., via a programming language). This requirement is both
labor-intensive and costly, rendering these methods less suitable for
real-world applications. In this paper, we propose BClean, a Bayesian Cleaning
system that features automatic Bayesian network construction and user
interaction. We recast the data cleaning problem as a Bayesian inference that
fully exploits the relationships between attributes in the observed dataset and
any prior information provided by users. To this end, we present an automatic
Bayesian network construction method that extends a structure learning-based
functional dependency discovery method with similarity functions to capture the
relationships between attributes. Furthermore, our system allows users to
modify the generated Bayesian network in order to specify prior information or
correct inaccuracies identified by the automatic generation process. We also
design an effective scoring model (called the compensative scoring model)
necessary for the Bayesian inference. To enhance the efficiency of data
cleaning, we propose several approximation strategies for the Bayesian
inference, including graph partitioning, domain pruning, and pre-detection. By
evaluating on both real-world and synthetic datasets, we demonstrate that
BClean is capable of achieving an F-measure of up to 0.9 in data cleaning,
outperforming existing Bayesian methods by 2% and other data cleaning methods
by 15%.
Related papers
- A Federated Distributionally Robust Support Vector Machine with Mixture of Wasserstein Balls Ambiguity Set for Distributed Fault Diagnosis [3.662364375995991]
We study the problem of training a distributionally robust (DR) support vector machine (SVM) in a federated fashion over a network comprised of a central server and $G$ clients without sharing data.
We propose two distributed optimization algorithms for training the global FDR-SVM.
arXiv Detail & Related papers (2024-10-04T19:21:45Z) - Improvement and generalization of ABCD method with Bayesian inference [36.136619420474766]
We focus on taking advantage of available information and pour our effort in re-thinking the usual data-driven ABCD method to improve it.
We show how, in contrast to the ABCD method, one can take advantage of understanding some properties of the different backgrounds.
We show how, in this simplified model, the Bayesian framework outperforms the ABCD method sensitivity in obtaining the signal fraction.
arXiv Detail & Related papers (2024-02-12T19:05:27Z) - Gaussian Mixture Models for Affordance Learning using Bayesian Networks [50.18477618198277]
Affordances are fundamental descriptors of relationships between actions, objects and effects.
This paper approaches the problem of an embodied agent exploring the world and learning these affordances autonomously from its sensory experiences.
arXiv Detail & Related papers (2024-02-08T22:05:45Z) - VertiBayes: Learning Bayesian network parameters from vertically partitioned data with missing values [2.9707233220536313]
Federated learning makes it possible to train a machine learning model on decentralized data.
We propose a novel method called VertiBayes to train Bayesian networks on vertically partitioned data.
We experimentally show our approach produces models comparable to those learnt using traditional algorithms.
arXiv Detail & Related papers (2022-10-31T11:13:35Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples.
arXiv Detail & Related papers (2022-01-11T23:01:12Z) - Transformers Can Do Bayesian Inference [56.99390658880008]
We present Prior-Data Fitted Networks (PFNs)
PFNs leverage in-context learning in large-scale machine learning techniques to approximate a large set of posteriors.
We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems.
arXiv Detail & Related papers (2021-12-20T13:07:39Z) - Evaluating State-of-the-Art Classification Models Against Bayes
Optimality [106.50867011164584]
We show that we can compute the exact Bayes error of generative models learned using normalizing flows.
We use our approach to conduct a thorough investigation of state-of-the-art classification models.
arXiv Detail & Related papers (2021-06-07T06:21:20Z) - Probabilistic Case-based Reasoning for Open-World Knowledge Graph
Completion [59.549664231655726]
A case-based reasoning (CBR) system solves a new problem by retrieving cases' that are similar to the given problem.
In this paper, we demonstrate that such a system is achievable for reasoning in knowledge-bases (KBs)
Our approach predicts attributes for an entity by gathering reasoning paths from similar entities in the KB.
arXiv Detail & Related papers (2020-10-07T17:48:12Z) - PClean: Bayesian Data Cleaning at Scale with Domain-Specific
Probabilistic Programming [65.88506015656951]
We present PClean, a probabilistic programming language for leveraging dataset-specific knowledge to clean and normalize dirty data.
PClean is powered by three modeling and inference contributions: (1) a non-parametric model of relational database instances, customizable via probabilistic programs, (2) a sequential Monte Carlo inference algorithm that exploits the model's structure, and (3) near-optimal SMC proposals and blocked Gibbs rejuvenation moves constructed on a per-dataset basis.
arXiv Detail & Related papers (2020-07-23T08:01:47Z) - A Tutorial on Learning With Bayesian Networks [8.98526174345299]
A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest.
A Bayesian network can be used to learn causal relationships.
It can also be used to gain understanding about a problem domain and to predict the consequences of intervention.
arXiv Detail & Related papers (2020-02-01T20:03:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.