An epistemic approach to model uncertainty in data-graphs
- URL: http://arxiv.org/abs/2109.14112v1
- Date: Wed, 29 Sep 2021 00:08:27 GMT
- Title: An epistemic approach to model uncertainty in data-graphs
- Authors: Sergio Abriola, Santiago Cifuentes, María Vanina Martínez, Nina Pardal, Edwin Pin
- Abstract summary: Graph databases can suffer from errors and discrepancies with respect to real-world data they intend to represent.
In this work we explore the notion of probabilistic unclean graph databases, previously proposed for relational databases.
We define two computational problems, data cleaning and probabilistic query answering, and study the complexity of each.
- Score: 2.1261712640167847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Graph databases are becoming widely successful as data models that allow us to effectively represent and process complex relationships among various types of data. As with any other type of data repository, graph databases may suffer from errors and discrepancies with respect to the real-world data they intend to represent. In this work we explore the notion of probabilistic unclean graph databases, previously proposed for relational databases, in order to capture the idea that the observed (unclean) graph database is actually a noisy version of a clean one that correctly models the world but of which we have only partial knowledge. Since the factors involved in the observation process can be many, e.g., different types of clerical errors or unintended transformations of the data, we assume a probabilistic model that describes the distribution over all possible ways in which the clean (uncertain) database could have been polluted. Based on this model we define two computational problems, data cleaning and probabilistic query answering, and study the complexity of each when the transformation of the database consists of either removing (subset) or adding (superset) nodes and edges.
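To make the setup concrete, the following is a minimal, hypothetical Python sketch of the subset (edge-removal) variant. The node set, the uniform prior, and the independent edge-drop channel are illustrative assumptions, not the paper's formal definitions; the sketch simply brute-forces the posterior over clean graphs given an observed one and then answers the two computational problems, data cleaning as MAP selection and probabilistic query answering as marginalization.

```python
from itertools import chain, combinations

# Hypothetical toy model (illustrative assumptions, not the paper's formal
# definitions): a graph is a set of directed edges over a fixed node set.
NODES = ["a", "b", "c"]
ALL_EDGES = [(u, v) for u in NODES for v in NODES if u != v]

def powerset(items):
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

# Prior over clean graphs: uniform over all edge subsets (an assumption).
CLEAN_GRAPHS = [frozenset(g) for g in powerset(ALL_EDGES)]
PRIOR = {g: 1.0 / len(CLEAN_GRAPHS) for g in CLEAN_GRAPHS}

def noise_likelihood(observed, clean, p_drop=0.2):
    """Subset channel: each clean edge survives independently with
    probability 1 - p_drop; no edges are added, so the observed graph
    must be a subset of the clean one."""
    if not observed <= clean:
        return 0.0
    kept = len(observed)
    dropped = len(clean) - kept
    return (1 - p_drop) ** kept * p_drop ** dropped

def posterior(observed):
    """Brute-force posterior over clean graphs given the observed one."""
    weights = {g: PRIOR[g] * noise_likelihood(observed, g) for g in CLEAN_GRAPHS}
    z = sum(weights.values())
    return {g: w / z for g, w in weights.items() if w > 0}

observed = frozenset({("a", "b")})
post = posterior(observed)

# Problem 1, data cleaning: the most probable clean graph (MAP estimate).
map_clean = max(post, key=post.get)
print("MAP clean graph:", sorted(map_clean))

# Problem 2, probabilistic query answering: the probability that the clean
# graph satisfies a query, e.g. "there is a path a -> b -> c".
query = lambda g: ("a", "b") in g and ("b", "c") in g
print("P(query | observed):", sum(p for g, p in post.items() if query(g)))
```

The superset variant would invert the channel, with the observed graph containing spuriously added nodes and edges on top of the clean ones. The paper studies the complexity of both cases exactly; this sketch just enumerates, which is only feasible for toy instances.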
Related papers
- Estimating Causal Effects from Learned Causal Networks [56.14597641617531]
We propose an alternative paradigm for answering causal-effect queries over discrete observable variables.
We learn the causal Bayesian network and its confounding latent variables directly from the observational data.
We show that this "model completion" learning approach can be more effective than estimand approaches.
arXiv Detail & Related papers (2024-08-26T08:39:09Z) - Approximating Counterfactual Bounds while Fusing Observational, Biased
and Randomised Data Sources [64.96984404868411]
We address the problem of integrating data from multiple, possibly biased, observational and interventional studies.
We show that the likelihood of the available data has no local maxima.
We then show how the same approach can address the general case of multiple datasets.
arXiv Detail & Related papers (2023-07-31T11:28:24Z) - MissDiff: Training Diffusion Models on Tabular Data with Missing Values [29.894691645801597]
This work presents a unified and principled diffusion-based framework for learning from data with missing values.
We first observe that the widely adopted "impute-then-generate" pipeline may lead to a biased learning objective.
We prove the proposed method is consistent in learning the score of data distributions, and the proposed training objective serves as an upper bound for the negative likelihood in certain cases.
arXiv Detail & Related papers (2023-07-02T03:49:47Z) - Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z) - Autoencoder-based cleaning in probabilistic databases [0.0]
We propose a data-cleaning autoencoder capable of near-automatic data quality improvement.
It learns the structure and dependencies in the data to identify and correct doubtful values.
arXiv Detail & Related papers (2021-06-17T18:46:56Z) - Evaluating State-of-the-Art Classification Models Against Bayes
Optimality [106.50867011164584]
We show that we can compute the exact Bayes error of generative models learned using normalizing flows.
We use our approach to conduct a thorough investigation of state-of-the-art classification models.
arXiv Detail & Related papers (2021-06-07T06:21:20Z) - Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles [66.15398165275926]
We propose a method that can automatically detect and ignore dataset-specific patterns, which we call dataset biases.
Our method trains a lower capacity model in an ensemble with a higher capacity model.
We show improvement in all settings, including a 10-point gain on the visual question answering dataset.
arXiv Detail & Related papers (2020-11-07T22:20:03Z) - PClean: Bayesian Data Cleaning at Scale with Domain-Specific
Probabilistic Programming [65.88506015656951]
We present PClean, a probabilistic programming language for leveraging dataset-specific knowledge to clean and normalize dirty data.
PClean is powered by three modeling and inference contributions: (1) a non-parametric model of relational database instances, customizable via probabilistic programs, (2) a sequential Monte Carlo inference algorithm that exploits the model's structure, and (3) near-optimal SMC proposals and blocked Gibbs rejuvenation moves constructed on a per-dataset basis.
arXiv Detail & Related papers (2020-07-23T08:01:47Z) - Learning Over Dirty Data Without Cleaning [12.892359722606681]
Real-world datasets are dirty and contain many errors.
Learning over dirty databases may result in inaccurate models.
We propose DLearn, a novel relational learning system.
arXiv Detail & Related papers (2020-04-05T20:21:13Z) - Symbolic Querying of Vector Spaces: Probabilistic Databases Meets
Relational Embeddings [35.877591735510734]
We formalize a probabilistic database model with respect to which all queries are done.
The lack of a well-defined joint probability distribution causes simple query problems to become provably hard.
We introduce TractOR, a relational embedding model designed to be a tractable probabilistic database.
arXiv Detail & Related papers (2020-02-24T01:17:25Z)