Learning Over Dirty Data Without Cleaning
- URL: http://arxiv.org/abs/2004.02308v1
- Date: Sun, 5 Apr 2020 20:21:13 GMT
- Title: Learning Over Dirty Data Without Cleaning
- Authors: Jose Picado, John Davis, Arash Termehchy, Ga Young Lee
- Abstract summary: Real-world datasets are dirty and contain many errors.
Learning over dirty databases may result in inaccurate models.
We propose DLearn, a novel relational learning system.
- Score: 12.892359722606681
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-world datasets are dirty and contain many errors. Examples of these
issues are violations of integrity constraints, duplicates, and inconsistencies
in representing data values and entities. Learning over dirty databases may
result in inaccurate models. Users have to spend a great deal of time and
effort to repair data errors and create a clean database for learning.
Moreover, as the information required to repair these errors is not often
available, there may be numerous possible clean versions for a dirty database.
We propose DLearn, a novel relational learning system that learns directly over
dirty databases effectively and efficiently without any preprocessing. DLearn
leverages database constraints to learn accurate relational models over
inconsistent and heterogeneous data. Its learned models represent patterns over
all possible clean instances of the data in a usable form. Our empirical study
indicates that DLearn learns accurate models over large real-world databases
efficiently.
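
To make the "patterns over all possible clean instances" idea concrete, here is a minimal, hedged Python sketch: it enumerates candidate repairs of a tiny inconsistent relation and accepts a toy pattern only if it holds in every repair. The data, the repair model, and names such as `candidate_repairs` and `rule_holds` are made up for illustration; this is not DLearn's actual algorithm, which pushes database constraints into relational learning rather than enumerating repairs.

```python
# Toy illustration (not DLearn itself): a dirty table where one advisor,
# "prof_smith", also appears under the misspelling "prof_smyth".  Each way
# of resolving the inconsistency yields one possible clean instance
# ("repair"); a learned pattern is trusted only if it holds in every repair.
from itertools import product

# advisedBy(student, advisor) with an inconsistent advisor name for alice.
dirty_advised_by = [
    ("alice", "prof_smith"),
    ("alice", "prof_smyth"),   # likely the same person as prof_smith
    ("bob", "prof_jones"),
]

# publication(student, advisor) used as the body of the toy pattern.
publication = [
    ("alice", "prof_smith"),
    ("bob", "prof_jones"),
]

def candidate_repairs(table):
    """Enumerate clean instances: keep exactly one of alice's two
    conflicting advisor facts (a hypothetical repair model chosen only
    for this sketch)."""
    conflicting = [t for t in table if t[0] == "alice"]
    rest = [t for t in table if t[0] != "alice"]
    for choice in conflicting:
        yield rest + [choice]

def rule_holds(repair):
    """Toy pattern: every advisedBy(s, a) fact is supported by a
    publication(s, a) fact."""
    return all((s, a) in publication for (s, a) in repair)

# Accept the pattern only if it holds over *all* possible clean instances.
holds_everywhere = all(rule_holds(r) for r in candidate_repairs(dirty_advised_by))
print("pattern holds over every repair:", holds_everywhere)
```

In this toy case the check prints `False`, because the repair that keeps the misspelled advisor has no supporting publication fact; this is the kind of disagreement across clean versions that learning over a single arbitrary repair would silently ignore.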
Related papers
- Certain and Approximately Certain Models for Statistical Learning [4.318959672085627]
We show that it is possible to learn accurate models directly from data with missing values for certain training data and target models.
We build efficient algorithms with theoretical guarantees to check this necessity and return accurate models in cases where imputation is unnecessary.
arXiv Detail & Related papers (2024-02-27T22:49:33Z)
- Identifying and Mitigating Model Failures through Few-shot CLIP-aided Diffusion Generation [65.268245109828]
We propose an end-to-end framework to generate text descriptions of failure modes associated with spurious correlations.
These descriptions can be used to generate synthetic data using generative models, such as diffusion models.
Our experiments have shown remarkable improvements in accuracy (~21%) on hard sub-populations.
arXiv Detail & Related papers (2023-12-09T04:43:49Z)
- Relational Deep Learning: Graph Representation Learning on Relational Databases [69.7008152388055]
We introduce an end-to-end representation approach to learn on data laid out across multiple tables.
Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all data input.
arXiv Detail & Related papers (2023-12-07T18:51:41Z)
- Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection [56.292071534857946]
Recent data-privacy laws have sparked interest in machine unlearning.
The challenge is to discard information about the "forget" data without altering knowledge about the remaining dataset.
We adopt a projected-gradient-based learning method, named Projected-Gradient Unlearning (PGU).
We provide empirical evidence that our unlearning method can produce models that behave similarly to models retrained from scratch across various metrics, even when the training dataset is no longer accessible.
arXiv Detail & Related papers (2023-12-07T07:17:24Z)
- AI Model Disgorgement: Methods and Choices [127.54319351058167]
We introduce a taxonomy of possible disgorgement methods that are applicable to modern machine learning systems.
We investigate the meaning of "removing the effects" of data in the trained model in a way that does not require retraining from scratch.
arXiv Detail & Related papers (2023-04-07T08:50:18Z)
- Repairing Systematic Outliers by Learning Clean Subspaces in VAEs [31.298063226774115]
We propose the Clean Subspace Variational Autoencoder (CLSVAE), a novel semi-supervised model for detection and automated repair of systematic errors.
CLSVAE is effective with much less labelled data than previous models, often with less than 2% of the data.
We provide experiments using three image datasets in scenarios with different levels of corruption and labelled set sizes.
arXiv Detail & Related papers (2022-07-17T01:28:23Z)
- An epistemic approach to model uncertainty in data-graphs [2.1261712640167847]
Graph databases can suffer from errors and discrepancies with respect to real-world data they intend to represent.
In this work we explore the notion of probabilistic unclean graph databases, previously proposed for relational databases.
We define two computational problems, data cleaning and probabilistic query answering, and study the complexity of each.
arXiv Detail & Related papers (2021-09-29T00:08:27Z)
- SSSE: Efficiently Erasing Samples from Trained Machine Learning Models [103.43466657962242]
We propose an efficient and effective algorithm, SSSE, for sample erasure.
In certain cases SSSE can erase samples almost as well as the optimal, yet impractical, gold standard of training a new model from scratch with only the permitted data.
arXiv Detail & Related papers (2021-07-08T14:17:24Z)
- On the Pitfalls of Learning with Limited Data: A Facial Expression Recognition Case Study [0.5249805590164901]
We focus on the problem of Facial Expression Recognition from videos.
We performed an extensive study with four databases of different complexity and nine deep-learning architectures for video classification.
We found that complex training sets translate better to more stable test sets when trained with transfer learning and synthetically generated data.
arXiv Detail & Related papers (2021-04-02T18:53:41Z)
- Self-Updating Models with Error Remediation [0.5156484100374059]
We propose a framework, Self-Updating Models with Error Remediation (SUMER), in which a deployed model updates itself as new data becomes available.
A key component of SUMER is the notion of error remediation, as self-labeled data can be susceptible to the propagation of errors.
We find that self-updating models (SUMs) generally perform better than models that do not attempt to self-update when presented with additional previously-unseen data.
arXiv Detail & Related papers (2020-05-19T23:09:38Z)
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, that may be an imbalanced subset of the original training dataset, or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.