Learning Over Dirty Data Without Cleaning
- URL: http://arxiv.org/abs/2004.02308v1
- Date: Sun, 5 Apr 2020 20:21:13 GMT
- Title: Learning Over Dirty Data Without Cleaning
- Authors: Jose Picado, John Davis, Arash Termehchy, Ga Young Lee
- Abstract summary: Real-world datasets are dirty and contain many errors.
Learning over dirty databases may result in inaccurate models.
We propose DLearn, a novel relational learning system.
- Score: 12.892359722606681
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-world datasets are dirty and contain many errors. Examples of these
issues are violations of integrity constraints, duplicates, and inconsistencies
in representing data values and entities. Learning over dirty databases may
result in inaccurate models. Users have to spend a great deal of time and
effort to repair data errors and create a clean database for learning.
Moreover, as the information required to repair these errors is not often
available, there may be numerous possible clean versions for a dirty database.
We propose DLearn, a novel relational learning system that learns directly over
dirty databases effectively and efficiently without any preprocessing. DLearn
leverages database constraints to learn accurate relational models over
inconsistent and heterogeneous data. Its learned models represent patterns over
all possible clean instances of the data in a usable form. Our empirical study
indicates that DLearn learns accurate models over large real-world databases
efficiently.
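
To make the "patterns over all possible clean instances" idea concrete, here is a minimal, hedged Python sketch: it enumerates candidate repairs of a tiny inconsistent relation and accepts a toy pattern only if it holds in every repair. The data, the repair model, and names such as `candidate_repairs` and `rule_holds` are made up for illustration; this is not DLearn's actual algorithm, which pushes database constraints into relational learning rather than enumerating repairs.

```python
# Toy illustration (not DLearn itself): a dirty table where one advisor,
# "prof_smith", also appears under the misspelling "prof_smyth".  Each way
# of resolving the inconsistency yields one possible clean instance
# ("repair"); a learned pattern is trusted only if it holds in every repair.
from itertools import product

# advisedBy(student, advisor) with an inconsistent advisor name for alice.
dirty_advised_by = [
    ("alice", "prof_smith"),
    ("alice", "prof_smyth"),   # likely the same person as prof_smith
    ("bob", "prof_jones"),
]

# publication(student, advisor) used as the body of the toy pattern.
publication = [
    ("alice", "prof_smith"),
    ("bob", "prof_jones"),
]

def candidate_repairs(table):
    """Enumerate clean instances: keep exactly one of alice's two
    conflicting advisor facts (a hypothetical repair model chosen only
    for this sketch)."""
    conflicting = [t for t in table if t[0] == "alice"]
    rest = [t for t in table if t[0] != "alice"]
    for choice in conflicting:
        yield rest + [choice]

def rule_holds(repair):
    """Toy pattern: every advisedBy(s, a) fact is supported by a
    publication(s, a) fact."""
    return all((s, a) in publication for (s, a) in repair)

# Accept the pattern only if it holds over *all* possible clean instances.
holds_everywhere = all(rule_holds(r) for r in candidate_repairs(dirty_advised_by))
print("pattern holds over every repair:", holds_everywhere)
```

In this toy case the check prints `False`, because the repair that keeps the misspelled advisor has no supporting publication fact; this is the kind of disagreement across clean versions that learning over a single arbitrary repair would silently ignore.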
Related papers
- Certain and Approximately Certain Models for Statistical Learning [4.318959672085627]
We show that it is possible to learn accurate models directly from data with missing values for certain training data and target models.
We build efficient algorithms with theoretical guarantees to check this necessity and return accurate models in cases where imputation is unnecessary.
arXiv Detail & Related papers (2024-02-27T22:49:33Z)
- Identifying and Mitigating Model Failures through Few-shot CLIP-aided Diffusion Generation [65.268245109828]
We propose an end-to-end framework to generate text descriptions of failure modes associated with spurious correlations.
These descriptions can be used to generate synthetic data using generative models, such as diffusion models.
Our experiments have shown remarkable improvements in accuracy (~21%) on hard sub-populations.
arXiv Detail & Related papers (2023-12-09T04:43:49Z)
- Relational Deep Learning: Graph Representation Learning on Relational Databases [69.7008152388055]
We introduce an end-to-end representation approach to learn on data laid out across multiple tables.
Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all data input.
arXiv Detail & Related papers (2023-12-07T18:51:41Z)
- Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection [56.292071534857946]
Recent data-privacy laws have sparked interest in machine unlearning.
The challenge is to discard information about the "forget" data without altering knowledge about the remaining dataset.
We adopt a projected-gradient-based learning method, named Projected-Gradient Unlearning (PGU).
We provide empirical evidence that our unlearning method can produce models that behave similarly to models retrained from scratch across various metrics, even when the training dataset is no longer accessible.
arXiv Detail & Related papers (2023-12-07T07:17:24Z)
- AI Model Disgorgement: Methods and Choices [127.54319351058167]
We introduce a taxonomy of possible disgorgement methods that are applicable to modern machine learning systems.
We investigate the meaning of "removing the effects" of data in the trained model in a way that does not require retraining from scratch.
arXiv Detail & Related papers (2023-04-07T08:50:18Z)
- Repairing Systematic Outliers by Learning Clean Subspaces in VAEs [31.298063226774115]
We propose the Clean Subspace Variational Autoencoder (CLSVAE), a novel semi-supervised model for detection and automated repair of systematic errors.
CLSVAE is effective with much less labelled data than previous models, often with less than 2% of the data.
We provide experiments using three image datasets in scenarios with different levels of corruption and labelled set sizes.
arXiv Detail & Related papers (2022-07-17T01:28:23Z)
- An epistemic approach to model uncertainty in data-graphs [2.1261712640167847]
Graph databases can suffer from errors and discrepancies with respect to real-world data they intend to represent.
In this work we explore the notion of probabilistic unclean graph databases, previously proposed for relational databases.
We define two computational problems, data cleaning and probabilistic query answering, and study the complexity of each.
arXiv Detail & Related papers (2021-09-29T00:08:27Z)
- SSSE: Efficiently Erasing Samples from Trained Machine Learning Models [103.43466657962242]
We propose an efficient and effective algorithm, SSSE, for sample erasure.
In certain cases SSSE can erase samples almost as well as the optimal, yet impractical, gold standard of training a new model from scratch with only the permitted data.
arXiv Detail & Related papers (2021-07-08T14:17:24Z)
- On the Pitfalls of Learning with Limited Data: A Facial Expression Recognition Case Study [0.5249805590164901]
We focus on the problem of Facial Expression Recognition from videos.
We performed an extensive study with four databases of different complexity and nine deep-learning architectures for video classification.
We found that complex training sets translate better to more stable test sets when trained with transfer learning and synthetically generated data.
arXiv Detail & Related papers (2021-04-02T18:53:41Z)
- Self-Updating Models with Error Remediation [0.5156484100374059]
We propose a framework, Self-Updating Models with Error Remediation (SUMER), in which a deployed model updates itself as new data becomes available.
A key component of SUMER is the notion of error remediation, as self-labeled data can be susceptible to the propagation of errors.
We find that self-updating models (SUMs) generally perform better than models that do not attempt to self-update when presented with additional previously-unseen data.
arXiv Detail & Related papers (2020-05-19T23:09:38Z)
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, that may be an imbalanced subset of the original training dataset, or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.