REIN: A Comprehensive Benchmark Framework for Data Cleaning Methods in
ML Pipelines
- URL: http://arxiv.org/abs/2302.04702v1
- Date: Thu, 9 Feb 2023 15:37:39 GMT
- Title: REIN: A Comprehensive Benchmark Framework for Data Cleaning Methods in
ML Pipelines
- Authors: Mohamed Abdelaal, Christian Hammacher, Harald Schoening
- Abstract summary: We introduce a benchmark, called REIN, to thoroughly investigate the impact of data cleaning methods on various machine learning models.
Through the benchmark, we provide answers to important research questions, e.g., where and whether data cleaning is a necessary step in ML pipelines.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nowadays, machine learning (ML) plays a vital role in many aspects of our
daily life. In essence, building well-performing ML applications requires the
provision of high-quality data throughout the entire life-cycle of such
applications. Nevertheless, most real-world tabular data suffers from
different types of discrepancies, such as missing values, outliers, duplicates,
pattern violations, and inconsistencies. Such discrepancies typically emerge
while collecting, transferring, storing, and/or integrating the data. To deal
with these discrepancies, numerous data cleaning methods have been introduced.
However, the majority of such methods broadly overlook the requirements imposed
by downstream ML models. As a result, the potential of utilizing these data
cleaning methods in ML pipelines remains largely unexplored. In this work, we
introduce a comprehensive benchmark, called REIN, to thoroughly investigate
the impact of data cleaning methods on various ML models. Through the
benchmark, we provide answers to important research questions, e.g., where and
whether data cleaning is a necessary step in ML pipelines. To this end, the
benchmark examines 38 simple and advanced error detection and repair methods.
To evaluate these methods, we utilized a wide collection of ML models trained
on 14 publicly available datasets covering different domains and encompassing
realistic as well as synthetic error profiles.
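
The evaluation protocol described in the abstract (train downstream models on dirty data, on data repaired by each cleaning method, and on ground-truth clean data, then compare their test scores) can be illustrated with a minimal sketch. The example below uses pandas and scikit-learn; the evaluate/benchmark helpers and the cleaner interface are illustrative assumptions rather than REIN's actual API.

```python
# Minimal sketch of a cleaning-benchmark loop in the spirit of the abstract:
# train the same downstream model on (a) dirty data, (b) data repaired by a
# cleaning method, and (c) ground-truth clean data, then compare test accuracy.
# The cleaner interface and helper names are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def evaluate(df: pd.DataFrame, label: str) -> float:
    """Train a simple downstream model and return its held-out accuracy."""
    df = df.dropna(subset=[label])
    X = pd.get_dummies(df.drop(columns=[label])).fillna(0)
    y = df[label]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


def benchmark(dirty: pd.DataFrame, clean: pd.DataFrame, label: str, cleaners: dict) -> dict:
    """Score each cleaning method against the dirty and clean baselines."""
    scores = {"dirty_baseline": evaluate(dirty, label),
              "clean_baseline": evaluate(clean, label)}
    for name, repair in cleaners.items():
        scores[name] = evaluate(repair(dirty.copy()), label)
    return scores


# Example cleaner: impute missing numeric values with the column mean.
mean_impute = lambda df: df.fillna(df.mean(numeric_only=True))
# scores = benchmark(dirty_df, clean_df, "target", {"mean_impute": mean_impute})
```

Under this setup, a cleaning method helps the downstream model exactly when its score moves from the dirty baseline toward the clean baseline.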
Related papers
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z) - Can LLMs Separate Instructions From Data? And What Do We Even Mean By That? [60.50127555651554]
Large Language Models (LLMs) show impressive results in numerous practical applications, but they lack essential safety features.
This makes them vulnerable to manipulations such as indirect prompt injections and generally unsuitable for safety-critical tasks.
We introduce a formal measure for instruction-data separation and an empirical variant that is calculable from a model's outputs.
arXiv Detail & Related papers (2024-03-11T15:48:56Z) - Enhancing Consistency and Mitigating Bias: A Data Replay Approach for
Incremental Learning [100.7407460674153]
Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks.
To mitigate the problem, a line of methods proposes to replay the data of previously experienced tasks when learning new tasks.
However, storing and replaying such data is often impractical due to memory constraints or data privacy concerns.
As a replacement, data-free data replay methods are proposed by inverting samples from the classification model.
arXiv Detail & Related papers (2024-01-12T12:51:12Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z) - DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning
over Tabular Data [12.416345241511781]
We propose DiffPrep to automatically and efficiently search for a data preprocessing pipeline for a given dataset.
Our experiments show that DiffPrep achieves the best test accuracy on 15 out of the 18 real-world datasets evaluated.
arXiv Detail & Related papers (2023-08-20T23:40:26Z) - VeML: An End-to-End Machine Learning Lifecycle for Large-scale and
High-dimensional Data [0.0]
This paper introduces VeML, a version management system dedicated to the end-to-end machine learning lifecycle.
We address the high cost of building an ML lifecycle, especially for large-scale and high-dimensional datasets.
We design an algorithm based on the core set to compute similarity for large-scale, high-dimensional data efficiently.
arXiv Detail & Related papers (2023-04-25T07:32:16Z) - Privacy Adhering Machine Un-learning in NLP [66.17039929803933]
Real-world industry applications use Machine Learning to build models on user data.
Privacy regulations mandate the removal of user data upon request, which requires effort both in terms of data removal and model retraining.
Continuous removal of data followed by model retraining does not scale.
We propose Machine Unlearning to tackle this challenge.
arXiv Detail & Related papers (2022-12-19T16:06:45Z) - Data Debugging with Shapley Importance over End-to-End Machine Learning
Pipelines [27.461398584509755]
DataScope is the first system that efficiently computes Shapley values of training examples over an end-to-end machine learning pipeline.
Our results show that DataScope is up to four orders of magnitude faster than state-of-the-art Monte Carlo-based methods.
arXiv Detail & Related papers (2022-04-23T19:29:23Z) - Machine Learning Model Drift Detection Via Weak Data Slices [5.319802998033767]
We propose a method that utilizes feature space rules, called data slices, for drift detection.
We provide experimental indications that our method can identify likely changes in ML model performance based on changes in the underlying data.
arXiv Detail & Related papers (2021-08-11T16:55:34Z) - Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns
Inferred from Data Lakes [16.392844962056742]
We develop a corpus-driven approach to auto-validate machine-generated data by inferring suitable data-validation "patterns".
Part of this technology ships as an Auto-Tag feature in Microsoft Azure Purview.
arXiv Detail & Related papers (2021-04-10T01:15:48Z) - PClean: Bayesian Data Cleaning at Scale with Domain-Specific
Probabilistic Programming [65.88506015656951]
We present PClean, a probabilistic programming language for leveraging dataset-specific knowledge to clean and normalize dirty data.
PClean is powered by three modeling and inference contributions: (1) a non-parametric model of relational database instances, customizable via probabilistic programs, (2) a sequential Monte Carlo inference algorithm that exploits the model's structure, and (3) near-optimal SMC proposals and blocked Gibbs rejuvenation moves constructed on a per-dataset basis.
arXiv Detail & Related papers (2020-07-23T08:01:47Z)