REIN: A Comprehensive Benchmark Framework for Data Cleaning Methods in
ML Pipelines
- URL: http://arxiv.org/abs/2302.04702v1
- Date: Thu, 9 Feb 2023 15:37:39 GMT
- Title: REIN: A Comprehensive Benchmark Framework for Data Cleaning Methods in
ML Pipelines
- Authors: Mohamed Abdelaal, Christian Hammacher, Harald Schoening
- Abstract summary: We introduce a benchmark, called REIN, to thoroughly investigate the impact of data cleaning methods on various machine learning models.
Through the benchmark, we provide answers to important research questions, e.g., where and whether data cleaning is a necessary step in ML pipelines.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nowadays, machine learning (ML) plays a vital role in many aspects of our
daily life. In essence, building well-performing ML applications requires the
provision of high-quality data throughout the entire life-cycle of such
applications. Nevertheless, most real-world tabular data suffers from
different types of discrepancies, such as missing values, outliers, duplicates,
pattern violations, and inconsistencies. Such discrepancies typically emerge
while collecting, transferring, storing, and/or integrating the data. To deal
with these discrepancies, numerous data cleaning methods have been introduced.
However, the majority of such methods broadly overlook the requirements imposed
by downstream ML models. As a result, the potential of utilizing these data
cleaning methods in ML pipelines remains largely unexplored. In this work, we
introduce a comprehensive benchmark, called REIN, to thoroughly investigate
the impact of data cleaning methods on various ML models. Through the
benchmark, we provide answers to important research questions, e.g., where and
whether data cleaning is a necessary step in ML pipelines. To this end, the
benchmark examines 38 simple and advanced error detection and repair methods.
To evaluate these methods, we utilized a wide collection of ML models trained
on 14 publicly available datasets covering different domains and encompassing
realistic as well as synthetic error profiles.
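
The evaluation protocol described in the abstract (train downstream models on dirty data, on data repaired by each cleaning method, and on ground-truth clean data, then compare their test scores) can be illustrated with a minimal sketch. The example below uses pandas and scikit-learn; the evaluate/benchmark helpers and the cleaner interface are illustrative assumptions rather than REIN's actual API.

```python
# Minimal sketch of a cleaning-benchmark loop in the spirit of the abstract:
# train the same downstream model on (a) dirty data, (b) data repaired by a
# cleaning method, and (c) ground-truth clean data, then compare test accuracy.
# The cleaner interface and helper names are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def evaluate(df: pd.DataFrame, label: str) -> float:
    """Train a simple downstream model and return its held-out accuracy."""
    df = df.dropna(subset=[label])
    X = pd.get_dummies(df.drop(columns=[label])).fillna(0)
    y = df[label]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


def benchmark(dirty: pd.DataFrame, clean: pd.DataFrame, label: str, cleaners: dict) -> dict:
    """Score each cleaning method against the dirty and clean baselines."""
    scores = {"dirty_baseline": evaluate(dirty, label),
              "clean_baseline": evaluate(clean, label)}
    for name, repair in cleaners.items():
        scores[name] = evaluate(repair(dirty.copy()), label)
    return scores


# Example cleaner: impute missing numeric values with the column mean.
mean_impute = lambda df: df.fillna(df.mean(numeric_only=True))
# scores = benchmark(dirty_df, clean_df, "target", {"mean_impute": mean_impute})
```

Under this setup, a cleaning method helps the downstream model exactly when its score moves from the dirty baseline toward the clean baseline.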
Related papers
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z) - Can LLMs Separate Instructions From Data? And What Do We Even Mean By That? [60.50127555651554]
Large Language Models (LLMs) show impressive results in numerous practical applications, but they lack essential safety features.
This makes them vulnerable to manipulations such as indirect prompt injections and generally unsuitable for safety-critical tasks.
We introduce a formal measure for instruction-data separation and an empirical variant that is calculable from a model's outputs.
arXiv Detail & Related papers (2024-03-11T15:48:56Z) - Enhancing Consistency and Mitigating Bias: A Data Replay Approach for
Incremental Learning [100.7407460674153]
Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks.
To mitigate the problem, a line of methods proposes to replay the data of previously experienced tasks when learning new tasks.
However, storing and replaying such data is often impractical due to memory constraints or data privacy concerns.
As a replacement, data-free data replay methods are proposed by inverting samples from the classification model.
arXiv Detail & Related papers (2024-01-12T12:51:12Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z) - DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning
over Tabular Data [12.416345241511781]
We propose DiffPrep to automatically and efficiently search for a data preprocessing pipeline for a given dataset.
Our experiments show that DiffPrep achieves the best test accuracy on 15 out of the 18 real-world datasets evaluated.
arXiv Detail & Related papers (2023-08-20T23:40:26Z) - VeML: An End-to-End Machine Learning Lifecycle for Large-scale and
High-dimensional Data [0.0]
This paper introduces VeML, a version management system dedicated to the end-to-end machine learning lifecycle.
We address the high cost of building an ML lifecycle, especially for large-scale and high-dimensional datasets.
We design an algorithm based on the core set to compute similarity for large-scale, high-dimensional data efficiently.
arXiv Detail & Related papers (2023-04-25T07:32:16Z) - Privacy Adhering Machine Un-learning in NLP [66.17039929803933]
Real-world industry applications use Machine Learning to build models on user data.
Privacy regulations mandate the removal of user data upon request, which requires effort both in terms of data removal and model retraining.
Continuous removal of data followed by model retraining does not scale.
We propose Machine Unlearning to tackle this challenge.
arXiv Detail & Related papers (2022-12-19T16:06:45Z) - Data Debugging with Shapley Importance over End-to-End Machine Learning
Pipelines [27.461398584509755]
DataScope is the first system that efficiently computes Shapley values of training examples over an end-to-end machine learning pipeline.
Our results show that DataScope is up to four orders of magnitude faster than state-of-the-art Monte Carlo-based methods.
arXiv Detail & Related papers (2022-04-23T19:29:23Z) - Machine Learning Model Drift Detection Via Weak Data Slices [5.319802998033767]
We propose a method that utilizes feature space rules, called data slices, for drift detection.
We provide experimental indications that our method can identify likely changes in ML model performance based on changes in the underlying data.
arXiv Detail & Related papers (2021-08-11T16:55:34Z) - Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns
Inferred from Data Lakes [16.392844962056742]
We develop a corpus-driven approach to auto-validate machine-generated data by inferring suitable data-validation "patterns".
Part of this technology ships as an Auto-Tag feature in Microsoft Azure Purview.
arXiv Detail & Related papers (2021-04-10T01:15:48Z) - PClean: Bayesian Data Cleaning at Scale with Domain-Specific
Probabilistic Programming [65.88506015656951]
We present PClean, a probabilistic programming language for leveraging dataset-specific knowledge to clean and normalize dirty data.
PClean is powered by three modeling and inference contributions: (1) a non-parametric model of relational database instances, customizable via probabilistic programs, (2) a sequential Monte Carlo inference algorithm that exploits the model's structure, and (3) near-optimal SMC proposals and blocked Gibbs rejuvenation moves constructed on a per-dataset basis.
arXiv Detail & Related papers (2020-07-23T08:01:47Z)