Does imputation matter? Benchmark for predictive models
- URL: http://arxiv.org/abs/2007.02837v1
- Date: Mon, 6 Jul 2020 15:47:36 GMT
- Title: Does imputation matter? Benchmark for predictive models
- Authors: Katarzyna Woźnica and Przemysław Biecek
- Abstract summary: This paper systematically evaluates the empirical effectiveness of data imputation algorithms for predictive models.
The main contributions are (1) the recommendation of a general method for empirical benchmarking based on real-life classification tasks and (2) a comparative analysis of different imputation methods across data sets and ML algorithms.
- Score: 5.802346990263708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Incomplete data are common in practical applications. Most predictive machine
learning models do not handle missing values, so they require some
preprocessing. Although many algorithms are used for data imputation, we do not
understand the impact of the different methods on the performance of predictive
models. This paper is the first to systematically evaluate the empirical
effectiveness of data imputation algorithms for predictive models. The main
contributions are (1) the recommendation of a general method for empirical
benchmarking based on real-life classification tasks and (2) a comparative
analysis of different imputation methods across a collection of data sets and a
collection of ML algorithms.
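As a rough illustration of this benchmark design, the sketch below injects missingness into a standard classification task, applies several imputation strategies, and compares downstream model performance. The dataset, the 20% MCAR missingness, and the choice of imputers and learner are illustrative assumptions, not the paper's exact experimental setup.

```python
# Minimal sketch of the benchmarking idea: inject missingness, impute with
# several strategies, and compare downstream classifier performance.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)

# Inject 20% missing values completely at random (MCAR) -- an assumption
# made here for illustration only.
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.20] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(random_state=0),
}

# Imputation happens inside the pipeline so cross-validation fits each
# imputer on training folds only, avoiding leakage into the test folds.
for name, imputer in imputers.items():
    pipe = make_pipeline(imputer, RandomForestClassifier(random_state=0))
    scores = cross_val_score(pipe, X_miss, y, cv=5, scoring="roc_auc")
    print(f"{name:>9}: AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```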
Related papers
- Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing the influence of a small "forget set" of training data from a pre-trained machine learning model -- has recently attracted interest.
Recent research shows that existing machine unlearning techniques do not hold up under more challenging evaluation settings.
arXiv Detail & Related papers (2024-10-30T17:20:10Z) - Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z) - Machine Learning Based Missing Values Imputation in Categorical Datasets [2.5611256859404983]
This research looked into the use of machine learning algorithms to fill in the gaps in categorical datasets.
The emphasis was on ensemble models constructed using the Error-Correcting Output Codes (ECOC) framework (a stand-in ECOC sketch follows after this list).
Despite these encouraging results, deep learning for missing data imputation still faces obstacles, including the requirement for large amounts of labeled data.
arXiv Detail & Related papers (2023-06-10T03:29:48Z) - Assessing the Generalizability of a Performance Predictive Model [0.6070952062639761]
We propose a workflow to estimate the generalizability of a predictive model for algorithm performance.
The results show that generalizability patterns in the landscape feature space are reflected in the performance space.
arXiv Detail & Related papers (2023-05-31T12:50:44Z) - A Comparison of Modeling Preprocessing Techniques [0.0]
This paper compares the performance of various data processing methods in terms of predictive performance for structured data.
Three data sets of various structures, interactions, and complexity were constructed.
We compare several methods for feature selection, categorical handling, and null imputation.
arXiv Detail & Related papers (2023-02-23T14:11:08Z) - Machine Learning Capability: A standardized metric using case difficulty
with applications to individualized deployment of supervised machine learning [2.2060666847121864]
Model evaluation is a critical component in supervised machine learning classification analyses.
Item Response Theory (IRT) and Computer Adaptive Testing (CAT) combined with machine learning can benchmark datasets independently of the end classification results.
arXiv Detail & Related papers (2023-02-09T00:38:42Z) - HyperImpute: Generalized Iterative Imputation with Automatic Model
Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces (an iterative-imputation sketch follows after this list).
arXiv Detail & Related papers (2022-06-15T19:10:35Z) - Doing Great at Estimating CATE? On the Neglected Assumptions in
Benchmark Comparisons of Treatment Effect Estimators [91.3755431537592]
We show that even in arguably the simplest setting, estimation under ignorability assumptions can be misleading.
We consider two popular machine learning benchmark datasets for evaluation of heterogeneous treatment effect estimators.
We highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others.
arXiv Detail & Related papers (2021-07-28T13:21:27Z) - ALT-MAS: A Data-Efficient Framework for Active Testing of Machine
Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z) - Performance metrics for intervention-triggering prediction models do not
reflect an expected reduction in outcomes from using the model [71.9860741092209]
Clinical researchers often select among and evaluate risk prediction models.
Standard metrics calculated from retrospective data are only related to model utility under certain assumptions.
When predictions are delivered repeatedly throughout time, the relationship between standard metrics and utility is further complicated.
arXiv Detail & Related papers (2020-06-02T16:26:49Z) - A Comparison of Methods for Treatment Assignment with an Application to
Playlist Generation [13.804332504576301]
We group the various methods proposed in the literature into three general classes of algorithms (or metalearners).
We show analytically and empirically that optimizing for the prediction of outcomes or causal effects is not the same as optimizing for treatment assignments.
This is the first comparison of the three different metalearners on a real-world application at scale.
arXiv Detail & Related papers (2020-04-24T04:56:15Z)
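The ECOC-based categorical imputation described in "Machine Learning Based Missing Values Imputation in Categorical Datasets" can be illustrated with scikit-learn's OutputCodeClassifier: treat the incomplete categorical column as a classification target, train on the complete rows, and predict the missing entries. The synthetic data and the logistic base learner below are assumptions made for the sketch, not the paper's setup.

```python
# Hedged sketch of ECOC-based categorical imputation: fit an
# error-correcting-output-codes ensemble on rows where the category is
# observed, then predict the missing categories.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
# Synthetic 3-class categorical column that depends on the features.
y_cat = (X[:, 0] + X[:, 1] > 0).astype(int) + (X[:, 2] > 0).astype(int)

# Hide 25% of the categorical column to simulate missingness.
missing = rng.random(len(y_cat)) < 0.25
ecoc = OutputCodeClassifier(LogisticRegression(max_iter=1000),
                            code_size=2, random_state=0)
ecoc.fit(X[~missing], y_cat[~missing])

# Impute the hidden entries with the ensemble's predictions.
y_filled = y_cat.copy()
y_filled[missing] = ecoc.predict(X[missing])
print("imputation accuracy:", (y_filled[missing] == y_cat[missing]).mean())
```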
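HyperImpute's iterative, column-wise imputation can likewise be approximated with scikit-learn's IterativeImputer as a stand-in; the authors ship their own implementation, and the fixed ExtraTrees estimator below replaces HyperImpute's automatic per-column model selection, so this is an analogy rather than the paper's method.

```python
# Sketch of iterative, column-wise imputation: each round regresses one
# column on the current estimates of the others, repeating until the
# imputed values stabilize.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X[:, 2] = 0.5 * X[:, 0] - X[:, 1]       # correlated column aids imputation
X[rng.random(X.shape) < 0.15] = np.nan  # 15% MCAR missingness (assumption)

imputer = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=50, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
print("remaining NaNs:", np.isnan(X_imputed).sum())
```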
This list is automatically generated from the titles and abstracts of the papers on this site.