Impact of Missing Values in Machine Learning: A Comprehensive Analysis
- URL: http://arxiv.org/abs/2410.08295v1
- Date: Thu, 10 Oct 2024 18:31:44 GMT
- Title: Impact of Missing Values in Machine Learning: A Comprehensive Analysis
- Authors: Abu Fuad Ahmad, Md Shohel Sayeed, Khaznah Alshammari, Istiaque Ahmed,
- Abstract summary: This paper aims to examine the nuanced impact of missing values on machine learning (ML) models.
Our analysis focuses on the challenges posed by missing values, including biased inferences, reduced predictive power, and increased computational burdens.
The study employs case studies and real-world examples to illustrate the practical implications of addressing missing values.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning (ML) has become a ubiquitous tool across various domains of data mining and big data analysis. The efficacy of ML models depends heavily on high-quality datasets, which are often complicated by the presence of missing values. Consequently, the performance and generalization of ML models are at risk in the face of such datasets. This paper aims to examine the nuanced impact of missing values on ML workflows, including their types, causes, and consequences. Our analysis focuses on the challenges posed by missing values, including biased inferences, reduced predictive power, and increased computational burdens. The paper further explores strategies for handling missing values, including imputation techniques and removal strategies, and investigates how missing values affect model evaluation metrics and introduces complexities in cross-validation and model selection. The study employs case studies and real-world examples to illustrate the practical implications of addressing missing values. Finally, the discussion extends to future research directions, emphasizing the need for handling missing values ethically and transparently. The primary goal of this paper is to provide insights into the pervasive impact of missing values on ML models and guide practitioners toward effective strategies for achieving robust and reliable model outcomes.
Related papers
- Explanatory Model Monitoring to Understand the Effects of Feature Shifts on Performance [61.06245197347139]
We propose a novel approach to explain the behavior of a black-box model under feature shifts.
We refer to our method that combines concepts from Optimal Transport and Shapley Values as Explanatory Performance Estimation.
arXiv Detail & Related papers (2024-08-24T18:28:19Z) - Explainability of Machine Learning Models under Missing Data [2.880748930766428]
Missing data is a prevalent issue that can significantly impair model performance and interpretability.
This paper briefly summarizes the development of the field of missing data and investigates the effects of various imputation methods on the calculation of Shapley values.
arXiv Detail & Related papers (2024-06-29T11:31:09Z) - Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models [36.05242956018461]
In this paper, we establish a bridge between identifying detrimental training samples via influence functions and outlier gradient detection.
We first validate the hypothesis of our proposed outlier gradient analysis approach on synthetic datasets.
We then demonstrate its effectiveness in detecting mislabeled samples in vision models and selecting data samples for improving performance of natural language processing transformer models.
arXiv Detail & Related papers (2024-05-06T21:34:46Z) - Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach [64.42462708687921]
Evaluations have revealed that factors such as scaling, training types, architectures and other factors profoundly impact the performance of LLMs.
Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods.
This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering technique.
arXiv Detail & Related papers (2024-03-22T14:47:35Z) - Querying Easily Flip-flopped Samples for Deep Active Learning [63.62397322172216]
Active learning is a machine learning paradigm that aims to improve the performance of a model by strategically selecting and querying unlabeled data.
One effective selection strategy is to base it on the model's predictive uncertainty, which can be interpreted as a measure of how informative a sample is.
This paper proposes the it least disagree metric (LDM) as the smallest probability of disagreement of the predicted label.
arXiv Detail & Related papers (2024-01-18T08:12:23Z) - Towards Better Modeling with Missing Data: A Contrastive Learning-based
Visual Analytics Perspective [7.577040836988683]
Missing data can pose a challenge for machine learning (ML) modeling.
Current approaches are categorized into feature imputation and label prediction.
This study proposes a Contrastive Learning framework to model observed data with missing values.
arXiv Detail & Related papers (2023-09-18T13:16:24Z) - To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z) - Metric Tools for Sensitivity Analysis with Applications to Neural
Networks [0.0]
Explainable Artificial Intelligence (XAI) aims to provide interpretations for predictions made by Machine Learning models.
In this paper, a theoretical framework is proposed to study sensitivities of ML models using metric techniques.
A complete family of new quantitative metrics called $alpha$-curves is extracted.
arXiv Detail & Related papers (2023-05-03T18:10:21Z) - Accurate and Robust Feature Importance Estimation under Distribution
Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z) - Influence Functions in Deep Learning Are Fragile [52.31375893260445]
influence functions approximate the effect of samples in test-time predictions.
influence estimates are fairly accurate for shallow networks.
Hessian regularization is important to get highquality influence estimates.
arXiv Detail & Related papers (2020-06-25T18:25:59Z) - Data and Model Dependencies of Membership Inference Attack [13.951470844348899]
We provide an empirical analysis of the impact of both the data and ML model properties on the vulnerability of ML techniques to MIA.
Our results reveal the relationship between MIA accuracy and properties of the dataset and training model in use.
We propose using those data and model properties as regularizers to protect ML models against MIA.
arXiv Detail & Related papers (2020-02-17T09:35:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.