The Missing Indicator Method: From Low to High Dimensions
- URL: http://arxiv.org/abs/2211.09259v1
- Date: Wed, 16 Nov 2022 23:10:45 GMT
- Title: The Missing Indicator Method: From Low to High Dimensions
- Authors: Mike Van Ness, Tomas M. Bosschieter, Roberto Halpin-Gregorio,
Madeleine Udell
- Abstract summary: Missing data is common in applied data science, particularly in healthcare, social sciences, and natural sciences.
For data sets with informative missing patterns, the Missing Indicator Method (MIM) can be used in conjunction with imputation to improve model performance.
We show experimentally that MIM improves performance for informative missing values, and we prove that MIM does not hurt linear models for uninformative missing values.
We introduce Selective MIM, a method that adds missing indicators only for features that have informative missing patterns.
- Score: 16.899237833310064
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Missing data is common in applied data science, particularly for tabular data
sets found in healthcare, social sciences, and natural sciences. Most
supervised learning methods work only on complete data, thus requiring
preprocessing, such as missing value imputation, to work on incomplete data
sets. However, imputation discards potentially useful information encoded by
the pattern of missing values. For data sets with informative missing patterns,
the Missing Indicator Method (MIM), which adds indicator variables to indicate
the missing pattern, can be used in conjunction with imputation to improve
model performance. We show experimentally that MIM improves performance for
informative missing values, and we prove that MIM does not hurt linear models
asymptotically for uninformative missing values. Nonetheless, MIM can increase
variance if many of the added indicators are uninformative, causing harm
particularly for high-dimensional data sets. To address this issue, we
introduce Selective MIM (SMIM), a method that adds missing indicators only for
features that have informative missing patterns. We show empirically that SMIM
performs at least as well as MIM across a range of experimental settings, and
improves MIM for high-dimensional data.
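To make the method concrete, here is a minimal sketch of MIM using scikit-learn's `SimpleImputer(add_indicator=True)`, followed by a SMIM-style selection step that keeps an indicator only when its missingness pattern is associated with the target, here via a chi-squared test. The chi-squared rule and the 0.05 threshold are illustrative assumptions, not the paper's exact selection criterion.

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
y = rng.integers(0, 2, size=n)

# Feature 0: informative missingness (missing only when y == 1).
X[(y == 1) & (rng.random(n) < 0.6), 0] = np.nan
# Feature 1: uninformative missingness (missing completely at random).
X[rng.random(n) < 0.3, 1] = np.nan

# MIM: impute, and append one indicator column per feature with missing values.
X_mim = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)

# SMIM-style filter (illustrative assumption): keep an indicator only if
# missingness is associated with the target, via a 2x2 chi-squared test.
keep = []
for j in range(X.shape[1]):
    miss = np.isnan(X[:, j])
    if not miss.any():
        continue  # no indicator was created for this feature
    table = [[np.sum(miss & (y == 0)), np.sum(miss & (y == 1))],
             [np.sum(~miss & (y == 0)), np.sum(~miss & (y == 1))]]
    _, p, _, _ = chi2_contingency(table)
    keep.append(p < 0.05)

d = X.shape[1]
X_smim = np.hstack([X_mim[:, :d], X_mim[:, d:][:, np.array(keep)]])
print(X_mim.shape, X_smim.shape)  # the MCAR indicator is typically dropped
```

The key point is only that indicator selection happens before model fitting, so uninformative indicators never enter the model to inflate its variance.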
Related papers
- Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing the influence of a small "forget set" of training data from a pre-trained machine learning model -- has recently attracted interest.
Recent research shows that existing machine unlearning techniques do not hold up in such challenging settings.
arXiv Detail & Related papers (2024-10-30T17:20:10Z)
- M$^3$-Impute: Mask-guided Representation Learning for Missing Value Imputation [12.174699459648842]
M$^3$-Impute aims to explicitly leverage the missingness information and the underlying data correlations through novel masking schemes.
Experimental results show the effectiveness of M$^3$-Impute, which achieves 20 best and 4 second-best MAE scores on average.
arXiv Detail & Related papers (2024-10-11T13:25:32Z)
- On the Performance of Imputation Techniques for Missing Values on Healthcare Datasets [0.0]
Missing values are a common characteristic of real-world datasets, especially healthcare data.
This study compares the performance of seven imputation techniques: mean imputation, median imputation, Last Observation Carried Forward (LOCF), K-Nearest Neighbor (KNN) imputation, interpolation imputation, MissForest imputation, and Multiple Imputation by Chained Equations (MICE).
The results show that MissForest imputation performs best, followed by MICE imputation.
arXiv Detail & Related papers (2024-03-13T18:07:17Z)
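For reference, most of these techniques have off-the-shelf counterparts, so a small benchmark is easy to set up: mask known entries, impute, and score MAE on the masked cells. In the sketch below, scikit-learn's `IterativeImputer` stands in for MICE, and the same class with a random-forest estimator stands in for MissForest; the study's datasets and exact configurations are not reproduced here.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

rng = np.random.default_rng(0)
X_true = pd.DataFrame(rng.normal(size=(300, 4)), columns=list("abcd"))
mask = rng.random(X_true.shape) < 0.2          # hide 20% of entries at random
X_miss = X_true.mask(mask)

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
    "mice-like": IterativeImputer(random_state=0),
    "missforest-like": IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=50, random_state=0),
        random_state=0),
}
results = {name: imp.fit_transform(X_miss) for name, imp in imputers.items()}
# LOCF (forward-fill, then back-fill leading gaps) and linear interpolation:
results["locf"] = X_miss.ffill().bfill().to_numpy()
results["interpolate"] = X_miss.interpolate(limit_direction="both").to_numpy()

for name, X_hat in results.items():
    mae = np.abs(X_hat[mask] - X_true.to_numpy()[mask]).mean()
    print(f"{name:16s} MAE = {mae:.3f}")
```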
- Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice.
We investigate whether we can go beyond human data on tasks where we have access to scalar feedback.
We find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data.
arXiv Detail & Related papers (2023-12-11T18:17:43Z)
- Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of data points.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone data points with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z)
- MissDAG: Causal Discovery in the Presence of Missing Data with Continuous Additive Noise Models [78.72682320019737]
We develop a general method, which we call MissDAG, to perform causal discovery from data with incomplete observations.
MissDAG maximizes the expected likelihood of the visible part of observations under the expectation-maximization framework.
We demonstrate the flexibility of MissDAG for incorporating various causal discovery algorithms and its efficacy through extensive simulations and real data experiments.
arXiv Detail & Related papers (2022-05-27T09:59:46Z)
- To Impute or not to Impute? -- Missing Data in Treatment Effect Estimation [84.76186111434818]
We identify a new missingness mechanism, which we term mixed confounded missingness (MCM), where some missingness determines treatment selection and other missingness is determined by treatment selection.
We show that naively imputing all data leads to poorly performing treatment effect models, as the act of imputation effectively removes information necessary to provide unbiased estimates.
Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not.
arXiv Detail & Related papers (2022-02-04T12:08:31Z)
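Mechanically, selective imputation amounts to splitting the columns into those safe to impute and those whose missingness must stay visible to the model. The sketch below is one simplified reading of that idea (retaining explicit indicators for the selection-driving columns); the column split is supplied by hand here as a hypothetical input, whereas the paper derives it from the MCM analysis.

```python
import numpy as np
import pandas as pd

def selective_impute(df, impute_cols, indicator_cols):
    """Impute `impute_cols` outright; for `indicator_cols`, keep the
    missingness visible via explicit 0/1 indicators before filling."""
    out = df.copy()
    # Missingness assumed unrelated to treatment selection: safe to impute away.
    out[impute_cols] = out[impute_cols].fillna(out[impute_cols].mean())
    # Missingness assumed to drive treatment selection: preserve the signal.
    for c in indicator_cols:
        out[c + "_missing"] = df[c].isna().astype(int)
        out[c] = out[c].fillna(out[c].mean())
    return out

# Hypothetical example: missingness of 'lab_a' influences who gets treated,
# while 'age' is missing completely at random.
df = pd.DataFrame({"age": [34.0, np.nan, 51.0], "lab_a": [1.2, np.nan, np.nan]})
print(selective_impute(df, impute_cols=["age"], indicator_cols=["lab_a"]))
```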
- Model-based Clustering with Missing Not At Random Data [0.8777702580252754]
We propose model-based clustering algorithms designed to handle very general types of missing data, including MNAR data.
Several MNAR models are discussed, for which the cause of the missingness can depend both on the values of the missing variables themselves and on the class membership.
We focus on a specific MNAR model, called MNARz, for which the missingness only depends on the class membership.
arXiv Detail & Related papers (2021-12-20T09:52:12Z)
- Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the Importance-Guided Stochastic Gradient Descent (IGSGD) method to train models and perform inference on inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z)
- VAEs in the Presence of Missing Data [6.397263087026567]
We develop a novel latent variable model of a corruption process which generates missing data, and derive a corresponding tractable evidence lower bound (ELBO).
Our model is straightforward to implement, can handle both missing completely at random (MCAR) and missing not at random (MNAR) data, scales to high-dimensional inputs, and gives both the VAE encoder and decoder access to indicator variables marking which data elements are missing.
On the MNIST and SVHN datasets we demonstrate improved marginal log-likelihood of observed data and better missing data imputation, compared to existing approaches.
arXiv Detail & Related papers (2020-06-09T14:40:00Z)
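The architectural point in that summary, giving both the encoder and the decoder access to the missingness indicators, is easy to see in code. The PyTorch sketch below is a generic mask-conditioned VAE with a Gaussian reconstruction term scored only on observed entries; it is an assumption-laden stand-in, not the paper's exact model or ELBO.

```python
import torch
import torch.nn as nn

class MaskAwareVAE(nn.Module):
    """VAE whose encoder and decoder both see the missingness mask
    (mask[i, j] == 1 marks x[i, j] as missing)."""
    def __init__(self, d, hidden=64, latent=8):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Linear(2 * d, hidden), nn.ReLU(), nn.Linear(hidden, 2 * latent))
        self.dec = nn.Sequential(
            nn.Linear(latent + d, hidden), nn.ReLU(), nn.Linear(hidden, d))

    def loss(self, x, mask):
        x0 = x.masked_fill(mask.bool(), 0.0)  # zero-fill missing entries
        # Encoder sees the zero-filled data concatenated with the mask.
        mu, logvar = self.enc(torch.cat([x0, mask], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        # Decoder also sees the mask.
        x_hat = self.dec(torch.cat([z, mask], -1))
        # Reconstruction scored on observed entries only; standard Gaussian KL.
        recon = ((x_hat - x0) ** 2 * (1 - mask)).sum(-1)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1)
        return (recon + kl).mean()

model = MaskAwareVAE(d=10)
x = torch.randn(32, 10)
mask = (torch.rand(32, 10) < 0.3).float()  # 1 = missing
loss = model.loss(x, mask)
loss.backward()
```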
This list is automatically generated from the titles and abstracts of the papers on this site.