Simple Imputation Rules for Prediction with Missing Data: Contrasting
Theoretical Guarantees with Empirical Performance
- URL: http://arxiv.org/abs/2104.03158v3
- Date: Fri, 2 Feb 2024 07:58:40 GMT
- Title: Simple Imputation Rules for Prediction with Missing Data: Contrasting
Theoretical Guarantees with Empirical Performance
- Authors: Dimitris Bertsimas, Arthur Delarue, Jean Pauphilet
- Abstract summary: Missing data is a common issue in real-world datasets.
This paper studies the performance of impute-then-regress pipelines by contrasting theoretical and empirical evidence.
- Score: 7.642646077340124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Missing data is a common issue in real-world datasets. This paper studies the
performance of impute-then-regress pipelines by contrasting theoretical and
empirical evidence. We establish the asymptotic consistency of such pipelines
for a broad family of imputation methods. While common sense suggests that a
`good' imputation method produces datasets that are plausible, we show, on the
contrary, that, as far as prediction is concerned, crude can be good. Among
others, we find that mode-impute is asymptotically sub-optimal, while
mean-impute is asymptotically optimal. We then exhaustively assess the validity
of these theoretical conclusions on a large corpus of synthetic, semi-real, and
real datasets. While the empirical evidence we collect mostly supports our
theoretical findings, it also highlights gaps between theory and practice and
opportunities for future research, regarding the relevance of the MAR
assumption, the complex interdependency between the imputation and regression
tasks, and the need for realistic synthetic data generation models.
Related papers
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z) - The Real Deal Behind the Artificial Appeal: Inferential Utility of Tabular Synthetic Data [40.165159490379146]
We show that the rate of false-positive findings (type 1 error) will be unacceptably high, even when the estimates are unbiased.
Despite the use of a previously proposed correction factor, this problem persists for deep generative models.
arXiv Detail & Related papers (2023-12-13T02:04:41Z) - Prototype-based Aleatoric Uncertainty Quantification for Cross-modal
Retrieval [139.21955930418815]
Cross-modal Retrieval methods build similarity relations between vision and language modalities by jointly learning a common representation space.
However, the predictions are often unreliable due to the Aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts.
We propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arisen from the inherent data ambiguity.
arXiv Detail & Related papers (2023-09-29T09:41:19Z) - Towards Characterizing Domain Counterfactuals For Invertible Latent Causal Models [15.817239008727789]
In this work, we analyze a specific type of causal query called domain counterfactuals, which hypothesizes what a sample would have looked like if it had been generated in a different domain.
We show that recovering the latent Structural Causal Model (SCM) is unnecessary for estimating domain counterfactuals.
We also develop a theoretically grounded practical algorithm that simplifies the modeling process to generative model estimation.
arXiv Detail & Related papers (2023-06-20T04:19:06Z) - Advancing Counterfactual Inference through Nonlinear Quantile Regression [77.28323341329461]
We propose a framework for efficient and effective counterfactual inference implemented with neural networks.
The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data.
Empirical results conducted on multiple datasets offer compelling support for our theoretical assertions.
arXiv Detail & Related papers (2023-06-09T08:30:51Z) - Utility Theory of Synthetic Data Generation [14.061357975073319]
This paper bridges the practice-theory gap by establishing relevant utility theory in a statistical learning framework.
It considers two utility metrics: generalization and ranking of models trained on synthetic data.
arXiv Detail & Related papers (2023-05-17T07:49:16Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Learnability of Competitive Threshold Models [11.005966612053262]
We study the learnability of the competitive threshold model from a theoretical perspective.
We demonstrate how competitive threshold models can be seamlessly simulated by artificial neural networks.
arXiv Detail & Related papers (2022-05-08T01:11:51Z) - Double Robust Representation Learning for Counterfactual Prediction [68.78210173955001]
We propose a novel scalable method to learn double-robust representations for counterfactual predictions.
We make robust and efficient counterfactual predictions for both individual and average treatment effects.
The algorithm shows competitive performance with the state-of-the-art on real world and synthetic data.
arXiv Detail & Related papers (2020-10-15T16:39:26Z) - Enabling Counterfactual Survival Analysis with Balanced Representations [64.17342727357618]
Survival data are frequently encountered across diverse medical applications, i.e., drug development, risk profiling, and clinical trials.
We propose a theoretically grounded unified framework for counterfactual inference applicable to survival outcomes.
arXiv Detail & Related papers (2020-06-14T01:15:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.