Simple Imputation Rules for Prediction with Missing Data: Contrasting
Theoretical Guarantees with Empirical Performance
- URL: http://arxiv.org/abs/2104.03158v3
- Date: Fri, 2 Feb 2024 07:58:40 GMT
- Authors: Dimitris Bertsimas, Arthur Delarue, Jean Pauphilet
- Abstract summary: Missing data is a common issue in real-world datasets.
This paper studies the performance of impute-then-regress pipelines by contrasting theoretical and empirical evidence.
- Score: 7.642646077340124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Missing data is a common issue in real-world datasets. This paper studies the
performance of impute-then-regress pipelines by contrasting theoretical and
empirical evidence. We establish the asymptotic consistency of such pipelines
for a broad family of imputation methods. While common sense suggests that a
`good' imputation method produces datasets that are plausible, we show, on the
contrary, that, as far as prediction is concerned, crude can be good. Among
others, we find that mode-impute is asymptotically sub-optimal, while
mean-impute is asymptotically optimal. We then exhaustively assess the validity
of these theoretical conclusions on a large corpus of synthetic, semi-real, and
real datasets. While the empirical evidence we collect mostly supports our
theoretical findings, it also highlights gaps between theory and practice and
opportunities for future research, regarding the relevance of the MAR
assumption, the complex interdependency between the imputation and regression
tasks, and the need for realistic synthetic data generation models.
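The impute-then-regress pipeline studied in the abstract can be sketched in a few lines: fill each missing entry with its column mean, then fit an ordinary regression on the completed matrix. The snippet below is an illustrative sketch only (it is not the authors' code, and the synthetic data-generating process and missingness rate are assumptions for demonstration), using MCAR missingness and least squares:

```python
# Minimal sketch of a mean-impute-then-regress pipeline (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = X @ w + noise, with ~20% of entries missing
# completely at random (MCAR).
n, d = 500, 3
X = rng.normal(size=(n, d))
w = np.array([1.0, -2.0, 0.5])
y = X @ w + 0.1 * rng.normal(size=n)
X_miss = X.copy()
X_miss[rng.random(size=(n, d)) < 0.2] = np.nan

# Step 1: mean-impute each column from its observed entries.
col_means = np.nanmean(X_miss, axis=0)
X_imp = np.where(np.isnan(X_miss), col_means, X_miss)

# Step 2: regress on the imputed design matrix (OLS with intercept).
A = np.column_stack([X_imp, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)
```

With enough data the fitted coefficients recover the signs and rough magnitudes of `w`, which is the flavor of asymptotic result the paper formalizes; mode-impute would replace `np.nanmean` with a per-column mode.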
Related papers
- Causal Representation Learning from Multimodal Biological Observations [57.00712157758845]
We aim to develop flexible identification conditions for multimodal data.
We establish identifiability guarantees for each latent component, extending the subspace identification results from prior work.
Our key theoretical ingredient is the structural sparsity of the causal connections among distinct modalities.
arXiv Detail & Related papers (2024-11-10T16:40:27Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
- The Real Deal Behind the Artificial Appeal: Inferential Utility of Tabular Synthetic Data [40.165159490379146]
We show that the rate of false-positive findings (type 1 error) will be unacceptably high, even when the estimates are unbiased.
Despite the use of a previously proposed correction factor, this problem persists for deep generative models.
arXiv Detail & Related papers (2023-12-13T02:04:41Z)
- Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval [139.21955930418815]
Cross-modal Retrieval methods build similarity relations between vision and language modalities by jointly learning a common representation space.
However, the predictions are often unreliable due to aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts.
We propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework that provides trustworthy predictions by quantifying the uncertainty arising from inherent data ambiguity.
arXiv Detail & Related papers (2023-09-29T09:41:19Z)
- Towards Characterizing Domain Counterfactuals For Invertible Latent Causal Models [15.817239008727789]
In this work, we analyze a specific type of causal query called domain counterfactuals, which hypothesizes what a sample would have looked like if it had been generated in a different domain.
We show that recovering the latent Structural Causal Model (SCM) is unnecessary for estimating domain counterfactuals.
We also develop a theoretically grounded practical algorithm that simplifies the modeling process to generative model estimation.
arXiv Detail & Related papers (2023-06-20T04:19:06Z)
- Advancing Counterfactual Inference through Nonlinear Quantile Regression [77.28323341329461]
We propose a framework for efficient and effective counterfactual inference implemented with neural networks.
The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data.
Empirical results conducted on multiple datasets offer compelling support for our theoretical assertions.
arXiv Detail & Related papers (2023-06-09T08:30:51Z)
- Utility Theory of Synthetic Data Generation [12.511220449652384]
This paper bridges the practice-theory gap by establishing relevant utility theory in a statistical learning framework.
It considers two utility metrics: generalization and ranking of models trained on synthetic data.
arXiv Detail & Related papers (2023-05-17T07:49:16Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Double Robust Representation Learning for Counterfactual Prediction [68.78210173955001]
We propose a novel scalable method to learn double-robust representations for counterfactual predictions.
We make robust and efficient counterfactual predictions for both individual and average treatment effects.
The algorithm shows competitive performance with the state-of-the-art on real world and synthetic data.
arXiv Detail & Related papers (2020-10-15T16:39:26Z)
- One Step to Efficient Synthetic Data [9.3000873953175]
A common approach to synthetic data is to sample from a fitted model.
We show that this approach results in a sample with inefficient estimators and whose joint distribution is inconsistent with the true distribution.
Motivated by this, we propose a general method of producing synthetic data.
arXiv Detail & Related papers (2020-06-03T17:12:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.