A Unified Framework for Inference with General Missingness Patterns and Machine Learning Imputation
- URL: http://arxiv.org/abs/2508.15162v1
- Date: Thu, 21 Aug 2025 01:59:59 GMT
- Title: A Unified Framework for Inference with General Missingness Patterns and Machine Learning Imputation
- Authors: Xingran Chen, Tyler McCormick, Bhramar Mukherjee, Zhenke Wu,
- Abstract summary: This paper develops a novel method which delivers valid statistical inference framework for general Z-estimation problems.<n>We provide theoretical guarantees of normality of the proposed estimator and efficiency dominance over weighted complete-case analyses.
- Score: 12.817707155207817
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained machine learning (ML) predictions have been increasingly used to complement incomplete data to enable downstream scientific inquiries, but their naive integration risks biased inferences. Recently, multiple methods have been developed to provide valid inference with ML imputations regardless of prediction quality and to enhance efficiency relative to complete-case analyses. However, existing approaches are often limited to missing outcomes under a missing-completely-at-random (MCAR) assumption, failing to handle general missingness patterns under the more realistic missing-at-random (MAR) assumption. This paper develops a novel method which delivers valid statistical inference framework for general Z-estimation problems using ML imputations under the MAR assumption and for general missingness patterns. The core technical idea is to stratify observations by distinct missingness patterns and construct an estimator by appropriately weighting and aggregating pattern-specific information through a masking-and-imputation procedure on the complete cases. We provide theoretical guarantees of asymptotic normality of the proposed estimator and efficiency dominance over weighted complete-case analyses. Practically, the method affords simple implementations by leveraging existing weighted complete-case analysis software. Extensive simulations are carried out to validate theoretical results. The paper concludes with a brief discussion on practical implications, limitations, and potential future directions.
Related papers
- Imputation-Powered Inference [2.676349883103404]
imputation-powered inference (IPI) is a model-lean framework that combines the flexibility of blackbox imputation with bias correction using fully observed data.<n>IPI may substantially improve subpopulation efficiency relative to complete-case analysis, while maintaining statistical validity in settings where both doubly robust estimators and naive imputation fail to achieve nominal coverage.
arXiv Detail & Related papers (2025-09-17T07:48:54Z) - Multiply Robust Conformal Risk Control with Coarsened Data [0.0]
Conformal Prediction (CP) has recently received a tremendous amount of interest.<n>In this paper, we consider the general problem of obtaining distribution-free valid prediction regions for an outcome given coarsened data.<n>Our principled use of semiparametric theory has the key advantage of facilitating flexible machine learning methods.
arXiv Detail & Related papers (2025-08-21T12:14:44Z) - Uncertainty Quantification for Machine Learning-Based Prediction: A Polynomial Chaos Expansion Approach for Joint Model and Input Uncertainty Propagation [1.223779595809275]
This paper presents a robust framework based on Polynomial Chaos Expansion (PCE) to handle joint input and model uncertainty propagation.<n>By transforming all random inputs into a unified standard space, a PCE surrogate model is constructed, allowing efficient and accurate calculation of the mean and standard deviation of the output.
arXiv Detail & Related papers (2025-07-20T01:47:50Z) - MIRRAMS: Learning Robust Tabular Models under Unseen Missingness Shifts [2.5357049657770516]
Missing values often reflect variations in data collection policies, which may shift across time or locations.<n>Such shifts in the missingness distribution between training and test inputs pose a significant challenge to achieving robust predictive performance.<n>We propose a novel deep learning framework designed to address this challenge, particularly in the common yet challenging scenario where the test-time dataset is unseen.
arXiv Detail & Related papers (2025-07-11T03:03:30Z) - Prediction-Powered E-Values [0.66567375919026]
We apply ideas of prediction-powered inference to e-values.<n>We show that every inference procedure that can be framed in terms of e-values has a prediction-powered counterpart.<n>Our approach is modular and easily integrable into existing algorithms.
arXiv Detail & Related papers (2025-02-06T18:36:01Z) - Testing Generalizability in Causal Inference [3.547529079746247]
No formal procedure exists for statistically evaluating generalizability in machine learning algorithms.<n>We propose a systematic framework for statistically evaluating the generalizability of high-dimensional causal inference models.
arXiv Detail & Related papers (2024-11-05T11:44:00Z) - Generalization Bounds of Surrogate Policies for Combinatorial Optimization Problems [53.03951222945921]
We analyze smoothed (perturbed) policies, adding controlled random perturbations to the direction used by the linear oracle.<n>Our main contribution is a generalization bound that decomposes the excess risk into perturbation bias, statistical estimation error, and optimization error.<n>We illustrate the scope of the results on applications such as vehicle scheduling, highlighting how smoothing enables both tractable training and controlled generalization.
arXiv Detail & Related papers (2024-07-24T12:00:30Z) - Task-Agnostic Machine-Learning-Assisted Inference [0.0]
We introduce a novel statistical framework named PSPS for task-agnostic ML-assisted inference.
PSPS provides a post-prediction inference solution that can be easily plugged into almost any established data analysis routines.
arXiv Detail & Related papers (2024-05-30T13:19:49Z) - Weak Supervision Performance Evaluation via Partial Identification [46.73061437177238]
Programmatic Weak Supervision (PWS) enables supervised model training without direct access to ground truth labels.
We present a novel method to address this challenge by framing model evaluation as a partial identification problem.
Our approach derives reliable bounds on key metrics without requiring labeled data, overcoming core limitations in current weak supervision evaluation techniques.
arXiv Detail & Related papers (2023-12-07T07:15:11Z) - Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
arXiv Detail & Related papers (2023-11-15T05:58:35Z) - MaxMatch: Semi-Supervised Learning with Worst-Case Consistency [149.03760479533855]
We propose a worst-case consistency regularization technique for semi-supervised learning (SSL)
We present a generalization bound for SSL consisting of the empirical loss terms observed on labeled and unlabeled training data separately.
Motivated by this bound, we derive an SSL objective that minimizes the largest inconsistency between an original unlabeled sample and its multiple augmented variants.
arXiv Detail & Related papers (2022-09-26T12:04:49Z) - Pessimistic Q-Learning for Offline Reinforcement Learning: Towards
Optimal Sample Complexity [51.476337785345436]
We study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes.
A variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity.
arXiv Detail & Related papers (2022-02-28T15:39:36Z) - Non-Linear Spectral Dimensionality Reduction Under Uncertainty [107.01839211235583]
We propose a new dimensionality reduction framework, called NGEU, which leverages uncertainty information and directly extends several traditional approaches.
We show that the proposed NGEU formulation exhibits a global closed-form solution, and we analyze, based on the Rademacher complexity, how the underlying uncertainties theoretically affect the generalization ability of the framework.
arXiv Detail & Related papers (2022-02-09T19:01:33Z) - Amortized Conditional Normalized Maximum Likelihood: Reliable Out of
Distribution Uncertainty Estimation [99.92568326314667]
We propose the amortized conditional normalized maximum likelihood (ACNML) method as a scalable general-purpose approach for uncertainty estimation.
Our algorithm builds on the conditional normalized maximum likelihood (CNML) coding scheme, which has minimax optimal properties according to the minimum description length principle.
We demonstrate that ACNML compares favorably to a number of prior techniques for uncertainty estimation in terms of calibration on out-of-distribution inputs.
arXiv Detail & Related papers (2020-11-05T08:04:34Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $varepsilon*$, which deviates substantially from the test error of worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.