RIFLE: Imputation and Robust Inference from Low Order Marginals
- URL: http://arxiv.org/abs/2109.00644v3
- Date: Wed, 13 Sep 2023 00:17:41 GMT
- Title: RIFLE: Imputation and Robust Inference from Low Order Marginals
- Authors: Sina Baharlouei, Kelechi Ogudu, Sze-chuan Suen, Meisam Razaviyayn
- Abstract summary: We develop a statistical inference framework for regression and classification in the presence of missing data without imputation.
Our framework, RIFLE, estimates low-order moments of the underlying data distribution with corresponding confidence intervals to learn a distributionally robust model.
Our experiments demonstrate that RIFLE outperforms other benchmark algorithms when the percentage of missing values is high and/or when the number of data points is relatively small.
- Score: 10.082738539201804
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ubiquity of missing values in real-world datasets poses a challenge for
statistical inference and can prevent similar datasets from being analyzed in
the same study, precluding many existing datasets from being used for new
analyses. While an extensive collection of packages and algorithms have been
developed for data imputation, the overwhelming majority perform poorly if
there are many missing values and low sample sizes, which are unfortunately
common characteristics in empirical data. Such low-accuracy estimations
adversely affect the performance of downstream statistical models. We develop a
statistical inference framework for regression and classification in the
presence of missing data without imputation. Our framework, RIFLE (Robust
InFerence via Low-order moment Estimations), estimates low-order moments of the
underlying data distribution with corresponding confidence intervals to learn a
distributionally robust model. We specialize our framework to linear regression
and normal discriminant analysis, and we provide convergence and performance
guarantees. This framework can also be adapted to impute missing data. In
numerical experiments, we compare RIFLE to several state-of-the-art approaches
(including MICE, Amelia, MissForest, KNN-imputer, MIDA, and Mean Imputer) for
imputation and inference in the presence of missing values. Our experiments
demonstrate that RIFLE outperforms other benchmark algorithms when the
percentage of missing values is high and/or when the number of data points is
relatively small. RIFLE is publicly available at
https://github.com/optimization-for-data-driven-science/RIFLE.
Related papers
- Performance of Cross-Validated Targeted Maximum Likelihood Estimation [0.0]
We investigate the performance of CVTMLE compared to TMLE in various settings.
CVTMLE vastly improves confidence interval coverage without adversely affecting bias.
We show through simulations that CVTMLE is much less sensitive to the choice of the super learner library.
arXiv Detail & Related papers (2024-09-17T15:15:03Z) - Evaluation of Missing Data Analytical Techniques in Longitudinal Research: Traditional and Machine Learning Approaches [11.048092826888412]
This study utilizes Monte Carlo simulations to assess and compare the effectiveness of six analytical techniques for missing data within the growth curve modeling framework.
We investigate the influence of sample size, missing data rate, missing data mechanism, and data distribution on the accuracy and efficiency of model estimation.
arXiv Detail & Related papers (2024-06-19T20:20:30Z) - On the Performance of Empirical Risk Minimization with Smoothed Data [59.3428024282545]
Empirical Risk Minimization (ERM) is able to achieve sublinear error whenever a class is learnable with iid data.
We show that ERM is able to achieve sublinear error whenever a class is learnable with iid data.
arXiv Detail & Related papers (2024-02-22T21:55:41Z) - Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z) - IRTCI: Item Response Theory for Categorical Imputation [5.9952530228468754]
Several imputation techniques have been designed to replace missing data with stand in values.
The work showcased here offers a novel means for categorical imputation based on item response theory (IRT)
Analyses comparing these techniques were performed on three different datasets.
arXiv Detail & Related papers (2023-02-08T16:17:20Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples.
arXiv Detail & Related papers (2022-01-11T23:01:12Z) - Distributionally Robust Multi-Output Regression Ranking [3.9318191265352196]
We introduce a new listwise listwise learning-to-rank model called Distributionally Robust Multi-output Regression Ranking (DRMRR)
DRMRR uses a Distributionally Robust Optimization framework to minimize a multi-output loss function under the most adverse distributions in the neighborhood of the empirical data distribution.
Our experiments were conducted on two real-world applications, medical document retrieval, and drug response prediction.
arXiv Detail & Related papers (2021-09-27T05:19:27Z) - Examining and Combating Spurious Features under Distribution Shift [94.31956965507085]
We define and analyze robust and spurious representations using the information-theoretic concept of minimal sufficient statistics.
We prove that even when there is only bias of the input distribution, models can still pick up spurious features from their training data.
Inspired by our analysis, we demonstrate that group DRO can fail when groups do not directly account for various spurious correlations.
arXiv Detail & Related papers (2021-06-14T05:39:09Z) - Risk Minimization from Adaptively Collected Data: Guarantees for
Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class.
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
arXiv Detail & Related papers (2021-06-03T09:50:13Z) - SLOE: A Faster Method for Statistical Inference in High-Dimensional
Logistic Regression [68.66245730450915]
We develop an improved method for debiasing predictions and estimating frequentist uncertainty for practical datasets.
Our main contribution is SLOE, an estimator of the signal strength with convergence guarantees that reduces the computation time of estimation and inference by orders of magnitude.
arXiv Detail & Related papers (2021-03-23T17:48:56Z) - Matrix Completion with Quantified Uncertainty through Low Rank Gaussian
Copula [30.84155327760468]
This paper proposes a framework for missing value imputation with quantified uncertainty.
The time required to fit the model scales linearly with the number of rows and the number of columns in the dataset.
Empirical results show the method yields state-of-the-art imputation accuracy across a wide range of data types.
arXiv Detail & Related papers (2020-06-18T19:51:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.