RIFLE: Imputation and Robust Inference from Low Order Marginals
- URL: http://arxiv.org/abs/2109.00644v3
- Date: Wed, 13 Sep 2023 00:17:41 GMT
- Title: RIFLE: Imputation and Robust Inference from Low Order Marginals
- Authors: Sina Baharlouei, Kelechi Ogudu, Sze-chuan Suen, Meisam Razaviyayn
- Abstract summary: We develop a statistical inference framework for regression and classification in the presence of missing data without imputation.
Our framework, RIFLE, estimates low-order moments of the underlying data distribution with corresponding confidence intervals to learn a distributionally robust model.
Our experiments demonstrate that RIFLE outperforms other benchmark algorithms when the percentage of missing values is high and/or when the number of data points is relatively small.
- Score: 10.082738539201804
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ubiquity of missing values in real-world datasets poses a challenge for
statistical inference and can prevent similar datasets from being analyzed in
the same study, precluding many existing datasets from being used for new
analyses. While an extensive collection of packages and algorithms has been
developed for data imputation, the overwhelming majority perform poorly if
there are many missing values and low sample sizes, which are unfortunately
common characteristics in empirical data. Such low-accuracy estimations
adversely affect the performance of downstream statistical models. We develop a
statistical inference framework for regression and classification in the
presence of missing data without imputation. Our framework, RIFLE (Robust
InFerence via Low-order moment Estimations), estimates low-order moments of the
underlying data distribution with corresponding confidence intervals to learn a
distributionally robust model. We specialize our framework to linear regression
and normal discriminant analysis, and we provide convergence and performance
guarantees. This framework can also be adapted to impute missing data. In
numerical experiments, we compare RIFLE to several state-of-the-art approaches
(including MICE, Amelia, MissForest, KNN-imputer, MIDA, and Mean Imputer) for
imputation and inference in the presence of missing values. Our experiments
demonstrate that RIFLE outperforms other benchmark algorithms when the
percentage of missing values is high and/or when the number of data points is
relatively small. RIFLE is publicly available at
https://github.com/optimization-for-data-driven-science/RIFLE.
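To make the idea of "estimating low-order moments with corresponding confidence intervals" concrete, the following is a minimal sketch and not the authors' implementation (see the repository above for that). It assumes a NumPy array in which missing entries are encoded as NaN, estimates each second moment E[x_i x_j] from the rows where both features are observed, and uses a simple percentile bootstrap as one possible way to obtain intervals. The function names `pairwise_second_moments` and `bootstrap_intervals` and the parameters `n_boot` and `alpha` are hypothetical, chosen only for illustration.

```python
import numpy as np


def pairwise_second_moments(X):
    """Estimate E[x_i * x_j] from the rows where both features are observed.

    Assumption of this sketch: missing entries in X are encoded as NaN.
    """
    observed = ~np.isnan(X)
    filled = np.where(observed, X, 0.0)  # zero out missing entries
    # counts[i, j] = number of rows in which features i and j are both observed
    counts = observed.T.astype(float) @ observed.astype(float)
    sums = filled.T @ filled
    with np.errstate(invalid="ignore", divide="ignore"):
        # Pairs that are never observed together stay NaN.
        return np.where(counts > 0, sums / counts, np.nan)


def bootstrap_intervals(X, n_boot=200, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for every second-moment entry."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    estimates = np.stack([
        pairwise_second_moments(X[rng.integers(0, n, size=n)])
        for _ in range(n_boot)
    ])
    lower = np.nanquantile(estimates, alpha / 2, axis=0)
    upper = np.nanquantile(estimates, 1 - alpha / 2, axis=0)
    return lower, upper
```

Intervals of this kind could then serve as the uncertainty set over which a distributionally robust model is fit; that optimization step is not shown here.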
Related papers
- Evaluation of Missing Data Analytical Techniques in Longitudinal Research: Traditional and Machine Learning Approaches [11.048092826888412]
This study utilizes Monte Carlo simulations to assess and compare the effectiveness of six analytical techniques for missing data within the growth curve modeling framework.
We investigate the influence of sample size, missing data rate, missing data mechanism, and data distribution on the accuracy and efficiency of model estimation.
arXiv Detail & Related papers (2024-06-19T20:20:30Z)
- On the Performance of Empirical Risk Minimization with Smoothed Data [59.3428024282545]
We show that Empirical Risk Minimization (ERM) is able to achieve sublinear error whenever a class is learnable with iid data.
arXiv Detail & Related papers (2024-02-22T21:55:41Z)
- Deep Ensembles Meets Quantile Regression: Uncertainty-aware Imputation for Time Series [49.992908221544624]
Time series data often exhibit numerous missing values; filling them in is the time series imputation task.
Previous deep learning methods have been shown to be effective for time series imputation.
We propose a non-generative time series imputation method that produces accurate imputations with inherent uncertainty.
arXiv Detail & Related papers (2023-12-03T05:52:30Z)
- IRTCI: Item Response Theory for Categorical Imputation [5.9952530228468754]
Several imputation techniques have been designed to replace missing data with stand-in values.
The work showcased here offers a novel means for categorical imputation based on item response theory (IRT).
Analyses comparing these techniques were performed on three different datasets.
arXiv Detail & Related papers (2023-02-08T16:17:20Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Distributionally Robust Multi-Output Regression Ranking [3.9318191265352196]
We introduce a new listwise learning-to-rank model called Distributionally Robust Multi-output Regression Ranking (DRMRR).
DRMRR uses a Distributionally Robust Optimization framework to minimize a multi-output loss function under the most adverse distributions in the neighborhood of the empirical data distribution.
Our experiments were conducted on two real-world applications: medical document retrieval and drug response prediction.
arXiv Detail & Related papers (2021-09-27T05:19:27Z)
- Examining and Combating Spurious Features under Distribution Shift [94.31956965507085]
We define and analyze robust and spurious representations using the information-theoretic concept of minimal sufficient statistics.
We prove that even when there is only bias of the input distribution, models can still pick up spurious features from their training data.
Inspired by our analysis, we demonstrate that group DRO can fail when groups do not directly account for various spurious correlations.
arXiv Detail & Related papers (2021-06-14T05:39:09Z)
- Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class.
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
arXiv Detail & Related papers (2021-06-03T09:50:13Z)
- SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression [68.66245730450915]
We develop an improved method for debiasing predictions and estimating frequentist uncertainty for practical datasets.
Our main contribution is SLOE, an estimator of the signal strength with convergence guarantees that reduces the computation time of estimation and inference by orders of magnitude.
arXiv Detail & Related papers (2021-03-23T17:48:56Z)
- Distributed Learning of Finite Gaussian Mixtures [21.652015112462]
We study split-and-conquer approaches for the distributed learning of finite Gaussian mixtures.
The new estimator is shown to be consistent and to retain root-n consistency under some general conditions.
Experiments based on simulated and real-world data show that the proposed split-and-conquer approach has comparable statistical performance with the global estimator.
arXiv Detail & Related papers (2020-10-20T16:17:47Z)
- Matrix Completion with Quantified Uncertainty through Low Rank Gaussian Copula [30.84155327760468]
This paper proposes a framework for missing value imputation with quantified uncertainty.
The time required to fit the model scales linearly with the number of rows and the number of columns in the dataset.
Empirical results show the method yields state-of-the-art imputation accuracy across a wide range of data types.
arXiv Detail & Related papers (2020-06-18T19:51:42Z)