Random features models: a way to study the success of naive imputation
- URL: http://arxiv.org/abs/2402.03839v1
- Date: Tue, 6 Feb 2024 09:37:06 GMT
- Title: Random features models: a way to study the success of naive imputation
- Authors: Alexis Ayme (LPSM (UMR 8001)), Claire Boyer (LPSM (UMR 8001), IUF),
Aymeric Dieuleveut (CMAP), Erwan Scornet (LPSM (UMR 8001))
- Abstract summary: Constant (naive) imputation is still widely used in practice as it is a first, easy-to-use technique to deal with missing data.
Recent works suggest that the bias induced by imputation is low in the context of high-dimensional linear predictors.
This paper confirms the intuition that the bias is negligible and that, surprisingly, naive imputation also remains relevant in very low dimension.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Constant (naive) imputation is still widely used in practice as it is a
first, easy-to-use technique to deal with missing data. Yet, this simple method
could be expected to induce a large bias for prediction purposes, as the
imputed input may strongly differ from the true underlying data. However,
recent works suggest that this bias is low in the context of high-dimensional
linear predictors when data is assumed to be missing completely at random
(MCAR). This paper completes the picture for linear predictors by confirming
the intuition that the bias is negligible and that, surprisingly, naive
imputation also remains relevant in very low dimension. To this end, we consider
a unique underlying random features model, which offers a rigorous framework
for studying predictive performance whilst the dimension of the observed
features varies. Building on these theoretical results, we establish
finite-sample bounds on stochastic gradient descent (SGD) predictors applied to
zero-imputed data, a strategy particularly well suited for large-scale
learning. While the MCAR assumption may appear strong, we show that similar
favorable behaviors occur for more complex missing-data scenarios.
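For intuition, the pipeline the abstract describes (replace missing entries with a constant, then run SGD on the imputed data) can be sketched in a few lines. The snippet below is a minimal illustration, assuming a plain Gaussian linear model, an MCAR mask with observation probability p, and constant-step-size Polyak-averaged SGD; it does not reproduce the paper's random features model or its exact step-size choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear model (an illustrative stand-in for the paper's
# random features model).
n, d = 10_000, 20
beta = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ beta + 0.1 * rng.normal(size=n)

# MCAR mask: each entry is observed independently with probability p.
p = 0.7
observed = rng.random((n, d)) < p

# Naive (zero) imputation: missing entries are replaced by the constant 0.
X_imp = np.where(observed, X, 0.0)

# Single-pass averaged SGD for least squares on the zero-imputed data.
theta = np.zeros(d)
theta_bar = np.zeros(d)  # Polyak-Ruppert average of the iterates
lr = 0.01
for t in range(n):
    x_t, y_t = X_imp[t], y[t]
    grad = (x_t @ theta - y_t) * x_t  # gradient of the squared loss
    theta -= lr * grad
    theta_bar += (theta - theta_bar) / (t + 1)

# Evaluate on fresh zero-imputed test points.
n_test = 2_000
X_test = rng.normal(size=(n_test, d))
obs_test = rng.random((n_test, d)) < p
y_test = X_test @ beta + 0.1 * rng.normal(size=n_test)
preds = np.where(obs_test, X_test, 0.0) @ theta_bar
print(f"test MSE on zero-imputed data: {np.mean((preds - y_test) ** 2):.4f}")
```

Under MCAR with centered covariates, the zero-imputed entries stay centered, which gives one intuition for why this naive pipeline incurs little bias; the paper quantifies the resulting excess risk as the observed dimension varies.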
Related papers
- Correcting Model Bias with Sparse Implicit Processes [0.9187159782788579]
We show that Sparse Implicit Processes (SIP) is capable of correcting model bias when the data generating mechanism differs strongly from the one implied by the model.
We use synthetic datasets to show that SIP is capable of providing predictive distributions that reflect the data better than the exact predictions of the initial, but wrongly assumed model.
arXiv Detail & Related papers (2022-07-21T18:00:01Z)
- Non-Volatile Memory Accelerated Posterior Estimation [3.4256231429537936]
Current machine learning models use only a single learnable parameter combination when making predictions.
We show that, through the use of high-capacity persistent storage, models whose posterior distributions were previously too large to approximate are now feasible.
arXiv Detail & Related papers (2022-02-21T20:25:57Z)
- Conformal prediction for the design problem [72.14982816083297]
In many real-world deployments of machine learning, we use a prediction algorithm to choose what data to test next.
In such settings, there is a distinct type of distribution shift between the training and test data.
We introduce a method to quantify predictive uncertainty in such settings.
arXiv Detail & Related papers (2022-02-08T02:59:12Z)
- Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce an importance-guided stochastic gradient descent (IGSGD) method to train models to perform inference directly from inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z)
- Latent Gaussian Model Boosting [0.0]
Tree-boosting shows excellent predictive accuracy on many data sets.
We obtain increased predictive accuracy compared to existing approaches in both simulated and real-world data experiments.
arXiv Detail & Related papers (2021-05-19T07:36:30Z)
- Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning [78.83598532168256]
Marginal-likelihood-based model selection is rarely used in deep learning due to estimation difficulties.
Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable.
arXiv Detail & Related papers (2021-04-11T09:50:24Z)
- Improving Uncertainty Calibration via Prior Augmented Data [56.88185136509654]
Neural networks have proven successful at learning from complex data distributions by acting as universal function approximators.
They are often overconfident in their predictions, which leads to inaccurate and miscalibrated probabilistic predictions.
We propose a solution by seeking out regions of feature space where the model is unjustifiably overconfident, and conditionally raising the entropy of those predictions towards that of the prior distribution of the labels (a toy sketch of this mixing step follows this entry).
arXiv Detail & Related papers (2021-02-22T07:02:37Z)
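As a rough illustration of the entropy-raising step described in the entry above, the snippet below mixes overconfident predictive distributions with the label prior; mixing with a higher-entropy distribution raises the entropy of the prediction. The confidence threshold, mixing weight alpha, and uniform prior are illustrative assumptions, not the paper's actual criterion or training procedure.

```python
import numpy as np

def raise_entropy(probs, prior, alpha):
    """Mix each predictive distribution with the label prior.

    probs : (n, k) predicted class probabilities
    prior : (k,) label prior distribution
    alpha : (n,) per-example mixing weight in [0, 1], larger where the
            model is judged unjustifiably overconfident
    """
    return (1.0 - alpha)[:, None] * probs + alpha[:, None] * prior[None, :]

# Toy usage: flag very confident predictions and soften them.
probs = np.array([[0.98, 0.01, 0.01],
                  [0.40, 0.35, 0.25]])
prior = np.full(3, 1.0 / 3.0)                         # uniform label prior
alpha = np.where(probs.max(axis=1) > 0.95, 0.5, 0.0)  # crude confidence test
print(raise_entropy(probs, prior, alpha))
```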
- Curse of Small Sample Size in Forecasting of the Active Cases in COVID-19 Outbreak [0.0]
During the COVID-19 pandemic, numerous attempts have been made to predict the number of cases and other future trends of the outbreak.
However, these attempts fail to reliably predict the medium- and long-term evolution of fundamental features of the COVID-19 outbreak within acceptable accuracy.
This paper gives an explanation for the failure of machine learning models in this particular forecasting problem.
arXiv Detail & Related papers (2020-11-06T23:13:34Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
- Ambiguity in Sequential Data: Predicting Uncertain Futures with Recurrent Models [110.82452096672182]
We propose an extension of the Multiple Hypothesis Prediction (MHP) model to handle ambiguous predictions with sequential data.
We also introduce a novel metric for ambiguous problems, which is better suited to account for uncertainties.
arXiv Detail & Related papers (2020-03-10T09:15:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.