Benchmarking missing-values approaches for predictive models on health
databases
- URL: http://arxiv.org/abs/2202.10580v1
- Date: Thu, 17 Feb 2022 09:40:04 GMT
- Title: Benchmarking missing-values approaches for predictive models on health
databases
- Authors: Alexandre Perez-Lebel (MNI, MILA, PARIETAL), Gaël Varoquaux (MNI,
MILA, PARIETAL), Marine Le Morvan (PARIETAL), Julie Josse (CRISAM, IDESP),
Jean-Baptiste Poline (MNI)
- Abstract summary: We conduct a benchmark of missing-values strategies in predictive models with a focus on large health databases.
We find that native support for missing values in supervised machine learning predicts better than state-of-the-art imputation with much less computational cost.
- Score: 47.187609203210705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: BACKGROUND: As databases grow larger, it becomes harder to fully control
their collection, and they frequently come with missing values: incomplete
observations. These large databases are well suited to train machine-learning
models, for instance for forecasting or to extract biomarkers in biomedical
settings. Such predictive approaches can use discriminative -- rather than
generative -- modeling, and thus open the door to new missing-values
strategies. Yet existing empirical evaluations of strategies to handle missing
values have focused on inferential statistics. RESULTS: Here we conduct a
systematic benchmark of missing-values strategies in predictive models with a
focus on large health databases: four electronic health record datasets, a
population brain-imaging dataset, a health survey, and two intensive-care datasets.
Using gradient-boosted trees, we compare native support for missing values with
simple and state-of-the-art imputation prior to learning. We investigate
prediction accuracy and computational time. For prediction after imputation, we
find that adding an indicator to express which values have been imputed is
important, suggesting that the data are missing not at random. Elaborate
missing-values imputation can improve prediction compared to simple strategies,
but requires longer computational time on large data. Learning trees that model
missing values (with the missing incorporated attribute, MIA) leads to robust,
fast, and well-performing predictive modeling. CONCLUSIONS: Native support for missing
values in supervised machine learning predicts better than state-of-the-art
imputation with much less computational cost. When using imputation, it is
important to add indicator columns expressing which values have been imputed.
Related papers
- Imputation for prediction: beware of diminishing returns [12.424671213282256]
Missing values are prevalent across various fields, posing challenges for training and deploying predictive models.
Recent theoretical and empirical studies indicate that simple constant imputation can be consistent and competitive.
This study aims at clarifying if and when investing in advanced imputation methods yields significantly better predictions.
arXiv Detail & Related papers (2024-07-29T09:01:06Z) - ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z) - PROMISSING: Pruning Missing Values in Neural Networks [0.0]
We propose a simple and intuitive yet effective method for pruning missing values (PROMISSING) during learning and inference steps in neural networks.
Our experiments show that PROMISSING results in similar prediction performance compared to various imputation techniques.
arXiv Detail & Related papers (2022-06-03T15:37:27Z) - Minimax rate of consistency for linear models with missing values [0.0]
Missing values arise in most real-world data sets due to the aggregation of multiple sources and intrinsically missing information (sensor failure, unanswered questions in surveys...).
In this paper, we focus on the extensively studied linear models, but in the presence of missing values, which turns out to be quite a challenging task.
This eventually requires solving a number of learning tasks that is exponential in the number of input features, which makes prediction impossible for current real-world datasets.
arXiv Detail & Related papers (2022-02-03T08:45:34Z) - X-model: Improving Data Efficiency in Deep Learning with A Minimax Model [78.55482897452417]
We aim at improving data efficiency for both classification and regression setups in deep learning.
To take the power of both worlds, we propose a novel X-model.
X-model plays a minimax game between the feature extractor and task-specific heads.
arXiv Detail & Related papers (2021-10-09T13:56:48Z) - Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance-guided stochastic gradient descent (IGSGD) method to train prediction models on inputs containing missing values without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z) - Flexible Model Aggregation for Quantile Regression [92.63075261170302]
Quantile regression is a fundamental problem in statistical learning motivated by a need to quantify uncertainty in predictions.
We investigate methods for aggregating any number of conditional quantile models.
All of the models we consider in this paper can be fit using modern deep learning toolkits.
arXiv Detail & Related papers (2021-02-26T23:21:16Z) - Performance metrics for intervention-triggering prediction models do not reflect an expected reduction in outcomes from using the model [71.9860741092209]
Clinical researchers often select among and evaluate risk prediction models.
Standard metrics calculated from retrospective data are only related to model utility under certain assumptions.
When predictions are delivered repeatedly throughout time, the relationship between standard metrics and utility is further complicated.
arXiv Detail & Related papers (2020-06-02T16:26:49Z) - On the consistency of supervised learning with missing values [15.666860186278782]
In many application settings, the data have missing entries which make analysis challenging.
Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data.
We show that the widely used method of imputing with a constant, such as the mean, prior to learning is consistent when missing values are not informative.
arXiv Detail & Related papers (2019-02-19T07:27:19Z)
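Several of the entries above turn on the same baseline: imputing missing entries with a constant (or the mean) before learning. That baseline is a one-liner with scikit-learn's `SimpleImputer`; this is a minimal sketch, where the tiny array and the fill value 0.0 are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [np.nan, 3.0],
              [2.0, 4.0]])

# Constant imputation: replace every missing entry with a fixed value.
const = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)

# Mean imputation with indicator columns appended, so a downstream learner
# can still see which entries were originally missing.
mean_ind = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)

print(const)            # NaNs replaced by 0.0
print(mean_ind.shape)   # (3, 4): two imputed features + two indicator columns
```

The indicator columns are what make mean imputation usable when missingness is informative: without them, the learner cannot distinguish an observed mean value from an imputed one.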
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.