Variable selection with missing data in both covariates and outcomes:
Imputation and machine learning
- URL: http://arxiv.org/abs/2104.02769v1
- Date: Tue, 6 Apr 2021 20:18:29 GMT
- Title: Variable selection with missing data in both covariates and outcomes:
Imputation and machine learning
- Authors: Liangyuan Hu and Jung-Yi Joyce Lin and Jiayi Ji
- Abstract summary: The missing data issue is ubiquitous in health studies.
Machine learning methods weaken parametric assumptions.
XGBoost and BART have the overall best performance across various settings.
- Score: 1.0333430439241666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The missing data issue is ubiquitous in health studies. Variable selection in
the presence of both missing covariates and outcomes is an important
statistical research topic but has been less studied. Existing literature
focuses on parametric regression techniques that provide direct parameter
estimates of the regression model. In practice, parametric regression models
are often sub-optimal for variable selection because they are susceptible to
misspecification. Machine learning methods considerably weaken the parametric
assumptions and increase modeling flexibility, but do not provide a variable
importance measure as naturally defined as the covariate effects native to
parametric models. We investigate a general variable selection approach when
both the covariates and outcomes can be missing at random and have general
missing data patterns. This approach exploits the flexibility of machine
learning modeling techniques and bootstrap imputation, which is amenable to
nonparametric methods in which the covariate effects are not directly
available. We conduct expansive simulations investigating the practical
operating characteristics of the proposed variable selection approach, when
combined with four tree-based machine learning methods, XGBoost, Random
Forests, Bayesian Additive Regression Trees (BART) and Conditional Random
Forests, and two commonly used parametric methods, lasso and backward stepwise
selection. Numeric results show XGBoost and BART have the overall best
performance across various settings. Guidance for choosing methods appropriate
to the structure of the analysis data at hand is discussed. We further
demonstrate the methods via a case study of risk factors for 3-year incidence
of metabolic syndrome with data from the Study of Women's Health Across the
Nation.
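The abstract does not spell out the algorithm, so the following is only a minimal sketch of how bootstrap imputation might be combined with variable selection, under loudly stated assumptions. Each bootstrap sample is imputed once, a selector is run on the completed data, and a variable is kept if it is selected in at least a fraction pi of the bootstrap samples. Mean imputation stands in for a proper imputation model, and absolute correlation with the outcome stands in for a machine learning importance measure such as one from XGBoost or BART. The function names, the threshold, the fraction pi, and the simulated data are all hypothetical and are not taken from the paper.

```python
import random
from statistics import mean


def impute_mean(rows):
    """Fill None entries with the observed column mean (stand-in imputer, an assumption)."""
    filled = [list(r) for r in rows]
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        fill = mean(observed) if observed else 0.0
        for r in filled:
            if r[j] is None:
                r[j] = fill
    return filled


def abs_corr(x, y):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def select_once(rows, threshold):
    """Stand-in selector: keep covariates whose |corr| with the outcome (last column) >= threshold."""
    y = [r[-1] for r in rows]
    n_cov = len(rows[0]) - 1
    return {j for j in range(n_cov) if abs_corr([r[j] for r in rows], y) >= threshold}


def bootstrap_select(rows, n_boot=100, threshold=0.3, pi=0.5, seed=0):
    """Bootstrap, impute each sample, select, then keep variables chosen in >= pi of samples."""
    rng = random.Random(seed)
    n_cov = len(rows[0]) - 1
    counts = [0] * n_cov
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]  # resample rows with replacement
        completed = impute_mean(sample)            # impute covariates and outcome
        for j in select_once(completed, threshold):
            counts[j] += 1
    return [j for j in range(n_cov) if counts[j] / n_boot >= pi]


def make_demo_data(n=200, miss=0.1, seed=1):
    """Simulated data: covariate 0 drives the outcome, covariate 1 is pure noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x0, x1 = rng.gauss(0, 1), rng.gauss(0, 1)
        y = 2.0 * x0 + rng.gauss(0, 0.5)
        # Each value (covariates and outcome) goes missing independently.
        rows.append([v if rng.random() > miss else None for v in (x0, x1, y)])
    return rows


selected = bootstrap_select(make_demo_data())
print("selected covariate indices:", selected)
```

In this sketch the aggregation over bootstrap samples is what makes the approach usable with methods that have no regression coefficients: only the per-sample selection decisions are combined, so any selector could be swapped in for select_once.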
Related papers
- Seeing Unseen: Discover Novel Biomedical Concepts via
Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z)
- Comparative Analysis of Data Preprocessing Methods, Feature Selection
Techniques and Machine Learning Models for Improved Classification and
Regression Performance on Imbalanced Genetic Data [0.0]
We investigated the effects of data preprocessing, feature selection techniques, and model selection on the performance of models trained on genetic datasets.
We found that outliers/skew in predictor or target variables did not pose a challenge to regression models.
We also found that class-imbalanced target variables and skewed predictors had little to no impact on classification performance.
arXiv Detail & Related papers (2024-02-22T21:41:27Z)
- Causal Feature Selection via Transfer Entropy [59.999594949050596]
Causal discovery aims to identify causal relationships between features with observational data.
We introduce a new causal feature selection approach that relies on the forward and backward feature selection procedures.
We provide theoretical guarantees on the regression and classification errors for both the exact and the finite-sample cases.
arXiv Detail & Related papers (2023-10-17T08:04:45Z)
- Selective Nonparametric Regression via Testing [54.20569354303575]
We develop an abstention procedure via testing the hypothesis on the value of the conditional variance at a given point.
Unlike existing methods, the proposed one accounts not only for the value of the variance itself but also for the uncertainty of the corresponding variance predictor.
arXiv Detail & Related papers (2023-09-28T13:04:11Z)
- Toward Physically Plausible Data-Driven Models: A Novel Neural Network
Approach to Symbolic Regression [2.7071541526963805]
This paper proposes a novel neural network-based symbolic regression method.
It constructs physically plausible models based on even very small training data sets and prior knowledge about the system.
We experimentally evaluate the approach on four test systems: the TurtleBot 2 mobile robot, the magnetic manipulation system, the equivalent resistance of two resistors in parallel, and the longitudinal force of the anti-lock braking system.
arXiv Detail & Related papers (2023-02-01T22:05:04Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing
Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Flexible variable selection in the presence of missing data [0.0]
We propose a non-parametric variable selection algorithm combined with multiple imputation to develop flexible panels in the presence of missing-at-random data.
We show that our proposal has good operating characteristics and results in panels with higher classification and variable selection performance.
arXiv Detail & Related papers (2022-02-25T21:41:03Z)
- Nonparametric Functional Analysis of Generalized Linear Models Under
Nonlinear Constraints [0.0]
This article introduces a novel nonparametric methodology for Generalized Linear Models.
It combines the strengths of the binary regression and latent variable formulations for categorical data.
It extends recently published parametric versions of the methodology and generalizes it.
arXiv Detail & Related papers (2021-10-11T04:49:59Z)
- An interpretable prediction model for longitudinal dispersion
coefficient in natural streams based on evolutionary symbolic regression
network [30.99493442296212]
Various methods have been proposed for predicting the longitudinal dispersion coefficient (LDC).
In this paper, we first present an in-depth analysis of those methods and identify their defects.
We then design a novel symbolic regression method called evolutionary symbolic regression network (ESRN).
arXiv Detail & Related papers (2021-06-17T07:06:05Z)
- An Optimal Control Approach to Learning in SIDARTHE Epidemic model [67.22168759751541]
We propose a general approach for learning time-variant parameters of dynamic compartmental models from epidemic data.
We forecast the epidemic evolution in Italy and France.
arXiv Detail & Related papers (2020-10-28T10:58:59Z)
- Two-step penalised logistic regression for multi-omic data with an
application to cardiometabolic syndrome [62.997667081978825]
We implement a two-step approach to multi-omic logistic regression in which variable selection is performed on each layer separately.
Our approach should be preferred if the goal is to select as many relevant predictors as possible.
Our proposed approach allows us to identify features that characterise cardiometabolic syndrome at the molecular level.
arXiv Detail & Related papers (2020-08-01T10:36:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.