Modeling High-Dimensional Data with Unknown Cut Points: A Fusion Penalized Logistic Threshold Regression
- URL: http://arxiv.org/abs/2202.08441v1
- Date: Thu, 17 Feb 2022 04:16:40 GMT
- Title: Modeling High-Dimensional Data with Unknown Cut Points: A Fusion Penalized Logistic Threshold Regression
- Authors: Yinan Lin, Wen Zhou, Zhi Geng, Gexin Xiao, and Jianxin Yin
- Abstract summary: In traditional logistic regression models, the link function is often assumed to be linear and continuous in predictors.
We consider a threshold model in which all continuous features are discretized into ordinal levels, which in turn determine the binary responses.
We find the FILTER model is well suited to the early detection and prediction of chronic diseases such as diabetes.
- Score: 2.520538806201793
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In traditional logistic regression models, the link function is often assumed
to be linear and continuous in predictors. Here, we consider a threshold model
in which all continuous features are discretized into ordinal levels, which in
turn determine the binary responses. Both the threshold points and regression
coefficients are unknown and must be estimated. For high-dimensional data, we
propose a fusion penalized logistic threshold regression (FILTER) model, where
a fused lasso penalty is employed to control the total variation and shrink
coefficients to zero as a method of variable selection. Under mild conditions
on the estimates of the unknown threshold points, we establish a non-asymptotic
error bound for coefficient estimation and prove model selection consistency.
With a careful characterization of the error propagation, we also show that
tree-based methods, such as CART, fulfill these threshold estimation
conditions. We find the FILTER model well suited to the early detection and
prediction of chronic diseases such as diabetes using physical examination
data. The finite-sample behavior of our proposed method is also explored
through extensive Monte Carlo studies, which support our theoretical findings.
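To make the pipeline concrete, below is a minimal sketch of the FILTER idea, assuming CART-estimated thresholds for each feature; it uses the standard reparameterization in which an L1 penalty on step-indicator (increment) coefficients acts as a fused-lasso / total-variation penalty across a feature's ordered levels. All names and data are illustrative, not the authors' code.

```python
# Minimal sketch of the FILTER idea (illustrative, not the authors' code).
# Step 1: estimate candidate thresholds per feature with a shallow CART.
# Step 2: encode each feature as step indicators 1(x >= t); an L1 penalty on
#         these increment coefficients corresponds to a total-variation
#         (fused lasso) penalty on the per-level coefficients.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                      # toy data
y = (X[:, 0] > 0.5).astype(int) ^ (X[:, 1] > -0.3).astype(int)

def cart_thresholds(x, y, max_depth=2):
    """Split points of a depth-limited CART fit on a single feature."""
    tree = DecisionTreeClassifier(max_depth=max_depth).fit(x.reshape(-1, 1), y)
    t = tree.tree_.threshold
    return np.sort(t[t != -2])                      # -2 marks leaf nodes

# Build the step-indicator design: one column per (feature, threshold) pair.
cols, names = [], []
for j in range(X.shape[1]):
    for t in cart_thresholds(X[:, j], y):
        cols.append((X[:, j] >= t).astype(float))
        names.append((j, round(float(t), 3)))
Z = np.column_stack(cols)

# L1 on increments == fused lasso on the ordinal-level coefficients.
clf = LogisticRegression(penalty="l1", C=0.3, solver="liblinear").fit(Z, y)
kept = [n for n, b in zip(names, clf.coef_.ravel()) if abs(b) > 1e-8]
print("selected (feature, threshold) pairs:", kept)
```

Since each increment coefficient measures the jump between adjacent ordinal levels, shrinking it to zero merges neighboring levels, and zeroing all of a feature's increments removes the feature entirely.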
Related papers
- Risk and cross validation in ridge regression with correlated samples [72.59731158970894]
We characterize the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations.
We further extend our analysis to the case where the test point has non-trivial correlations with the training set, a setting often encountered in time series forecasting.
We validate our theory across a variety of high dimensional data.
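As a purely empirical companion to these claims (a toy Monte Carlo, not the paper's closed-form theory), the sketch below contrasts in-sample and out-of-sample ridge risk when the training rows share an AR(1) correlation; all parameter choices are illustrative.

```python
# Toy Monte Carlo: in- vs. out-of-sample risk of ridge regression when the
# *rows* of the training data are correlated (AR(1) correlation across samples).
import numpy as np

rng = np.random.default_rng(1)
n, d, lam, rho = 200, 50, 1.0, 0.8
beta = rng.normal(size=d) / np.sqrt(d)

# AR(1) correlation across the n samples, applied to both X rows and noise.
C = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
L = np.linalg.cholesky(C)

def one_trial():
    X = L @ rng.normal(size=(n, d))
    eps = L @ rng.normal(size=n)
    y = X @ beta + eps
    bhat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    in_risk = np.mean((y - X @ bhat) ** 2)
    x_new = rng.normal(size=d)                  # independent test point
    out_risk = (x_new @ beta - x_new @ bhat) ** 2 + 1.0  # + test-noise variance
    return in_risk, out_risk

risks = np.array([one_trial() for _ in range(200)])
print("in-sample risk ~", risks[:, 0].mean(),
      " out-of-sample risk ~", risks[:, 1].mean())
```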
arXiv Detail & Related papers (2024-08-08T17:27:29Z)
- Semi-supervised Regression Analysis with Model Misspecification and High-dimensional Data [8.619243141968886]
We present an inference framework for estimating regression coefficients in conditional mean models.
We develop an augmented inverse probability weighted (AIPW) method, employing regularized estimators for both propensity score (PS) and outcome regression (OR) models.
Our theoretical findings are verified through extensive simulation studies and a real-world data application.
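For intuition, here is a minimal AIPW sketch for estimating a population mean with missing outcomes, with an L1-regularized logistic fit standing in for the PS model and a lasso for the OR model; the simulated design is an assumption for illustration, not the paper's setup.

```python
# Minimal AIPW sketch for a mean with missing outcomes, using regularized
# propensity-score (PS) and outcome-regression (OR) fits.
import numpy as np
from sklearn.linear_model import LogisticRegression, Lasso

rng = np.random.default_rng(2)
n, d = 2000, 20
X = rng.normal(size=(n, d))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)
ps_true = 1 / (1 + np.exp(-0.5 * X[:, 0]))        # labeling propensity
R = rng.binomial(1, ps_true)                       # R=1: outcome observed

ps = LogisticRegression(penalty="l1", C=1.0, solver="liblinear") \
        .fit(X, R).predict_proba(X)[:, 1]          # regularized PS model
m = Lasso(alpha=0.01).fit(X[R == 1], y[R == 1]).predict(X)  # regularized OR model

# AIPW estimator of E[Y]: doubly robust in the PS and OR models.
# When R=0 the correction term is zero, so unobserved outcomes never enter.
aipw = np.mean(m + R * (y - m) / ps)
print("AIPW estimate of E[Y]:", aipw,
      " naive labeled-only mean:", y[R == 1].mean())
```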
arXiv Detail & Related papers (2024-06-20T00:34:54Z)
- Sample Complexity Bounds for Score-Matching: Causal Discovery and Generative Modeling [82.36856860383291]
We demonstrate that accurate estimation of the score function is achievable by training a standard deep ReLU neural network.
We establish bounds on the error rate of recovering causal relationships using the score-matching-based causal discovery method.
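As a toy illustration of score estimation with a ReLU network, the sketch below trains a small MLP with a denoising score-matching objective; this objective is assumed here for concreteness and need not match the paper's exact training setup.

```python
# Denoising score matching: train a ReLU MLP so that s(x + noise) matches
# -noise / sigma^2, whose population minimizer is the score of the
# noise-smoothed data density (Vincent, 2011). Illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4096, 2) @ torch.tensor([[1.0, 0.5], [0.0, 0.8]])  # toy data
sigma = 0.1

net = nn.Sequential(nn.Linear(2, 128), nn.ReLU(),
                    nn.Linear(128, 128), nn.ReLU(),
                    nn.Linear(128, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(500):
    noise = sigma * torch.randn_like(x)
    target = -noise / sigma**2            # score of the Gaussian smoothing
    loss = ((net(x + noise) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final DSM loss:", loss.item())
```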
arXiv Detail & Related papers (2023-10-27T13:09:56Z)
- Convergence of uncertainty estimates in Ensemble and Bayesian sparse model discovery [4.446017969073817]
We show empirical success in terms of accuracy and robustness to noise with a bootstrapping-based sequential thresholding least-squares estimator.
We show that this bootstrapping-based ensembling technique can perform a provably correct variable selection procedure, with the selection error vanishing at an exponential rate.
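The following compact sketch conveys the idea in the spirit of ensemble SINDy: rerun sequential thresholded least squares (STLSQ) on bootstrap resamples and keep variables whose inclusion frequency clears a vote threshold. Data, thresholds, and vote levels are illustrative.

```python
# Bootstrap-ensembled sequential thresholded least squares (STLSQ) sketch.
import numpy as np

rng = np.random.default_rng(3)
n, d = 300, 10
X = rng.normal(size=(n, d))
y = X @ np.array([1.5, 0, 0, -2.0] + [0] * 6) + 0.1 * rng.normal(size=n)

def stlsq(X, y, thresh=0.2, iters=10):
    """Sequentially threshold small least-squares coefficients to zero."""
    active = np.ones(X.shape[1], dtype=bool)
    coef = np.zeros(X.shape[1])
    for _ in range(iters):
        coef[:] = 0.0
        coef[active] = np.linalg.lstsq(X[:, active], y, rcond=None)[0]
        active = np.abs(coef) > thresh
    return coef

votes = np.zeros(d)
for _ in range(100):                       # bootstrap ensemble
    idx = rng.integers(0, n, size=n)
    votes += np.abs(stlsq(X[idx], y[idx])) > 0
selected = np.where(votes / 100 > 0.9)[0]  # high-inclusion variables
print("selected variables:", selected)     # expect {0, 3}
```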
arXiv Detail & Related papers (2023-01-30T04:07:59Z)
- The Implicit Delta Method [61.36121543728134]
In this paper, we propose an alternative, the implicit delta method, which works by infinitesimally regularizing the training loss in order to quantify downstream uncertainty.
We show that the change in the evaluation due to this regularization is a consistent estimator of the variance of the evaluation, even when the infinitesimal change is approximated by a finite difference.
arXiv Detail & Related papers (2022-11-11T19:34:17Z)
- Adaptive LASSO estimation for functional hidden dynamic geostatistical model [69.10717733870575]
We propose a novel model selection algorithm based on a penalized maximum likelihood estimator (PMLE) for functional hidden dynamic geostatistical models (f-HDGM).
The algorithm is based on iterative optimisation and uses an adaptive least absolute shrinkage and selection operator (GMSOLAS) penalty function, wherein the weights are obtained from the unpenalised f-HDGM maximum-likelihood estimators.
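For readers unfamiliar with adaptive weighting, here is a generic adaptive-LASSO sketch in which ordinary least squares stands in for the unpenalised f-HDGM likelihood fit: penalty weights come from the unpenalized pilot estimates and are absorbed by rescaling columns before a standard lasso.

```python
# Generic adaptive LASSO: weight each coefficient's L1 penalty by the inverse
# of an unpenalized pilot estimate, via a column-rescaling trick.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(4)
n, d = 200, 8
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, 0, 0, 1.0, 0, 0, 0, 0]) + 0.3 * rng.normal(size=n)

pilot = LinearRegression().fit(X, y).coef_        # unpenalized pilot estimates
w = 1.0 / np.maximum(np.abs(pilot), 1e-6)         # adaptive penalty weights
Xw = X / w                                        # absorb weights into columns

lasso = Lasso(alpha=0.05).fit(Xw, y)
beta = lasso.coef_ / w                            # undo the rescaling
print("adaptive-lasso coefficients:", np.round(beta, 2))
```

The rescaling works because penalizing gamma_j = w_j * beta_j with a plain L1 norm, while fitting on columns x_j / w_j, is exactly the weighted penalty sum over w_j * |beta_j|.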
arXiv Detail & Related papers (2022-08-10T19:17:45Z)
- Benign-Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression [14.493176427999028]
We study the benign overfitting theory in the prediction of the conditional average treatment effect (CATE) with linear regression models.
We show that the T-learner fails to achieve consistency except under random assignment, while the IPW-learner's risk converges to zero if the propensity score is known.
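A toy simulation makes the contrast concrete: the T-learner fits separate arm-wise regressions, while the IPW-learner regresses the inverse-propensity-weighted pseudo-outcome on X, whose conditional mean equals the CATE when the propensity score is known. The data-generating process below is assumed for illustration.

```python
# T-learner vs. IPW-learner for CATE estimation with linear models.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n, d = 5000, 5
X = rng.normal(size=(n, d))
e = 1 / (1 + np.exp(-X[:, 0]))                   # known propensity score
A = rng.binomial(1, e)
tau = X @ np.array([1.0, -1.0, 0, 0, 0])         # true CATE
y = X[:, 1] + A * tau + rng.normal(size=n)

# T-learner: separate regressions on treated and control arms.
mu1 = LinearRegression().fit(X[A == 1], y[A == 1]).predict(X)
mu0 = LinearRegression().fit(X[A == 0], y[A == 0]).predict(X)

# IPW-learner: single regression on the IPW pseudo-outcome, whose
# conditional expectation given X is the CATE.
pseudo = A * y / e - (1 - A) * y / (1 - e)
tau_ipw = LinearRegression().fit(X, pseudo).predict(X)

print("T-learner MSE:  ", np.mean((mu1 - mu0 - tau) ** 2))
print("IPW-learner MSE:", np.mean((tau_ipw - tau) ** 2))
```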
arXiv Detail & Related papers (2022-02-10T18:51:52Z)
- On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
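The double-descent curve is easy to reproduce in a toy random-features experiment; in the sketch below a min-norm least-squares fit stands in for the SGD training analyzed above, sweeping the feature count p through the interpolation point p = n.

```python
# Toy double descent for random-features (RF) ridgeless regression: test
# error peaks near the interpolation threshold p = n, then descends again.
import numpy as np

rng = np.random.default_rng(6)
n, d, n_test = 100, 10, 1000
X, Xt = rng.normal(size=(n, d)), rng.normal(size=(n_test, d))
w = rng.normal(size=d)
y, yt = X @ w + 0.1 * rng.normal(size=n), Xt @ w

for p in [20, 50, 90, 100, 110, 200, 800]:
    W = rng.normal(size=(d, p)) / np.sqrt(d)     # fixed random first layer
    F, Ft = np.maximum(X @ W, 0), np.maximum(Xt @ W, 0)  # ReLU features
    a = np.linalg.lstsq(F, y, rcond=None)[0]     # min-norm interpolant
    print(f"p={p:4d}  test MSE={np.mean((Ft @ a - yt) ** 2):.3f}")
```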
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
- Statistical Inference for High-Dimensional Linear Regression with Blockwise Missing Data [13.48481978963297]
Blockwise missing data occurs when we integrate multisource or multimodality data where different sources or modalities contain complementary information.
We propose a computationally efficient estimator for the regression coefficient vector based on carefully constructed unbiased estimating equations.
Numerical studies and application analysis of the Alzheimer's Disease Neuroimaging Initiative data show that the proposed method performs better and benefits more from unsupervised samples than existing methods.
arXiv Detail & Related papers (2021-06-07T05:12:42Z)
- Calibration of prediction rules for life-time outcomes using prognostic Cox regression survival models and multiple imputations to account for missing predictor data with cross-validatory assessment [0.0]
Methods are described to combine imputation with predictive calibration in survival modeling subject to censoring.
Prediction-averaging appears to have superior statistical properties, especially smaller predictive variation, compared with a direct application of Rubin's rules.
arXiv Detail & Related papers (2021-05-04T20:10:12Z)
- Leveraging Global Parameters for Flow-based Neural Posterior Estimation [90.21090932619695]
Inferring the parameters of a model based on experimental observations is central to the scientific method.
A particularly challenging setting is when the model is strongly indeterminate, i.e., when distinct sets of parameters yield identical observations.
We present a method for cracking such indeterminacy by exploiting additional information conveyed by an auxiliary set of observations sharing global parameters.
arXiv Detail & Related papers (2021-02-12T12:23:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.