The Conditional Prediction Function: A Novel Technique to Control False
Discovery Rate for Complex Models
- URL: http://arxiv.org/abs/2310.04919v1
- Date: Sat, 7 Oct 2023 21:16:09 GMT
- Title: The Conditional Prediction Function: A Novel Technique to Control False
Discovery Rate for Complex Models
- Authors: Yushu Shi and Michael Martens
- Abstract summary: We introduce a knockoff statistic based on the conditional prediction function (CPF), which can pair with state-of-the-art machine learning predictive models.
CPF statistics can capture the nonlinear relationships between predictors and outcomes while also accounting for correlation between features.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In modern scientific research, the objective is often to identify which
variables are associated with an outcome among a large class of potential
predictors. This goal can be achieved by selecting variables in a manner that
controls the false discovery rate (FDR), the proportion of irrelevant
predictors among the selections. Knockoff filtering is a cutting-edge approach
to variable selection that provides FDR control. Existing knockoff statistics
frequently employ linear models to assess relationships between features and
the response, but the linearity assumption is often violated in real world
applications. This may result in poor power to detect truly prognostic
variables. We introduce a knockoff statistic based on the conditional
prediction function (CPF), which can pair with state-of-the-art machine learning
predictive models, such as deep neural networks. The CPF statistics can capture
the nonlinear relationships between predictors and outcomes while also
accounting for correlation between features. We illustrate the capability of
the CPF statistics to provide superior power over common knockoff statistics
with continuous, categorical, and survival outcomes using repeated simulations.
Knockoff filtering with the CPF statistics is demonstrated using (1) a
residential building dataset to select predictors for the actual sales prices
and (2) the TCGA dataset to select genes that are correlated with disease
staging in lung cancer patients.
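The knockoff filtering procedure described above can be sketched in a few lines. This is a simplified illustration, not the authors' CPF statistic: it assumes independent Gaussian features (so an exact model-X knockoff copy is just an independent draw from the same distribution) and uses a marginal-correlation importance statistic in place of a fitted machine learning model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 2000, 50, 0.2          # samples, features, target FDR level

# Simulate data: only the first 10 features truly affect y.
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:10] = 1.0
y = X @ beta + rng.standard_normal(n)

# With independent Gaussian features, an exact model-X knockoff
# is simply an independent copy drawn from the same distribution.
X_ko = rng.standard_normal((n, p))

# Importance statistic W_j: original minus knockoff importance.
# A CPF statistic would substitute a fitted predictive model here.
w = np.abs(X.T @ y) / n - np.abs(X_ko.T @ y) / n

# Knockoff+ threshold: smallest t whose estimated FDP is <= q.
tau = np.inf
for t in np.sort(np.abs(w)):
    fdp = (1 + np.sum(w <= -t)) / max(1, np.sum(w >= t))
    if fdp <= q:
        tau = t
        break

selected = np.flatnonzero(w >= tau)
print(selected)
```

With this signal strength, the selected set is dominated by the ten truly associated features while the estimated false discovery proportion stays below the target level q.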
Related papers
- False Discovery Rate Control via Bayesian Mirror Statistic [0.0]
We adapt the Mirror Statistic approach to False Discovery Rate (FDR) control into a Bayesian modelling framework.
We propose to rely on a Bayesian formulation of the model and use the posterior distributions of the coefficients of interest to build the Mirror Statistic.
We keep the approach scalable to high-dimensions by relying on Automatic Differentiation Variational Inference.
arXiv Detail & Related papers (2025-10-01T13:24:50Z)
- Model Correlation Detection via Random Selection Probing [62.093777777813756]
Existing similarity-based methods require access to model parameters or produce scores without thresholds.
We introduce Random Selection Probing (RSP), a hypothesis-testing framework that formulates model correlation detection as a statistical test.
RSP produces rigorous p-values that quantify evidence of correlation.
arXiv Detail & Related papers (2025-09-29T01:40:26Z)
- Probabilistic causal graphs as categorical data synthesizers: Do they do better than Gaussian Copulas and Conditional Tabular GANs? [0.0]
This study investigates the generation of high-quality synthetic categorical data, such as survey data, using causal graph models.
We used the categorical data that are based on the survey of accessibility to services for people with disabilities.
We created both SEM and BN models to represent causal relationships and to capture joint distributions between variables.
arXiv Detail & Related papers (2025-04-15T18:41:54Z)
- Regression-Based Estimation of Causal Effects in the Presence of Selection Bias and Confounding [52.1068936424622]
We consider the problem of estimating the expected causal effect $E[Y|do(X)]$ for a target variable $Y$ when treatment $X$ is set by intervention.
In settings without selection bias or confounding, $E[Y|do(X)] = E[Y|X]$, which can be estimated using standard regression methods.
We propose a framework that incorporates both selection bias and confounding.
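The identity quoted above, that E[Y|do(X)] = E[Y|X] in the absence of selection bias and confounding, can be checked with a minimal simulation of my own (not code from the paper): when X is assigned independently of everything else affecting Y, an ordinary regression slope recovers the interventional effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# No confounding: X is drawn independently of the noise in Y,
# so E[Y | do(X=x)] coincides with E[Y | X=x].
X = rng.normal(size=n)
Y = 2.0 * X + rng.normal(size=n)          # true causal effect: slope 2

# The least-squares slope estimates the causal effect in this setting.
slope = np.cov(X, Y)[0, 1] / np.var(X)
print(slope)
```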
arXiv Detail & Related papers (2025-03-26T13:43:37Z)
- Evidential time-to-event prediction model with well-calibrated uncertainty estimation [12.446406577462069]
We introduce an evidential regression model designed especially for time-to-event prediction tasks.
The most plausible event time is directly quantified by aggregated Gaussian random fuzzy numbers (GRFNs).
Our model achieves both accurate and reliable performance, outperforming state-of-the-art methods.
arXiv Detail & Related papers (2024-11-12T15:06:04Z)
- Interval Estimation of Coefficients in Penalized Regression Models of Insurance Data [3.5637073151604093]
The Tweedie exponential dispersion family is a popular choice for modeling insurance losses.
It is often important to obtain credibility (inference) of the most important features that describe the endogenous variables.
arXiv Detail & Related papers (2024-10-01T18:57:18Z)
- Error-based Knockoffs Inference for Controlled Feature Selection [49.99321384855201]
We propose an error-based knockoff inference method by integrating the knockoff features, the error-based feature importance statistics, and the stepdown procedure together.
The proposed inference procedure does not require specifying a regression model and can handle feature selection with theoretical guarantees.
arXiv Detail & Related papers (2022-03-09T01:55:59Z)
- Modeling High-Dimensional Data with Unknown Cut Points: A Fusion Penalized Logistic Threshold Regression [2.520538806201793]
In traditional logistic regression models, the link function is often assumed to be linear and continuous in predictors.
We consider a threshold model that all continuous features are discretized into ordinal levels, which further determine the binary responses.
We find the lasso model is well suited in the problem of early detection and prediction for chronic disease like diabetes.
arXiv Detail & Related papers (2022-02-17T04:16:40Z)
- Uncertainty Modeling for Out-of-Distribution Generalization [56.957731893992495]
We argue that the feature statistics can be properly manipulated to improve the generalization ability of deep learning models.
Common methods often consider the feature statistics as deterministic values measured from the learned features.
We improve the network generalization ability by modeling the uncertainty of domain shifts with synthesized feature statistics during training.
arXiv Detail & Related papers (2022-02-08T16:09:12Z)
- When in Doubt: Neural Non-Parametric Uncertainty Quantification for Epidemic Forecasting [70.54920804222031]
Most existing forecasting models disregard uncertainty quantification, resulting in mis-calibrated predictions.
Recent works in deep neural models for uncertainty-aware time-series forecasting also have several limitations.
We model the forecasting task as a probabilistic generative process and propose a functional neural process model called EPIFNP.
arXiv Detail & Related papers (2021-06-07T18:31:47Z)
- Multivariate Probabilistic Regression with Natural Gradient Boosting [63.58097881421937]
We propose a Natural Gradient Boosting (NGBoost) approach based on nonparametrically modeling the conditional parameters of the multivariate predictive distribution.
Our method is robust, works out-of-the-box without extensive tuning, is modular with respect to the assumed target distribution, and performs competitively in comparison to existing approaches.
arXiv Detail & Related papers (2021-06-07T17:44:49Z)
- SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression [68.66245730450915]
We develop an improved method for debiasing predictions and estimating frequentist uncertainty for practical datasets.
Our main contribution is SLOE, an estimator of the signal strength with convergence guarantees that reduces the computation time of estimation and inference by orders of magnitude.
arXiv Detail & Related papers (2021-03-23T17:48:56Z)
- Curse of Small Sample Size in Forecasting of the Active Cases in COVID-19 Outbreak [0.0]
During the COVID-19 pandemic, a massive number of attempts on the predictions of the number of cases and the other future trends of this pandemic have been made.
However, they fail to predict, in a reliable way, the medium and long term evolution of fundamental features of COVID-19 outbreak within acceptable accuracy.
This paper gives an explanation for the failure of machine learning models in this particular forecasting problem.
arXiv Detail & Related papers (2020-11-06T23:13:34Z)
- Causal Transfer Random Forest: Combining Logged Data and Randomized Experiments for Robust Prediction [8.736551469632758]
We describe a causal transfer random forest (CTRF) that combines existing training data with a small amount of data from a randomized experiment to train a model.
We evaluate the CTRF using both synthetic data experiments and real-world experiments in the Bing Ads platform.
arXiv Detail & Related papers (2020-10-17T03:54:37Z)
- Unlabelled Data Improves Bayesian Uncertainty Calibration under Covariate Shift [100.52588638477862]
We develop an approximate Bayesian inference scheme based on posterior regularisation.
We demonstrate the utility of our method in the context of transferring prognostic models of prostate cancer across globally diverse populations.
arXiv Detail & Related papers (2020-06-26T13:50:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.