The Conditional Prediction Function: A Novel Technique to Control False
Discovery Rate for Complex Models
- URL: http://arxiv.org/abs/2310.04919v1
- Date: Sat, 7 Oct 2023 21:16:09 GMT
- Title: The Conditional Prediction Function: A Novel Technique to Control False
Discovery Rate for Complex Models
- Authors: Yushu Shi and Michael Martens
- Abstract summary: We introduce a knockoff statistic based on the conditional prediction function (CPF), which can pair with state-of-the-art machine learning predictive models.
CPF statistics can capture the nonlinear relationships between predictors and outcomes while also accounting for correlation between features.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In modern scientific research, the objective is often to identify which
variables are associated with an outcome among a large class of potential
predictors. This goal can be achieved by selecting variables in a manner that
controls the false discovery rate (FDR), the proportion of irrelevant
predictors among the selections. Knockoff filtering is a cutting-edge approach
to variable selection that provides FDR control. Existing knockoff statistics
frequently employ linear models to assess relationships between features and
the response, but the linearity assumption is often violated in real world
applications. This may result in poor power to detect truly prognostic
variables. We introduce a knockoff statistic based on the conditional
prediction function (CPF), which can pair with state-of-the-art machine learning
predictive models, such as deep neural networks. The CPF statistics can capture
the nonlinear relationships between predictors and outcomes while also
accounting for correlation between features. We illustrate the capability of
the CPF statistics to provide superior power over common knockoff statistics
with continuous, categorical, and survival outcomes using repeated simulations.
Knockoff filtering with the CPF statistics is demonstrated using (1) a
residential building dataset to select predictors for the actual sales prices
and (2) the TCGA dataset to select genes that are correlated with disease
staging in lung cancer patients.
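The knockoff filtering procedure described above can be sketched in a few lines. This is a simplified illustration, not the authors' CPF statistic: it assumes independent Gaussian features (so an exact model-X knockoff copy is just an independent draw from the same distribution) and uses a marginal-correlation importance statistic in place of a fitted machine learning model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 2000, 50, 0.2          # samples, features, target FDR level

# Simulate data: only the first 10 features truly affect y.
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:10] = 1.0
y = X @ beta + rng.standard_normal(n)

# With independent Gaussian features, an exact model-X knockoff
# is simply an independent copy drawn from the same distribution.
X_ko = rng.standard_normal((n, p))

# Importance statistic W_j: original minus knockoff importance.
# A CPF statistic would substitute a fitted predictive model here.
w = np.abs(X.T @ y) / n - np.abs(X_ko.T @ y) / n

# Knockoff+ threshold: smallest t whose estimated FDP is <= q.
tau = np.inf
for t in np.sort(np.abs(w)):
    fdp = (1 + np.sum(w <= -t)) / max(1, np.sum(w >= t))
    if fdp <= q:
        tau = t
        break

selected = np.flatnonzero(w >= tau)
print(selected)
```

With this signal strength, the selected set is dominated by the ten truly associated features while the estimated false discovery proportion stays below the target level q.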
Related papers
- False Discovery Rate Control via Bayesian Mirror Statistic [0.0]
We adapt the Mirror Statistic approach to False Discovery Rate (FDR) control into a Bayesian modelling framework.
We propose to rely on a Bayesian formulation of the model and use the posterior distributions of the coefficients of interest to build the Mirror Statistic.
We keep the approach scalable to high-dimensions by relying on Automatic Differentiation Variational Inference.
arXiv Detail & Related papers (2025-10-01T13:24:50Z)
- Model Correlation Detection via Random Selection Probing [62.093777777813756]
Existing similarity-based methods require access to model parameters or produce scores without thresholds.
We introduce Random Selection Probing (RSP), a hypothesis-testing framework that formulates model correlation detection as a statistical test.
RSP produces rigorous p-values that quantify evidence of correlation.
arXiv Detail & Related papers (2025-09-29T01:40:26Z)
- Probabilistic causal graphs as categorical data synthesizers: Do they do better than Gaussian Copulas and Conditional Tabular GANs? [0.0]
This study investigates the generation of high-quality synthetic categorical data, such as survey data, using causal graph models.
We used the categorical data that are based on the survey of accessibility to services for people with disabilities.
We created both SEM and BN models to represent causal relationships and to capture joint distributions between variables.
arXiv Detail & Related papers (2025-04-15T18:41:54Z)
- Regression-Based Estimation of Causal Effects in the Presence of Selection Bias and Confounding [52.1068936424622]
We consider the problem of estimating the expected causal effect $E[Y|do(X)]$ for a target variable $Y$ when treatment $X$ is set by intervention.
In settings without selection bias or confounding, $E[Y|do(X)] = E[Y|X]$, which can be estimated using standard regression methods.
We propose a framework that incorporates both selection bias and confounding.
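The identity quoted above, that E[Y|do(X)] = E[Y|X] in the absence of selection bias and confounding, can be checked with a minimal simulation of my own (not code from the paper): when X is assigned independently of everything else affecting Y, an ordinary regression slope recovers the interventional effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# No confounding: X is drawn independently of the noise in Y,
# so E[Y | do(X=x)] coincides with E[Y | X=x].
X = rng.normal(size=n)
Y = 2.0 * X + rng.normal(size=n)          # true causal effect: slope 2

# The least-squares slope estimates the causal effect in this setting.
slope = np.cov(X, Y)[0, 1] / np.var(X)
print(slope)
```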
arXiv Detail & Related papers (2025-03-26T13:43:37Z)
- Evidential time-to-event prediction model with well-calibrated uncertainty estimation [12.446406577462069]
We introduce an evidential regression model designed especially for time-to-event prediction tasks.
The most plausible event time is directly quantified by aggregated Gaussian random fuzzy numbers (GRFNs).
Our model achieves both accurate and reliable performance, outperforming state-of-the-art methods.
arXiv Detail & Related papers (2024-11-12T15:06:04Z)
- Interval Estimation of Coefficients in Penalized Regression Models of Insurance Data [3.5637073151604093]
The Tweedie exponential dispersion family is a popular choice for modeling insurance losses.
It is often important to obtain credibility (inference) of the most important features that describe the endogenous variables.
arXiv Detail & Related papers (2024-10-01T18:57:18Z)
- Error-based Knockoffs Inference for Controlled Feature Selection [49.99321384855201]
We propose an error-based knockoff inference method by integrating the knockoff features, the error-based feature importance statistics, and the stepdown procedure together.
The proposed inference procedure does not require specifying a regression model and can handle feature selection with theoretical guarantees.
arXiv Detail & Related papers (2022-03-09T01:55:59Z)
- Modeling High-Dimensional Data with Unknown Cut Points: A Fusion Penalized Logistic Threshold Regression [2.520538806201793]
In traditional logistic regression models, the link function is often assumed to be linear and continuous in predictors.
We consider a threshold model that all continuous features are discretized into ordinal levels, which further determine the binary responses.
We find the lasso model is well suited in the problem of early detection and prediction for chronic disease like diabetes.
arXiv Detail & Related papers (2022-02-17T04:16:40Z)
- Uncertainty Modeling for Out-of-Distribution Generalization [56.957731893992495]
We argue that the feature statistics can be properly manipulated to improve the generalization ability of deep learning models.
Common methods often consider the feature statistics as deterministic values measured from the learned features.
We improve the network generalization ability by modeling the uncertainty of domain shifts with synthesized feature statistics during training.
arXiv Detail & Related papers (2022-02-08T16:09:12Z)
- When in Doubt: Neural Non-Parametric Uncertainty Quantification for Epidemic Forecasting [70.54920804222031]
Most existing forecasting models disregard uncertainty quantification, resulting in mis-calibrated predictions.
Recent works in deep neural models for uncertainty-aware time-series forecasting also have several limitations.
We model the forecasting task as a probabilistic generative process and propose a functional neural process model called EPIFNP.
arXiv Detail & Related papers (2021-06-07T18:31:47Z)
- Multivariate Probabilistic Regression with Natural Gradient Boosting [63.58097881421937]
We propose a Natural Gradient Boosting (NGBoost) approach based on nonparametrically modeling the conditional parameters of the multivariate predictive distribution.
Our method is robust, works out-of-the-box without extensive tuning, is modular with respect to the assumed target distribution, and performs competitively in comparison to existing approaches.
arXiv Detail & Related papers (2021-06-07T17:44:49Z)
- SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression [68.66245730450915]
We develop an improved method for debiasing predictions and estimating frequentist uncertainty for practical datasets.
Our main contribution is SLOE, an estimator of the signal strength with convergence guarantees that reduces the computation time of estimation and inference by orders of magnitude.
arXiv Detail & Related papers (2021-03-23T17:48:56Z)
- Curse of Small Sample Size in Forecasting of the Active Cases in COVID-19 Outbreak [0.0]
During the COVID-19 pandemic, a massive number of attempts on the predictions of the number of cases and the other future trends of this pandemic have been made.
However, they fail to predict, in a reliable way, the medium and long term evolution of fundamental features of COVID-19 outbreak within acceptable accuracy.
This paper gives an explanation for the failure of machine learning models in this particular forecasting problem.
arXiv Detail & Related papers (2020-11-06T23:13:34Z)
- Causal Transfer Random Forest: Combining Logged Data and Randomized Experiments for Robust Prediction [8.736551469632758]
We describe a causal transfer random forest (CTRF) that combines existing training data with a small amount of data from a randomized experiment to train a model.
We evaluate the CTRF using both synthetic data experiments and real-world experiments in the Bing Ads platform.
arXiv Detail & Related papers (2020-10-17T03:54:37Z)
- Unlabelled Data Improves Bayesian Uncertainty Calibration under Covariate Shift [100.52588638477862]
We develop an approximate Bayesian inference scheme based on posterior regularisation.
We demonstrate the utility of our method in the context of transferring prognostic models of prostate cancer across globally diverse populations.
arXiv Detail & Related papers (2020-06-26T13:50:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.