Two-stage Hypothesis Tests for Variable Interactions with FDR Control
- URL: http://arxiv.org/abs/2209.00077v1
- Date: Wed, 31 Aug 2022 19:17:00 GMT
- Title: Two-stage Hypothesis Tests for Variable Interactions with FDR Control
- Authors: Jingyi Duan, Yang Ning, Xi Chen, Yong Chen
- Abstract summary: We propose a two-stage testing procedure with false discovery rate (FDR) control, a multiple-testing correction known to be less conservative than family-wise error rate control.
We demonstrate via comprehensive simulation studies that our two-stage procedure is computationally more efficient than the classical BH procedure, with comparable or improved statistical power.
- Score: 10.750902543185802
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In many scenarios such as genome-wide association studies where dependences
between variables commonly exist, it is often of interest to infer the
interaction effects in the model. However, testing pairwise interactions among
millions of variables in complex and high-dimensional data suffers from low
statistical power and huge computational cost. To address these challenges, we
propose a two-stage testing procedure with false discovery rate (FDR) control,
which is known to be a less conservative multiple-testing correction.
Theoretically, the difficulty in FDR control is due to the data dependence
among test statistics in the two stages, and the fact that the number of hypothesis
tests conducted in the second stage depends on the screening result in the
first stage. By using the Cramér type moderate deviation technique, we show
that our procedure controls FDR at the desired level asymptotically in the
generalized linear model (GLM), where the model is allowed to be misspecified.
In addition, the asymptotic power of the FDR control procedure is rigorously
established. We demonstrate via comprehensive simulation studies that our
two-stage procedure is computationally more efficient than the classical BH
procedure, with comparable or improved statistical power. Finally, we apply
the proposed method to bladder cancer data from dbGaP, where the scientific
goal is to identify genetic susceptibility loci for bladder cancer.
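The classical Benjamini-Hochberg (BH) procedure that the abstract uses as its baseline can be sketched in a few lines. This is a generic illustration of BH on synthetic p-values, not the paper's two-stage procedure; the function name and toy data are for demonstration only:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.1):
    """Return a boolean mask of hypotheses rejected by BH at FDR level alpha."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    # BH step-up rule: find the largest k with p_(k) <= (k/m) * alpha,
    # then reject the hypotheses with the k smallest p-values.
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0]) + 1
        reject[order[:k]] = True
    return reject

# Toy example: 5 strong signals hidden among 95 uniform null p-values.
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(0, 1e-4, 5), rng.uniform(0, 1, 95)])
print("rejections:", benjamini_hochberg(pvals, alpha=0.1).sum())
```

Running BH on all pairwise interaction p-values directly is what the paper's two-stage screening is designed to avoid: with millions of variables, even forming the full set of pairwise test statistics is computationally expensive.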
Related papers
- Likelihood-Free Inference and Hierarchical Data Assimilation for Geological Carbon Storage [0.0]
We develop a hierarchical data assimilation framework for carbon storage.
The framework uses Monte Carlo-based approximate Bayesian computation.
It reduces computational costs by using a 3D recurrent R-U-Net deep-learning surrogate model.
arXiv Detail & Related papers (2024-10-20T06:15:56Z) - Near-optimal multiple testing in Bayesian linear models with finite-sample FDR control [11.011242089340438]
In high-dimensional variable selection problems, statisticians often seek to design multiple-testing procedures that control the False Discovery Rate (FDR).
We introduce Model-X procedures that provably control the frequentist FDR from finite samples, even when the model is misspecified.
Our proposed procedure, PoEdCe, incorporates three key ingredients: Posterior Expectation, distilled randomization test (dCRT), and the Benjamini-Hochberg procedure with e-values.
arXiv Detail & Related papers (2022-11-04T22:56:41Z) - Probabilistic Model Incorporating Auxiliary Covariates to Control FDR [6.270317798744481]
Controlling False Discovery Rate (FDR) while leveraging the side information of multiple hypothesis testing is an emerging research topic in modern data science.
We propose a deep black-box framework controlling FDR (named NeurT-FDR) which boosts statistical power and controls FDR for multiple-hypothesis testing.
We show that NeurT-FDR makes substantially more discoveries in three real datasets compared to competitive baselines.
arXiv Detail & Related papers (2022-10-06T19:35:53Z) - Statistical and Computational Phase Transitions in Group Testing [73.55361918807883]
We study the group testing problem where the goal is to identify a set of k infected individuals carrying a rare disease.
We consider two different simple random procedures for assigning tests to individuals.
arXiv Detail & Related papers (2022-06-15T16:38:50Z) - Sequential Permutation Testing of Random Forest Variable Importance Measures [68.8204255655161]
It is proposed here to use sequential permutation tests and sequential p-value estimation to reduce the high computational costs associated with conventional permutation tests.
The results of simulation studies confirm that the theoretical properties of the sequential tests apply.
The numerical stability of the methods is investigated in two additional application studies.
arXiv Detail & Related papers (2022-06-02T20:16:50Z) - Directional FDR Control for Sub-Gaussian Sparse GLMs [4.229179009157074]
False discovery rate (FDR) control aims to identify a small number of statistically significant nonzero results.
We construct the debiased matrix-Lasso estimator and prove the normality by minimax-rate oracle inequalities for sparse GLMs.
arXiv Detail & Related papers (2021-05-02T05:34:32Z) - Deep Learning in current Neuroimaging: a multivariate approach with power and type I error control but arguable generalization ability [0.158310730488265]
A non-parametric framework is proposed that estimates the statistical significance of classifications using deep learning architectures.
A label permutation test is proposed in both studies using cross-validation (CV) and resubstitution with upper bound correction (RUB) as validation methods.
We found in the permutation test that CV and RUB methods offer a false positive rate close to the significance level and an acceptable statistical power.
arXiv Detail & Related papers (2021-03-30T21:15:39Z) - Bayesian prognostic covariate adjustment [59.75318183140857]
Historical data about disease outcomes can be integrated into the analysis of clinical trials in many ways.
We build on existing literature that uses prognostic scores from a predictive model to increase the efficiency of treatment effect estimates.
arXiv Detail & Related papers (2020-12-24T05:19:03Z) - Tracking disease outbreaks from sparse data with Bayesian inference [55.82986443159948]
The COVID-19 pandemic provides new motivation for estimating the empirical rate of transmission during an outbreak.
Standard methods struggle to accommodate the partial observability and sparse data common at finer scales.
We propose a Bayesian framework which accommodates partial observability in a principled manner.
arXiv Detail & Related papers (2020-09-12T20:37:33Z) - Lower bounds in multiple testing: A framework based on derandomized
proxies [107.69746750639584]
This paper introduces an analysis strategy based on derandomization, illustrated by applications to various concrete models.
We provide numerical simulations of some of these lower bounds, and show a close relation to the actual performance of the Benjamini-Hochberg (BH) algorithm.
arXiv Detail & Related papers (2020-05-07T19:59:51Z) - SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier
Detection [63.253850875265115]
Outlier detection (OD) is a key machine learning (ML) task for identifying abnormal objects from general samples.
We propose a modular acceleration system, called SUOD, to address it.
arXiv Detail & Related papers (2020-03-11T00:22:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.