A Conditional Randomization Test for Sparse Logistic Regression in
High-Dimension
- URL: http://arxiv.org/abs/2205.14613v1
- Date: Sun, 29 May 2022 09:37:16 GMT
- Title: A Conditional Randomization Test for Sparse Logistic Regression in
High-Dimension
- Authors: Binh T. Nguyen, Bertrand Thirion, Sylvain Arlot
- Abstract summary: CRT-logit is an algorithm that combines a variable-distillation step and a decorrelation step.
We provide a theoretical analysis of this procedure, and demonstrate its effectiveness on simulations, along with experiments on large-scale brain-imaging and genomics datasets.
- Score: 36.00360315353985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Identifying the relevant variables for a classification model with correct
confidence levels is a central but difficult task in high-dimension. Despite
the core role of sparse logistic regression in statistics and machine learning,
it still lacks a good solution for accurate inference in the regime where the
number of features $p$ is as large as or larger than the number of samples $n$.
Here, we tackle this problem by improving the Conditional Randomization Test
(CRT). The original CRT algorithm shows promise as a way to output p-values
while making few assumptions on the distribution of the test statistics. As it
comes with a prohibitive computational cost even in mildly high-dimensional
problems, faster solutions based on distillation have been proposed. Yet, they
rely on unrealistic hypotheses and result in low-power solutions. To improve
this, we propose \emph{CRT-logit}, an algorithm that combines a
variable-distillation step and a decorrelation step that takes into account the
geometry of the $\ell_1$-penalized logistic regression problem. We provide a
theoretical analysis of this procedure, and demonstrate its effectiveness on
simulations, along with experiments on large-scale brain-imaging and genomics
datasets.
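To make the distillation idea concrete, below is a minimal, hypothetical sketch of a distilled-CRT-style test for a single feature in sparse logistic regression. It is not the authors' CRT-logit implementation: the function name `dcrt_pvalue` and all tuning choices are illustrative, and the CRT-logit decorrelation step is replaced here by a simple plug-in variance normalization of the residual covariance.

```python
# Hypothetical distilled-CRT-style test for one feature; a simplified
# sketch, NOT the paper's CRT-logit algorithm.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LassoCV, LogisticRegressionCV

def dcrt_pvalue(X, y, j):
    """Approximate p-value for feature j via distillation.

    Regress x_j and y on the remaining features, then test whether the
    two residuals still co-vary; under the null hypothesis that x_j is
    independent of y given the other features, the normalized statistic
    is approximately standard normal.
    """
    X_minus_j = np.delete(X, j, axis=1)

    # Step 1 (distill x_j): predict feature j from the other features.
    x_model = LassoCV(cv=5).fit(X_minus_j, X[:, j])
    eps_j = X[:, j] - x_model.predict(X_minus_j)

    # Step 2 (distill y): sparse logistic fit on the other features.
    y_model = LogisticRegressionCV(
        penalty="l1", solver="liblinear", cv=5
    ).fit(X_minus_j, y)
    pi_hat = y_model.predict_proba(X_minus_j)[:, 1]
    eps_y = y - pi_hat

    # Step 3: normalized residual covariance as the test statistic.
    # (CRT-logit's decorrelation step, which accounts for the geometry
    # of the l1-penalized logistic loss, is simplified away here.)
    var_hat = np.sum(eps_j**2 * pi_hat * (1.0 - pi_hat))
    T = np.dot(eps_j, eps_y) / np.sqrt(var_hat)
    return 2.0 * norm.sf(abs(T))
```

In practice one would loop `dcrt_pvalue` over all $p$ features and feed the resulting p-values to a multiple-testing correction such as Benjamini-Hochberg; the paper's concern is precisely keeping such a loop statistically valid and computationally affordable when $p$ is comparable to or larger than $n$.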
Related papers
- Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure.
We show that computationally efficient algorithms, within both the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) frameworks, necessarily require $\Omega(d^{k^\star/2})$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z)
- Globally Convergent Accelerated Algorithms for Multilinear Sparse Logistic Regression with $\ell_0$-constraints [2.323238724742687]
Multilinear logistic regression serves as a powerful tool for the analysis of multidimensional data.
We propose an accelerated proximal alternating linearized minimization method (APALM$+$) to solve the $\ell_0$-MLSR model.
We also show that APALM$+$ converges globally to a first-order critical point and establish its convergence rate using the Kurdyka-Łojasiewicz property.
arXiv Detail & Related papers (2023-09-17T11:05:08Z)
- Statistical-Computational Tradeoffs in Mixed Sparse Linear Regression [20.00109111254507]
We show that the problem suffers from a $\frac{k}{\mathrm{SNR}^2}$-to-$\frac{k^2}{\mathrm{SNR}^2}$ statistical-to-computational gap.
We also analyze a simple thresholding algorithm which, outside of the narrow regime where the problem is hard, solves the associated mixed regression detection problem.
arXiv Detail & Related papers (2023-03-03T18:03:49Z)
- Retire: Robust Expectile Regression in High Dimensions [3.9391041278203978]
Penalized quantile and expectile regression methods offer useful tools to detect heteroscedasticity in high-dimensional data.
We propose and study (penalized) robust expectile regression (retire).
We show that the proposed procedure can be efficiently solved by a semismooth Newton coordinate descent algorithm.
arXiv Detail & Related papers (2022-12-11T18:03:12Z)
- Heavy-tailed Streaming Statistical Estimation [58.70341336199497]
We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples.
We design a clipped gradient descent and provide an improved analysis under a more nuanced condition on the noise of the gradients; a minimal clipped-update sketch appears after this list.
arXiv Detail & Related papers (2021-08-25T21:30:27Z)
- Distributionally Robust Optimization with Markovian Data [8.126833795693699]
We study a program where the probability distribution of the uncertain problem parameters is unknown.
We propose a data-driven distributionally robust optimization approach to estimate the problem's objective function and optimal solution.
arXiv Detail & Related papers (2021-06-12T10:59:02Z)
- SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression [68.66245730450915]
We develop an improved method for debiasing predictions and estimating frequentist uncertainty for practical datasets.
Our main contribution is SLOE, an estimator of the signal strength with convergence guarantees that reduces the computation time of estimation and inference by orders of magnitude.
arXiv Detail & Related papers (2021-03-23T17:48:56Z)
- Least Squares Regression with Markovian Data: Fundamental Limits and Algorithms [69.45237691598774]
We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain.
We establish sharp information-theoretic minimax lower bounds for this problem in terms of the mixing time $\tau_{\mathsf{mix}}$.
We propose an algorithm based on experience replay, a popular reinforcement learning technique, that achieves a significantly better error rate; a toy replay-based update is sketched after this list.
arXiv Detail & Related papers (2020-06-16T04:26:50Z)
- Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)
- Instability, Computational Efficiency and Statistical Accuracy [101.32305022521024]
We develop a framework that yields statistical accuracy based on the interplay between the deterministic convergence rate of the algorithm at the population level and its degree of (in)stability when applied to an empirical object based on $n$ samples.
We provide applications of our general results to several concrete classes of models, including Gaussian mixture estimation, non-linear regression models, and informative non-response models.
arXiv Detail & Related papers (2020-05-22T22:30:52Z)
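Two of the entries above lend themselves to compact illustrations. First, for the heavy-tailed streaming estimation entry, a hedged sketch of clipped stochastic gradient descent applied to streaming mean estimation; the clip level and step-size schedule are placeholders, not the paper's exact algorithm.

```python
# Clipped SGD for streaming mean estimation under heavy-tailed noise;
# an illustrative sketch with placeholder hyperparameters.
import numpy as np

def clipped_sgd_mean(stream, clip_level=5.0, step=1.0):
    """Estimate a p-dimensional mean from an iterable of samples."""
    theta = None
    for t, x in enumerate(stream, start=1):
        x = np.asarray(x, dtype=float)
        if theta is None:
            theta = np.zeros_like(x)
        g = theta - x  # gradient of 0.5 * ||theta - x||^2
        g_norm = np.linalg.norm(g)
        if g_norm > clip_level:
            g *= clip_level / g_norm  # clipping bounds the influence of outliers
        theta -= (step / t) * g  # decaying step size
    return theta
```

Second, for the least-squares-with-Markovian-data entry, a toy version of the experience-replay idea: store recent pairs in a buffer and update on uniformly re-sampled pairs, which weakens the temporal correlation of the Markovian stream. The buffer size, replay count, and step size below are assumptions for illustration, not the paper's settings.

```python
# Toy experience-replay SGD for least squares on a Markovian stream;
# a sketch of the general idea, not the paper's algorithm.
import numpy as np

def replay_sgd(xs, ys, buffer_size=500, replays_per_step=4,
               step=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(xs.shape[1])
    buffer = []
    for x, y in zip(xs, ys):
        buffer.append((x, y))
        if len(buffer) > buffer_size:
            buffer.pop(0)  # keep only the most recent pairs
        for _ in range(replays_per_step):
            # Uniform re-sampling from the buffer decorrelates updates.
            xb, yb = buffer[rng.integers(len(buffer))]
            theta -= step * (xb @ theta - yb) * xb  # squared-loss gradient
    return theta
```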