Related papers: Diffusion-Driven High-Dimensional Variable Selection

URL: http://arxiv.org/abs/2508.13890v1
Date: Tue, 19 Aug 2025 14:54:20 GMT
Title: Diffusion-Driven High-Dimensional Variable Selection
Authors: Minjie Wang, Xiaotong Shen, Wei Pan
Abstract summary: We propose a resample-aggregate framework that exploits diffusion models' ability to generate high-fidelity synthetic data. We show that the proposed method is selection consistent under mild assumptions. Our method advances variable selection methodology and broadens the toolkit for interpretable, statistically rigorous analysis.
Score: 6.993247097440294
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Variable selection for high-dimensional, highly correlated data has long been a challenging problem, often yielding unstable and unreliable models. We propose a resample-aggregate framework that exploits diffusion models' ability to generate high-fidelity synthetic data. Specifically, we draw multiple pseudo-data sets from a diffusion model fitted to the original data, apply any off-the-shelf selector (e.g., lasso or SCAD), and store the resulting inclusion indicators and coefficients. Aggregating across replicas produces a stable subset of predictors with calibrated stability scores for variable selection. Theoretically, we show that the proposed method is selection consistent under mild assumptions. Because the generative model imports knowledge from large pre-trained weights, the procedure naturally benefits from transfer learning, boosting power when the observed sample is small or noisy. We also extend the framework of aggregating synthetic data to other model selection problems, including graphical model selection, and statistical inference that supports valid confidence intervals and hypothesis tests. Extensive simulations show consistent gains over the lasso, stability selection, and knockoff baselines, especially when predictors are strongly correlated, achieving higher true-positive rates and lower false-discovery proportions. By coupling diffusion-based data augmentation with principled aggregation, our method advances variable selection methodology and broadens the toolkit for interpretable, statistically rigorous analysis in complex scientific applications.
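
The resample-aggregate procedure described in the abstract can be sketched as follows. This is only an illustration under loud assumptions, not the authors' implementation: the diffusion model is replaced by a multivariate Gaussian fitted to the joint (X, y) data as a stand-in generator, the selector is a small coordinate-descent lasso written in NumPy, and the names `sample_synthetic`, `resample_aggregate`, the penalty `lam`, and the 0.6 frequency threshold are hypothetical choices.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Lasso by cyclic coordinate descent: min (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()  # equals y - X @ beta while beta is zero
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (excluding j).
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def sample_synthetic(data, n_samples, rng, ridge=1e-3):
    """ASSUMPTION: Gaussian stand-in for the fitted diffusion model."""
    mu = data.mean(axis=0)
    cov = np.cov(data, rowvar=False) + ridge * np.eye(data.shape[1])
    return rng.multivariate_normal(mu, cov, size=n_samples)

def resample_aggregate(X, y, lam=0.2, n_replicas=30, seed=0):
    """Draw pseudo-data sets, run the selector on each, aggregate the results."""
    rng = np.random.default_rng(seed)
    data = np.column_stack([X, y])
    n, p = X.shape
    included = np.zeros((n_replicas, p), dtype=bool)
    coefs = np.zeros((n_replicas, p))
    for r in range(n_replicas):
        synth = sample_synthetic(data, n, rng)
        Xs = synth[:, :p] - synth[:, :p].mean(axis=0)  # center predictors
        ys = synth[:, p] - synth[:, p].mean()          # center response
        coefs[r] = lasso_cd(Xs, ys, lam)
        included[r] = coefs[r] != 0                    # inclusion indicators
    # Selection frequency per predictor and average coefficient across replicas.
    return included.mean(axis=0), coefs.mean(axis=0)

# Toy data: AR(1)-correlated predictors, only the first three are active.
rng = np.random.default_rng(1)
n, p = 200, 20
idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -2.0, 1.5]
y = X @ beta_true + rng.normal(size=n)

freq, mean_coef = resample_aggregate(X, y)
selected = np.flatnonzero(freq >= 0.6)  # hypothetical stability threshold
print("selection frequencies:", np.round(freq, 2))
print("selected predictors:", selected)
```

Swapping `sample_synthetic` for draws from an actual fitted generative model, or `lasso_cd` for another selector such as SCAD, leaves the aggregation step unchanged; the raw selection frequency used here is a simple proxy and is not claimed to match the calibrated stability scores of the paper.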

Related papers

Distributionally Robust Feature Selection [14.493253907785473]
We study the problem of selecting limited features to observe such that models trained on them can perform well simultaneously across multiple subpopulations. Our method frames the problem as a continuous relaxation of traditional variable selection using a noising mechanism. We develop a model-agnostic framework that balances overall performance of downstream prediction across populations.
arXiv Detail & Related papers (2025-10-24T03:03:30Z)
Going from a Representative Agent to Counterfactuals in Combinatorial Choice [2.9172603864294033]
We study decision-making problems where data comprises points from a collection of binary polytopes. We propose a nonparametric approach for counterfactual inference in this setting based on a representative agent model.
arXiv Detail & Related papers (2025-05-29T15:24:23Z)
Spatial Reasoning with Denoising Models [49.83744014336816]
We introduce a framework to perform reasoning over sets of continuous variables via denoising generative models. We show, for the first time, that the order of generation can successfully be predicted by the denoising network itself. Using these findings, we can increase the accuracy of specific reasoning tasks from 1% to >50%.
arXiv Detail & Related papers (2025-02-28T14:08:30Z)
Diffusion posterior sampling for simulation-based inference in tall data settings [53.17563688225137]
Simulation-based inference (SBI) is capable of approximating the posterior distribution that relates input parameters to a given observation. In this work, we consider a tall data extension in which multiple observations are available to better infer the parameters of the model. We compare our method to recently proposed competing approaches on various numerical experiments and demonstrate its superiority in terms of numerical stability and computational cost.
arXiv Detail & Related papers (2024-04-11T09:23:36Z)
Improving Out-of-Distribution Robustness of Classifiers via Generative Interpolation [56.620403243640396]
Deep neural networks achieve superior performance for learning from independent and identically distributed (i.i.d.) data. However, their performance deteriorates significantly when handling out-of-distribution (OoD) data. We develop a simple yet effective method called Generative Interpolation to fuse generative models trained from multiple domains for synthesizing diverse OoD samples.
arXiv Detail & Related papers (2023-07-23T03:53:53Z)
Class-Balancing Diffusion Models [57.38599989220613]
Class-Balancing Diffusion Models (CBDM) are trained with a distribution adjustment regularizer to address class imbalance. Our method is benchmarked on the CIFAR100/CIFAR100LT datasets and shows outstanding performance on the downstream recognition task.
arXiv Detail & Related papers (2023-04-30T20:00:14Z)
Learning Multivariate CDFs and Copulas using Tensor Factorization [39.24470798045442]
Learning the multivariate distribution of data is a core challenge in statistics and machine learning. In this work, we aim to learn multivariate cumulative distribution functions (CDFs), as they can handle mixed random variables. We show that any grid sampled version of a joint CDF of mixed random variables admits a universal representation as a naive Bayes model. We demonstrate the superior performance of the proposed model in several synthetic and real datasets and applications including regression, sampling and data imputation.
arXiv Detail & Related papers (2022-10-13T16:18:46Z)
Two-Stage Robust and Sparse Distributed Statistical Inference for Large-Scale Data [18.34490939288318]
We address the problem of conducting statistical inference in settings involving large-scale data that may be high-dimensional and contaminated by outliers. We propose two-stage distributed and robust statistical inference procedures that cope with high-dimensional models by promoting sparsity.
arXiv Detail & Related papers (2022-08-17T11:17:47Z)
Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution. We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
Transfer Learning with Multi-source Data: High-dimensional Inference for Group Distributionally Robust Models [0.0]
Learning with multi-source data helps improve model generalizability and is integral to many important statistical problems. This paper considers multiple high-dimensional regression models for the multi-source data. We devise a novel DenseNet sampling method to construct valid confidence intervals for the high-dimensional maximin effect.
arXiv Detail & Related papers (2020-11-15T16:15:10Z)
Causal Transfer Random Forest: Combining Logged Data and Randomized Experiments for Robust Prediction [8.736551469632758]
We describe a causal transfer random forest (CTRF) that combines existing training data with a small amount of data from a randomized experiment to train a model. We evaluate the CTRF using both synthetic data experiments and real-world experiments in the Bing Ads platform.
arXiv Detail & Related papers (2020-10-17T03:54:37Z)
Modeling Shared Responses in Neuroimaging Studies through MultiView ICA [94.31804763196116]
Group studies involving large cohorts of subjects are important to draw general conclusions about brain functional organization. We propose a novel MultiView Independent Component Analysis model for group studies, where data from each subject are modeled as a linear combination of shared independent sources plus noise. We demonstrate the usefulness of our approach first on fMRI data, where our model demonstrates improved sensitivity in identifying common sources among subjects.
arXiv Detail & Related papers (2020-06-11T17:29:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.