Related papers: What is in the model? A Comparison of variable selection criteria and model search approaches

URL: http://arxiv.org/abs/2510.02628v1
Date: Fri, 03 Oct 2025 00:14:43 GMT
Title: What is in the model? A Comparison of variable selection criteria and model search approaches
Authors: Shuangshuang Xu, Marco A. R. Ferreira, Allison N. Tegge
Abstract summary: We compare variable selection methods using three performance measures: correct identification rate (CIR), recall, and false discovery rate (FDR). The results show that exhaustive search BIC and stochastic search BIC outperform the other methods on small and large model spaces, respectively.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: For many scientific questions, understanding the underlying mechanism is the goal. To help investigators better understand the underlying mechanism, variable selection is a crucial step that permits the identification of the most associated regression variables of interest. A variable selection method consists of model evaluation using an information criterion and a search of the model space. Here, we provide a comprehensive comparison of variable selection methods using performance measures of correct identification rate (CIR), recall, and false discovery rate (FDR). We consider the BIC and AIC for evaluating models, and exhaustive, greedy, LASSO path, and stochastic search approaches for searching the model space; we also consider LASSO using cross validation. We perform simulation studies for linear and generalized linear models that parametrically explore a wide range of realistic sample sizes, effect sizes, and correlations among regression variables. We consider model spaces with a small and larger number of potential regressors. The results show that the exhaustive search BIC and stochastic search BIC outperform the other methods when considering the performance measures on small and large model spaces, respectively. These approaches result in the highest CIR and lowest FDR, which collectively may support long-term efforts towards increasing replicability in research.
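As a rough illustration of the pipeline the abstract describes (score each candidate model with an information criterion, search the model space, then measure how well the true variables were recovered), here is a minimal sketch of exhaustive search BIC for a Gaussian linear model on a small model space. It is not the authors' code. The function names, the simulation settings (n = 200, six candidate regressors, two true effects of size 2), and the exact metric definitions are assumptions made for illustration: CIR is taken as the share of replicates in which the selected set equals the true set, recall as the share of true variables selected, and FDR as the average share of selected variables that are false.

```python
import itertools
import numpy as np


def bic_linear(X, y, subset):
    """BIC (up to an additive constant) of a Gaussian linear model
    with an intercept plus the columns of X listed in `subset`."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    return n * np.log(rss / n) + design.shape[1] * np.log(n)


def exhaustive_bic(X, y):
    """Score all 2^p subsets and return the one with the lowest BIC."""
    p = X.shape[1]
    subsets = (s for k in range(p + 1)
               for s in itertools.combinations(range(p), k))
    return set(min(subsets, key=lambda s: bic_linear(X, y, s)))


def selection_metrics(selected, truth):
    """Per-replicate measures (assumed definitions, see lead-in)."""
    tp = len(selected & truth)
    recall = tp / len(truth) if truth else 1.0
    fdp = (len(selected) - tp) / len(selected) if selected else 0.0
    return {"correct": selected == truth, "recall": recall, "fdp": fdp}


# Hypothetical simulation: average the per-replicate measures.
rng = np.random.default_rng(0)
n, p, truth = 200, 6, {0, 2}
beta = np.zeros(p)
beta[list(truth)] = 2.0

results = []
for _ in range(20):
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(size=n)
    results.append(selection_metrics(exhaustive_bic(X, y), truth))

summary = {
    "CIR": float(np.mean([r["correct"] for r in results])),
    "recall": float(np.mean([r["recall"] for r in results])),
    "FDR": float(np.mean([r["fdp"] for r in results])),
}
print(summary)
```

The exhaustive search scores 2^p models, so it is only practical for a small number of candidate regressors; that is presumably why the abstract turns to greedy, LASSO path, and stochastic search for the larger model space.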

Related papers

Scalable branch-and-bound model selection with non-monotonic criteria including AIC, BIC and Mallows's $\mathit{C_p}$ [1.3592625530347717]
We introduce a simple but novel bound that enables the development of branch-and-bound algorithms tailored for non-monotonic functions. We demonstrate that our approach guarantees identification of the optimal model(s) across diverse model classes, sizes, and applications.
arXiv Detail & Related papers (2025-12-13T07:16:10Z)
Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models [63.00458229517523]
This work addresses the evaluation challenge of reward models by probing preference representations. We construct a Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks for different preference dimensions. We introduce an analysis method, inference-time probing, which identifies the dimensions used during the reward prediction and enhances its interpretability.
arXiv Detail & Related papers (2025-11-16T05:29:29Z)
Model Correlation Detection via Random Selection Probing [62.093777777813756]
Existing similarity-based methods require access to model parameters or produce scores without thresholds. We introduce Random Selection Probing (RSP), a hypothesis-testing framework that formulates model correlation detection as a statistical test. RSP produces rigorous p-values that quantify evidence of correlation.
arXiv Detail & Related papers (2025-09-29T01:40:26Z)
Exploring Training and Inference Scaling Laws in Generative Retrieval [50.82554729023865]
Generative retrieval reformulates retrieval as an autoregressive generation task, where large language models generate target documents directly from a query. We systematically investigate training and inference scaling laws in generative retrieval, exploring how model size, training data scale, and inference-time compute jointly influence performance.
arXiv Detail & Related papers (2025-03-24T17:59:03Z)
Latent Semantic Consensus For Deterministic Geometric Model Fitting [109.44565542031384]
We propose an effective method called Latent Semantic Consensus (LSC). LSC formulates the model fitting problem into two latent semantic spaces based on data points and model hypotheses. LSC is able to provide consistent and reliable solutions within only a few milliseconds for general multi-structural model fitting.
arXiv Detail & Related papers (2024-03-11T05:35:38Z)
A Statistical-Modelling Approach to Feedforward Neural Network Model Selection [0.8287206589886881]
Feedforward neural networks (FNNs) can be viewed as non-linear regression models. A novel model selection method is proposed using the Bayesian information criterion (BIC) for FNNs. The choice of BIC over out-of-sample performance leads to an increased probability of recovering the true model.
arXiv Detail & Related papers (2022-07-09T11:07:04Z)
Bayesian Active Learning for Discrete Latent Variable Models [19.852463786440122]
Active learning seeks to reduce the amount of data required to fit the parameters of a model. Latent variable models play a vital role in neuroscience, psychology, and a variety of other engineering and scientific disciplines.
arXiv Detail & Related papers (2022-02-27T19:07:12Z)
Variable selection with missing data in both covariates and outcomes: Imputation and machine learning [1.0333430439241666]
The missing data issue is ubiquitous in health studies. Machine learning methods weaken parametric assumptions. XGBoost and BART have the overall best performance across various settings.
arXiv Detail & Related papers (2021-04-06T20:18:29Z)
Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously. We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework. The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)
Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays [62.997667081978825]
Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses. Current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets. We propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood.
arXiv Detail & Related papers (2020-10-06T04:28:19Z)
Modeling Shared Responses in Neuroimaging Studies through MultiView ICA [94.31804763196116]
Group studies involving large cohorts of subjects are important to draw general conclusions about brain functional organization. We propose a novel MultiView Independent Component Analysis model for group studies, where data from each subject are modeled as a linear combination of shared independent sources plus noise. We demonstrate the usefulness of our approach first on fMRI data, where our model demonstrates improved sensitivity in identifying common sources among subjects.
arXiv Detail & Related papers (2020-06-11T17:29:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.