Heterogeneous Transfer Learning for Building High-Dimensional
Generalized Linear Models with Disparate Datasets
- URL: http://arxiv.org/abs/2312.12786v1
- Date: Wed, 20 Dec 2023 06:11:59 GMT
- Authors: Ruzhang Zhao, Prosenjit Kundu, Arkajyoti Saha, Nilanjan Chatterjee
- Abstract summary: We describe a transfer learning approach for building high-dimensional generalized linear models.
We show that the use of adaptive-Lasso penalty leads to the oracle property of underlying parameter estimates.
We illustrate a timely application of the proposed method for the development of risk prediction models for five common diseases.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Development of comprehensive prediction models is often of great interest in
many disciplines of science, but datasets with information on all desired
features typically have small sample sizes. In this article, we describe a
transfer learning approach for building high-dimensional generalized linear
models using data from a main study that has detailed information on all
predictors, and from one or more external studies that have ascertained a more
limited set of predictors. We propose using the external dataset(s) to build
reduced model(s) and then transfer the information on underlying parameters for
the analysis of the main study through a set of calibration equations, while
accounting for the study-specific effects of certain design variables. We then
use a generalized method of moments (GMM) approach with penalization for
parameter estimation and develop highly scalable model-fitting algorithms taking
advantage of the popular glmnet package. We further show that the use of
adaptive-Lasso penalty leads to the oracle property of underlying parameter
estimates and thus leads to convenient post-selection inference procedures. We
conduct extensive simulation studies to investigate both predictive performance
and post-selection inference properties of the proposed method. Finally, we
illustrate a timely application of the proposed method for the development of
risk prediction models for five common diseases using the UK Biobank study,
combining baseline information from all study participants (500K) and recently
released high-throughput proteomic data (~1,500 proteins) on a subset (50K) of
the participants.
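The adaptive-Lasso penalty mentioned in the abstract can be sketched as a weighted Lasso in which a pilot estimate supplies per-coefficient penalty weights; rescaling the design columns by those weights reduces it to an ordinary Lasso fit. The sketch below is a minimal Gaussian-linear illustration using scikit-learn, with an OLS pilot standing in for the external-study information; it is not the authors' calibration-equation GMM procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta = np.array([3.0, -2.0, 1.5] + [0.0] * 7)  # three true signals
y = X @ beta + rng.standard_normal(n)

# Step 1: pilot estimate (plain OLS here, purely for illustration)
pilot = LinearRegression().fit(X, y).coef_
weights = np.abs(pilot)  # adaptive weight for coefficient j is 1 / |pilot_j|

# Step 2: adaptive Lasso via rescaling: run an ordinary Lasso on X * diag(|pilot|)
lasso = Lasso(alpha=0.1).fit(X * weights, y)
coef = lasso.coef_ * weights  # map coefficients back to the original scale

print(np.flatnonzero(coef))  # indices of the selected predictors
```

Because each truly-zero coefficient receives a large penalty weight while strong signals are penalized lightly, the noise coefficients are shrunk exactly to zero with high probability, which is the intuition behind the oracle property the paper establishes.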
Related papers
- Few-Shot Load Forecasting Under Data Scarcity in Smart Grids: A Meta-Learning Approach [0.18641315013048293]
This paper proposes adapting an established model-agnostic meta-learning algorithm for short-term load forecasting.
The proposed method can rapidly adapt and generalize within any unknown load time series of arbitrary length.
The proposed model is evaluated using a dataset of historical load consumption data from real-world consumers.
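The model-agnostic meta-learning (MAML) idea this paper adapts can be illustrated with a first-order sketch on one-parameter linear tasks: an inner gradient step adapts a shared initialization to each sampled task, and the outer loop moves the initialization so that a single adaptation step performs well on fresh data. Everything below (task distribution, learning rates) is an illustrative assumption, not the paper's load-forecasting setup.

```python
import numpy as np

rng = np.random.default_rng(3)
meta_w, inner_lr, outer_lr = 0.0, 0.1, 0.01  # shared initialization and step sizes

for step in range(500):
    grad_sum = 0.0
    for _ in range(4):                       # sample a small batch of tasks
        w_task = rng.normal(2.0, 0.5)        # each task is y = w_task * x
        x = rng.standard_normal(20)
        y = w_task * x
        # inner loop: one gradient step on squared error from the meta-initialization
        g = np.mean(2 * (meta_w * x - y) * x)
        w_adapted = meta_w - inner_lr * g
        # outer gradient (first-order approximation) evaluated on fresh query data
        xq = rng.standard_normal(20)
        yq = w_task * xq
        grad_sum += np.mean(2 * (w_adapted * xq - yq) * xq)
    meta_w -= outer_lr * grad_sum / 4

# meta_w should land near the mean task slope (2.0), an initialization
# from which one inner step reaches any individual task quickly
```

The meta-learned initialization is what lets the method "rapidly adapt" to an unseen series with only a few gradient steps of task-specific data.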
arXiv Detail & Related papers (2024-06-09T18:59:08Z) - Diffusion posterior sampling for simulation-based inference in tall data settings [53.17563688225137]
Simulation-based inference (SBI) is capable of approximating the posterior distribution that relates input parameters to a given observation.
In this work, we consider a tall data extension in which multiple observations are available to better infer the parameters of the model.
We compare our method to recently proposed competing approaches on various numerical experiments and demonstrate its superiority in terms of numerical stability and computational cost.
arXiv Detail & Related papers (2024-04-11T09:23:36Z) - IGANN Sparse: Bridging Sparsity and Interpretability with Non-linear Insight [4.010646933005848]
IGANN Sparse is a novel machine learning model from the family of generalized additive models.
It promotes sparsity through a non-linear feature selection process during training.
This ensures interpretability through improved model sparsity without sacrificing predictive performance.
arXiv Detail & Related papers (2024-03-17T22:44:36Z) - Toward the Identifiability of Comparative Deep Generative Models [7.5479347719819865]
We propose a theory of identifiability for comparative Deep Generative Models (DGMs).
We show that, while these models lack identifiability across a general class of mixing functions, they surprisingly become identifiable when the mixing function is piece-wise affine.
We also investigate the impact of model misspecification, and empirically show that previously proposed regularization techniques for fitting comparative DGMs help with identifiability when the number of latent variables is not known in advance.
arXiv Detail & Related papers (2024-01-29T06:10:54Z) - An Empirical Study of Pre-trained Model Selection for Out-of-Distribution Generalization and Calibration [11.102950630209879]
In out-of-distribution (OOD) generalization tasks, fine-tuning pre-trained models has become a prevalent strategy.
We examined how pre-trained model size, pre-training dataset size, and training strategies impact generalization and uncertainty calibration.
arXiv Detail & Related papers (2023-07-17T01:27:10Z) - Prediction-Oriented Bayesian Active Learning [51.426960808684655]
Expected predictive information gain (EPIG) is an acquisition function that measures information gain in the space of predictions rather than parameters.
EPIG leads to stronger predictive performance compared with BALD across a range of datasets and models.
arXiv Detail & Related papers (2023-04-17T10:59:57Z) - SimSCOOD: Systematic Analysis of Out-of-Distribution Generalization in
Fine-tuned Source Code Models [58.78043959556283]
We study the behaviors of models under different fine-tuning methodologies, including full fine-tuning and Low-Rank Adaptation (LoRA) fine-tuning methods.
Our analysis uncovers that LoRA fine-tuning consistently exhibits significantly better OOD generalization performance than full fine-tuning across various scenarios.
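Low-Rank Adaptation (LoRA), which this paper compares against full fine-tuning, freezes a pretrained weight matrix and learns only a low-rank additive update. A minimal sketch of the forward pass, with toy dimensions and numpy standing in for a deep-learning framework:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 8, 2                         # model width and LoRA rank (r << d)
W = rng.standard_normal((d, d))     # frozen pretrained weight (never updated)
A = rng.standard_normal((d, r)) * 0.01
B = np.zeros((r, d))                # B starts at zero, so the update is a no-op initially

x = rng.standard_normal(d)
h = x @ (W + A @ B)                 # adapted forward pass: W plus rank-r update
```

Only A and B would receive gradient updates during fine-tuning (2*d*r parameters instead of d*d), and the zero initialization of B guarantees the adapted model matches the pretrained one exactly at the start of training.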
arXiv Detail & Related papers (2022-10-10T16:07:24Z) - MRCLens: an MRC Dataset Bias Detection Toolkit [82.44296974850639]
We introduce MRCLens, a toolkit that detects whether biases exist before users train the full model.
To make the toolkit easier to adopt, we also provide a categorization of common biases in MRC.
arXiv Detail & Related papers (2022-07-18T21:05:39Z) - Selecting the suitable resampling strategy for imbalanced data
classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
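Random oversampling, the simplest of the strategies this paper considers, resamples minority-class rows with replacement until the classes are balanced. A toy numpy sketch (the class sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# toy imbalanced dataset: 90 majority-class rows (0), 10 minority-class rows (1)
X = rng.standard_normal((100, 2))
y = np.array([0] * 90 + [1] * 10)

# random oversampling: draw minority rows with replacement until classes match
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=80, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

print(np.bincount(y_bal))  # [90 90]
```

Undersampling works in the opposite direction, discarding majority rows, which trades information loss for a smaller balanced training set.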
arXiv Detail & Related papers (2021-12-15T18:56:39Z) - A Clustering-aided Ensemble Method for Predicting Ridesourcing Demand in
Chicago [0.0]
This study proposes a Clustering-aided Ensemble Method (CEM) to forecast the zone-to-zone travel demand for ridesourcing services.
We implement and test the proposed methodology by using the ridesourcing-trip data in Chicago.
arXiv Detail & Related papers (2021-09-08T04:58:29Z) - Generalized Matrix Factorization: efficient algorithms for fitting
generalized linear latent variable models to large data arrays [62.997667081978825]
Generalized Linear Latent Variable Models (GLLVMs) generalize Gaussian factor models to non-Gaussian responses.
Current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets.
We propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood.
arXiv Detail & Related papers (2020-10-06T04:28:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.