For high-dimensional hierarchical models, consider exchangeability of
effects across covariates instead of across datasets
- URL: http://arxiv.org/abs/2107.06428v1
- Date: Tue, 13 Jul 2021 23:23:06 GMT
- Title: For high-dimensional hierarchical models, consider exchangeability of
effects across covariates instead of across datasets
- Authors: Brian L. Trippe, Hilary K. Finucane, Tamara Broderick
- Abstract summary: We show that standard practice exhibits poor statistical performance when the number of covariates exceeds the number of datasets.
In statistical genetics, we might regress dozens of traits (defining datasets) for thousands of individuals (responses) on up to millions of genetic variants.
We propose a hierarchical model expressing our alternative perspective.
- Score: 18.74167116981788
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hierarchical Bayesian methods enable information sharing across multiple
related regression problems. While standard practice is to model regression
parameters (effects) as (1) exchangeable across datasets and (2) correlated to
differing degrees across covariates, we show that this approach exhibits poor
statistical performance when the number of covariates exceeds the number of
datasets. For instance, in statistical genetics, we might regress dozens of
traits (defining datasets) for thousands of individuals (responses) on up to
millions of genetic variants (covariates). When an analyst has more covariates
than datasets, we argue that it is often more natural to instead model effects
as (1) exchangeable across covariates and (2) correlated to differing degrees
across datasets. To this end, we propose a hierarchical model expressing our
alternative perspective. We devise an empirical Bayes estimator for learning
the degree of correlation between datasets. We develop theory that demonstrates
that our method outperforms the classic approach when the number of covariates
dominates the number of datasets, and corroborate this result empirically on
several high-dimensional multiple regression and classification problems.
Related papers
- Linked shrinkage to improve estimation of interaction effects in
regression models [0.0]
We develop an estimator that adapts well to two-way interaction terms in a regression model.
We evaluate the potential of the model for inference, which is notoriously hard for selection strategies.
Our models can be very competitive to a more advanced machine learner, like random forest, even for fairly large sample sizes.
arXiv Detail & Related papers (2023-09-25T10:03:39Z) - Multiple Augmented Reduced Rank Regression for Pan-Cancer Analysis [0.0]
We propose multiple augmented reduced rank regression (maRRR), a flexible matrix regression and factorization method.
We consider a structured nuclear norm objective that is motivated by random matrix theory.
We apply maRRR to gene expression data from multiple cancer types (i.e., pan-cancer) from TCGA.
arXiv Detail & Related papers (2023-08-30T21:40:58Z) - Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z) - On the Strong Correlation Between Model Invariance and Generalization [54.812786542023325]
Generalization captures a model's ability to classify unseen data.
Invariance measures consistency of model predictions on transformations of the data.
From a dataset-centric view, we find a certain model's accuracy and invariance linearly correlated on different test sets.
arXiv Detail & Related papers (2022-07-14T17:08:25Z) - Equivariance Allows Handling Multiple Nuisance Variables When Analyzing
Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z) - Controlling for multiple covariates [0.0]
A fundamental problem in statistics is to compare the outcomes attained by members of subpopulations.
Comparison makes the most sense when performed separately for individuals who are similar according to certain characteristics.
arXiv Detail & Related papers (2021-12-01T17:37:36Z) - X-model: Improving Data Efficiency in Deep Learning with A Minimax Model [78.55482897452417]
We aim at improving data efficiency for both classification and regression setups in deep learning.
To take the power of both worlds, we propose a novel X-model.
X-model plays a minimax game between the feature extractor and task-specific heads.
arXiv Detail & Related papers (2021-10-09T13:56:48Z) - Variable selection with missing data in both covariates and outcomes:
Imputation and machine learning [1.0333430439241666]
The missing data issue is ubiquitous in health studies.
Machine learning methods weaken parametric assumptions.
XGBoost and BART have the overall best performance across various settings.
arXiv Detail & Related papers (2021-04-06T20:18:29Z) - Reduced-Rank Tensor-on-Tensor Regression and Tensor-variate Analysis of
Variance [11.193504036335503]
We extend the classical multivariate regression model to exploit such structure.
We obtain maximum likelihood estimators via block-relaxation algorithms.
A separate application performs three-way TANOVA on the Labeled Faces in the Wild image database.
arXiv Detail & Related papers (2020-12-18T14:04:41Z) - Two-step penalised logistic regression for multi-omic data with an
application to cardiometabolic syndrome [62.997667081978825]
We implement a two-step approach to multi-omic logistic regression in which variable selection is performed on each layer separately.
Our approach should be preferred if the goal is to select as many relevant predictors as possible.
Our proposed approach allows us to identify features that characterise cardiometabolic syndrome at the molecular level.
arXiv Detail & Related papers (2020-08-01T10:36:27Z) - Learning from Aggregate Observations [82.44304647051243]
We study the problem of learning from aggregate observations where supervision signals are given to sets of instances.
We present a general probabilistic framework that accommodates a variety of aggregate observations.
Simple maximum likelihood solutions can be applied to various differentiable models.
arXiv Detail & Related papers (2020-04-14T06:18:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.