Tutorial: a priori estimation of sample size, effect size, and
statistical power for cluster analysis, latent class analysis, and
multivariate mixture models
- URL: http://arxiv.org/abs/2309.00866v1
- Date: Sat, 2 Sep 2023 08:48:00 GMT
- Title: Tutorial: a priori estimation of sample size, effect size, and
statistical power for cluster analysis, latent class analysis, and
multivariate mixture models
- Authors: Edwin S Dalmaijer
- Abstract summary: This tutorial provides a roadmap to determining sample size and effect size for analyses that identify subgroups.
I introduce a procedure that allows researchers to formalise their expectations about effect sizes in their domain of choice.
Next, I outline how to establish the minimum sample size in subgroup analyses.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Before embarking on data collection, researchers typically compute how many
individual observations they should collect. This is vital for conducting studies with
sufficient statistical power, and is often a cornerstone of study
pre-registrations and grant applications. For traditional statistical tests,
one would typically determine an acceptable level of statistical power,
(gu)estimate effect size, and then use both values to compute the required
sample size. However, for analyses that identify subgroups, statistical power
is harder to establish. Once sample size reaches a sufficient threshold, effect
size is primarily determined by the number of measured features and the
underlying subgroup separation. As a consequence, a priori computations of
statistical power are notoriously complex. In this tutorial, I will provide a
roadmap to determining sample size and effect size for analyses that identify
subgroups. First, I introduce a procedure that allows researchers to formalise
their expectations about effect sizes in their domain of choice, and use this
to compute the minimally required number of measured variables. Next, I outline
how to establish the minimum sample size in subgroup analyses. Finally, I use
simulations to provide a reference table for the most popular subgroup
analyses: k-means, Ward agglomerative hierarchical clustering, c-means fuzzy
clustering, latent class analysis, latent profile analysis, and Gaussian
mixture modelling. The table shows the minimum numbers of observations per
expected subgroup (sample size) and features (measured variables) to achieve
acceptable statistical power, and can be readily used in study design.
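The two calculations contrasted in the abstract can be illustrated with a minimal sketch. It is not the tutorial's own code. The first function uses the standard normal-approximation formula for a two-sided, two-sample comparison of means. The second and third assume uncorrelated features that all share the same standardised effect size and unit within-subgroup variance, so that the distance between two subgroup centroids is the Euclidean norm of the per-feature effects. The function names and the target separation of 4.0 are hypothetical inputs chosen for illustration and are not taken from the abstract.

```python
import math
from statistics import NormalDist


def n_per_group_two_sample(d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Uses the normal approximation n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2,
    where d is the (guessed) standardised effect size.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


def centroid_separation(delta_per_feature, n_features):
    """Euclidean distance between two subgroup centroids.

    Assumption (labelled, not from the source): every feature is uncorrelated
    with the others and differs between subgroups by the same standardised
    amount, so the distance is delta * sqrt(number of features).
    """
    return delta_per_feature * math.sqrt(n_features)


def min_features(delta_per_feature, target_separation):
    """Smallest number of features whose combined separation reaches the target."""
    return math.ceil((target_separation / delta_per_feature) ** 2)


# Traditional test: medium effect, 80% power, alpha = 0.05.
print(n_per_group_two_sample(0.5))   # 63

# Subgroup analysis: hypothetical target separation of 4.0 with d = 0.5 per feature.
print(min_features(0.5, 4.0))        # 64
```

The normal approximation gives a slightly smaller n than an exact t-test calculation would. Under the stated assumptions, weak per-feature effects can add up to a large centroid separation only when many features are measured, which is why the number of features appears alongside sample size in the planning step.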
Related papers
- Sample Size in Natural Language Processing within Healthcare Research [0.14865681381012494]
Lack of sufficient corpora of previously collected data can be a limiting factor when determining sample sizes for new studies.
This paper tries to address the issue by making recommendations on sample sizes for text classification tasks in the healthcare domain.
arXiv Detail & Related papers (2023-09-05T13:42:43Z)
- Toward Generalizable Machine Learning Models in Speech, Language, and Hearing Sciences: Estimating Sample Size and Reducing Overfitting [1.8416014644193064]
This study uses Monte Carlo simulations to quantify the interactions between the employed cross-validation method and the discriminative power of features.
The required sample size with a single holdout could be 50% higher than what would be needed if nested cross-validation were used.
arXiv Detail & Related papers (2023-08-22T05:14:42Z)
- A Statistical View of Column Subset Selection [47.65143789184956]
We consider the problem of selecting a small subset of representative variables from a large dataset.
We show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.
arXiv Detail & Related papers (2023-07-24T15:42:33Z)
- Statistical and Computational Phase Transitions in Group Testing [73.55361918807883]
We study the group testing problem where the goal is to identify a set of k infected individuals carrying a rare disease.
We consider two different simple random procedures for assigning individuals tests.
arXiv Detail & Related papers (2022-06-15T16:38:50Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Model-based metrics: Sample-efficient estimates of predictive model subpopulation performance [11.994417027132807]
Machine learning models, now commonly developed to screen, diagnose, or predict health conditions, are evaluated with a variety of performance metrics.
Subpopulation performance metrics are typically computed using only data from that subgroup, resulting in higher variance estimates for smaller groups.
We propose using an evaluation model (a model that describes the conditional distribution of the predictive model score) to form model-based metric (MBM) estimates.
arXiv Detail & Related papers (2021-04-25T19:06:34Z)
- Flexible Model Aggregation for Quantile Regression [92.63075261170302]
Quantile regression is a fundamental problem in statistical learning motivated by a need to quantify uncertainty in predictions.
We investigate methods for aggregating any number of conditional quantile models.
All of the models we consider in this paper can be fit using modern deep learning toolkits.
arXiv Detail & Related papers (2021-02-26T23:21:16Z)
- Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \to \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z)
- Compressing Large Sample Data for Discriminant Analysis [78.12073412066698]
We consider the computational issues due to large sample size within the discriminant analysis framework.
We propose a new compression approach for reducing the number of training samples for linear and quadratic discriminant analysis.
arXiv Detail & Related papers (2020-05-08T05:09:08Z)
- Statistical power for cluster analysis [0.0]
Cluster algorithms are increasingly popular in biomedical research.
We estimate power and accuracy for common analyses through simulation.
We recommend that researchers only apply cluster analysis when large subgroup separation is expected.
arXiv Detail & Related papers (2020-03-01T02:43:15Z)