Sample Size in Natural Language Processing within Healthcare Research
- URL: http://arxiv.org/abs/2309.02237v1
- Date: Tue, 5 Sep 2023 13:42:43 GMT
- Title: Sample Size in Natural Language Processing within Healthcare Research
- Authors: Jaya Chaturvedi, Diana Shamsutdinova, Felix Zimmer, Sumithra
Velupillai, Daniel Stahl, Robert Stewart, Angus Roberts
- Abstract summary: Lack of sufficient corpora of previously collected data can be a limiting factor when determining sample sizes for new studies.
This paper tries to address the issue by making recommendations on sample sizes for text classification tasks in the healthcare domain.
- Score: 0.14865681381012494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sample size calculation is an essential step in most data-based disciplines.
Large enough samples ensure representativeness of the population and determine
the precision of estimates. This is true for most quantitative studies,
including those that employ machine learning methods, such as natural language
processing, where free-text is used to generate predictions and classify
instances of text. Within the healthcare domain, the lack of sufficient corpora
of previously collected data can be a limiting factor when determining sample
sizes for new studies. This paper tries to address the issue by making
recommendations on sample sizes for text classification tasks in the healthcare
domain.
Models trained on the MIMIC-III database of critical care records from Beth
Israel Deaconess Medical Center were used to classify documents as having or
not having Unspecified Essential Hypertension, the most common diagnosis code
in the database. Simulations were performed using various classifiers on
different sample sizes and class proportions. This was repeated for a
comparatively less common diagnosis code within the database of diabetes
mellitus without mention of complication.
Smaller sample sizes resulted in better results when using a K-nearest
neighbours classifier, whereas larger sample sizes provided better results with
support vector machines and BERT models. Overall, a sample size larger than
1000 was sufficient to provide decent performance metrics.
The simulations conducted within this study provide guidelines that can be
used as recommendations for selecting appropriate sample sizes and class
proportions, and for predicting expected performance, when building classifiers
for textual healthcare data. The methodology used here can be modified for
sample size estimates calculations with other datasets.
Related papers
- Using Large Language Models for Expert Prior Elicitation in Predictive Modelling [53.54623137152208]
This study proposes using large language models (LLMs) to elicit expert prior distributions for predictive models.
We compare LLM-elicited and uninformative priors, evaluate whether LLMs truthfully generate parameter distributions, and propose a model selection strategy for in-context learning and prior elicitation.
Our findings show that LLM-elicited prior parameter distributions significantly reduce predictive error compared to uninformative priors in low-data settings.
arXiv Detail & Related papers (2024-11-26T10:13:39Z) - Improving Extraction of Clinical Event Contextual Properties from Electronic Health Records: A Comparative Study [2.0884301753594334]
This study performs a comparative analysis of various natural language models for medical text classification.
BERT outperforms Bi-LSTM models by up to 28% and the baseline BERT model by up to 16% for recall of the minority classes.
arXiv Detail & Related papers (2024-08-30T10:28:49Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - Using text embedding models and vector databases as text classifiers
with the example of medical data [0.0]
We explore the use of vector databases and embedding models as a means of encoding, and classifying text with the example and application in the field of medicine.
We show the robustness of these tools depends heavily on the sparsity of the data presented, and even with low amounts of data in the vector database itself, the vector database does a good job at classifying data.
arXiv Detail & Related papers (2024-02-07T22:15:15Z) - Tutorial: a priori estimation of sample size, effect size, and
statistical power for cluster analysis, latent class analysis, and
multivariate mixture models [0.0]
This tutorial provides a roadmap to determining sample size and effect size for analyses that identify subgroups.
I introduce a procedure that allows researchers to formalise their expectations about effect sizes in their domain of choice.
Next, I outline how to establish the minimum sample size in subgroup analyses.
arXiv Detail & Related papers (2023-09-02T08:48:00Z) - Selecting the suitable resampling strategy for imbalanced data
classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z) - A Large Scale Benchmark for Individual Treatment Effect Prediction and
Uplift Modeling [7.1736440498963105]
Individual Treatment Effect (ITE) prediction aims at explaining and estimating the causal impact of an action at the granular level.
To foster research on this topic we release a publicly available collection of 13.9 million samples collected from several randomized control trials.
arXiv Detail & Related papers (2021-11-19T09:07:14Z) - A Real Use Case of Semi-Supervised Learning for Mammogram Classification
in a Local Clinic of Costa Rica [0.5541644538483946]
Training a deep learning model requires a considerable amount of labeled images.
A number of publicly available datasets have been built with data from different hospitals and clinics.
The use of the semi-supervised deep learning approach known as MixMatch, to leverage the usage of unlabeled data is proposed and evaluated.
arXiv Detail & Related papers (2021-07-24T22:26:50Z) - ALT-MAS: A Data-Efficient Framework for Active Testing of Machine
Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using Bayesian neural network (BNN)
arXiv Detail & Related papers (2021-04-11T12:14:04Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype
Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z) - Self-Training with Improved Regularization for Sample-Efficient Chest
X-Ray Classification [80.00316465793702]
We present a deep learning framework that enables robust modeling in challenging scenarios.
Our results show that using 85% lesser labeled data, we can build predictive models that match the performance of classifiers trained in a large-scale data setting.
arXiv Detail & Related papers (2020-05-03T02:36:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.