A comparative study of zero-shot inference with large language models and supervised modeling in breast cancer pathology classification
- URL: http://arxiv.org/abs/2401.13887v1
- Date: Thu, 25 Jan 2024 02:05:31 GMT
- Title: A comparative study of zero-shot inference with large language models and supervised modeling in breast cancer pathology classification
- Authors: Madhumita Sushil, Travis Zack, Divneet Mandair, Zhiwei Zheng, Ahmed Wali, Yan-Ning Yu, Yuwei Quan, Atul J. Butte
- Abstract summary: Large language models (LLMs) have demonstrated promising transfer learning capability.
LLMs demonstrated the potential to speed up the execution of clinical NLP studies by reducing the need for curating large annotated datasets.
This may result in an increase in the utilization of NLP-based variables and outcomes in observational clinical studies.
- Score: 1.4715634464004446
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Although supervised machine learning is popular for information extraction
from clinical notes, creating large annotated datasets requires extensive
domain expertise and is time-consuming. Meanwhile, large language models (LLMs)
have demonstrated promising transfer learning capability. In this study, we
explored whether recent LLMs can reduce the need for large-scale data
annotations. We curated a dataset of 769 breast cancer pathology reports,
manually labeled with 13 categories, to compare the zero-shot
classification capability of the GPT-4 model and the GPT-3.5 model with
supervised classification performance of three model architectures: random
forests classifier, long short-term memory networks with attention (LSTM-Att),
and the UCSF-BERT model. Across all 13 tasks, the GPT-4 model performed either
significantly better than or as well as the best supervised model, the LSTM-Att
model (average macro F1 score of 0.83 vs. 0.75). On tasks with high imbalance
between labels, the differences were more prominent. Frequent sources of GPT-4
errors included inferences from multiple samples and complex task design. On
complex tasks where large annotated datasets cannot be easily collected, LLMs
can reduce the burden of large-scale data labeling. However, if the use of LLMs
is prohibitive, the use of simpler supervised models with large annotated
datasets can provide comparable results. LLMs demonstrated the potential to
speed up the execution of clinical NLP studies by reducing the need for
curating large annotated datasets. This may result in an increase in the
utilization of NLP-based variables and outcomes in observational clinical
studies.
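
For readers who want to see the shape of this comparison, the sketch below pairs a zero-shot LLM classifier with one of the supervised baselines named in the abstract (a random forest) and scores both with macro F1, the metric reported above. This is an illustrative reconstruction, not the authors' code: the label set, the prompt wording, and the TF-IDF featurization are assumptions; only the overall design (zero-shot GPT classification versus a supervised classifier, compared by macro F1) comes from the abstract.

```python
# Illustrative sketch of the study design (not the authors' code). Assumes the
# OpenAI Python client (>= 1.0) and scikit-learn; LABELS, the prompt wording,
# and the TF-IDF featurization are hypothetical stand-ins.
from openai import OpenAI
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

LABELS = ["positive", "negative", "not mentioned"]  # hypothetical label set

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def zero_shot_label(report: str, question: str) -> str:
    """Zero-shot classification: no task-specific training, only a prompt."""
    prompt = (
        f"Pathology report:\n{report}\n\n"
        f"{question}\n"
        f"Answer with exactly one of: {', '.join(LABELS)}."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic decoding for a classification task
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower()


def train_supervised_baseline(train_texts, train_labels):
    """One supervised baseline named in the abstract: a random forest, here
    over a simple TF-IDF representation of the report text."""
    model = make_pipeline(TfidfVectorizer(), RandomForestClassifier())
    model.fit(train_texts, train_labels)
    return model


def macro_f1(gold, pred):
    """Macro F1 averages per-class F1 equally, so rare labels count as much
    as frequent ones -- the metric behind the 0.83 vs. 0.75 comparison."""
    return f1_score(gold, pred, labels=LABELS, average="macro")
```

Because macro F1 weights every class equally, one plausible reading of the abstract's imbalance finding is that a supervised model starved of rare-label examples forfeits that class's entire term in the average, widening the gap to GPT-4 on skewed tasks.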
Related papers
- LLM-Select: Feature Selection with Large Language Models [64.5099482021597]
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science.
Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
arXiv Detail & Related papers (2024-07-02T22:23:40Z)
- Improving Entity Recognition Using Ensembles of Deep Learning and Fine-tuned Large Language Models: A Case Study on Adverse Event Extraction from Multiple Sources [13.750202656564907]
Adverse event (AE) extraction is crucial for monitoring and analyzing the safety profiles of immunizations.
This study aims to evaluate the effectiveness of large language models (LLMs) and traditional deep learning models in AE extraction.
arXiv Detail & Related papers (2024-06-26T03:56:21Z)
- Evaluating Large Language Models for Health-Related Text Classification Tasks with Public Social Media Data [3.9459077974367833]
Large language models (LLMs) have demonstrated remarkable success in NLP tasks.
We benchmarked one supervised classic machine learning model based on Support Vector Machines (SVMs), three supervised pretrained language models (PLMs) based on RoBERTa, BERTweet, and SocBERT, and two LLM-based classifiers (GPT-3.5 and GPT-4), across 6 text classification tasks.
Our comprehensive experiments demonstrate that employing data augmentation with LLMs (GPT-4) alongside relatively small human-annotated data to train lightweight supervised classification models achieves superior results compared to training with human-annotated data alone.
arXiv Detail & Related papers (2024-03-27T22:05:10Z)
- Identifying Factual Inconsistencies in Summaries: Grounding Model Inference via Task Taxonomy [48.29181662640212]
Factual inconsistencies pose a significant hurdle for faithful summarization by generative models.
We consolidate key error types of inconsistent facts in summaries, and incorporate them to facilitate both the zero-shot and supervised paradigms of LLMs.
arXiv Detail & Related papers (2024-02-20T08:41:23Z)
- Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data, and then a minimal number of available labeled data points are assigned to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques (a toy sketch of the BMU-labeling idea appears after this list).
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- Scaling Relationship on Learning Mathematical Reasoning with Large Language Models [75.29595679428105]
We investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM.
We find that rejection samples from multiple models push LLaMA-7B to an accuracy of 49.3% on GSM8K, significantly outperforming the supervised fine-tuning (SFT) accuracy of 35.9%.
arXiv Detail & Related papers (2023-08-03T15:34:01Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We further examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models.
We present three findings across four NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z)
- CancerGPT: Few-shot Drug Pair Synergy Prediction using Large Pre-trained Language Models [3.682742580232362]
Large pre-trained language models (LLMs) have been shown to have significant potential in few-shot learning across various fields.
Our research is the first to tackle drug pair synergy prediction in rare tissues with limited data.
arXiv Detail & Related papers (2023-04-18T02:49:53Z)
- CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection [36.08551407926805]
We propose the CLIP-Driven Universal Model, which incorporates text embedding learned from Contrastive Language-Image Pre-training to segmentation models.
The proposed model is developed from an assembly of 14 datasets, using a total of 3,410 CT scans for training and then evaluated on 6,162 external CT scans from 3 additional datasets.
arXiv Detail & Related papers (2023-01-02T18:07:44Z)
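
As flagged in the self-organizing-map entry above, here is a toy illustration of the BMU-labeling idea it describes. It is not the paper's implementation: the data are random placeholders, the third-party `minisom` package stands in for whatever SOM code the authors used, and nearest-BMU lookup is a crude stand-in for their topological projections.

```python
# Toy sketch of minimally supervised learning with a SOM (not the paper's code).
# Assumes numpy and the third-party `minisom` package; all data are synthetic.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)

# 1. Train the SOM on unlabeled data only.
unlabeled = rng.normal(size=(500, 4))  # placeholder features
som = MiniSom(10, 10, input_len=4, sigma=1.5, learning_rate=0.5, random_seed=0)
som.train_random(unlabeled, num_iteration=2000)

# 2. Assign a minimal set of labeled points to their best matching units (BMUs).
labeled_x = rng.normal(size=(10, 4))
labeled_y = rng.normal(size=10)  # regression targets, as in the entry above
bmu_to_label = {som.winner(x): y for x, y in zip(labeled_x, labeled_y)}

# 3. Predict by projecting a new point onto the map and taking the label of the
#    nearest labeled BMU in grid coordinates.
def predict(x: np.ndarray) -> float:
    i, j = som.winner(x)
    nearest = min(bmu_to_label, key=lambda u: (u[0] - i) ** 2 + (u[1] - j) ** 2)
    return bmu_to_label[nearest]

print(predict(rng.normal(size=4)))
```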
This list is automatically generated from the titles and abstracts of the papers on this site.