Identifying and Extracting Rare Disease Phenotypes with Large Language
Models
- URL: http://arxiv.org/abs/2306.12656v1
- Date: Thu, 22 Jun 2023 03:52:12 GMT
- Title: Identifying and Extracting Rare Disease Phenotypes with Large Language
Models
- Authors: Cathy Shyr, Yan Hu, Paul A. Harris, Hua Xu
- Abstract summary: ChatGPT is a revolutionary large language model capable of following complex human prompts and generating high-quality responses.
We compared its performance to the traditional fine-tuning approach and conducted an in-depth error analysis.
ChatGPT achieved similar or higher accuracy for certain entities (i.e., rare diseases and signs) in the one-shot setting.
- Score: 12.555067118549347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Rare diseases (RDs) are collectively common and affect 300 million people
worldwide. Accurate phenotyping is critical for informing diagnosis and
treatment, but RD phenotypes are often embedded in unstructured text and
time-consuming to extract manually. While natural language processing (NLP)
models can perform named entity recognition (NER) to automate extraction, a
major bottleneck is the development of a large, annotated corpus for model
training. Recently, prompt learning emerged as an NLP paradigm that can lead to
more generalizable results without any (zero-shot) or few labeled samples
(few-shot). Despite growing interest in ChatGPT, a revolutionary large language
model capable of following complex human prompts and generating high-quality
responses, no prior work has studied its NER performance for RDs in the zero- and
few-shot settings. To this end, we engineered novel prompts aimed at extracting
RD phenotypes and, to the best of our knowledge, are the first to establish a
benchmark for evaluating ChatGPT's performance in these settings. We compared
its performance to the traditional fine-tuning approach and conducted an
in-depth error analysis. Overall, fine-tuning BioClinicalBERT resulted in
higher performance (F1 of 0.689) than ChatGPT (F1 of 0.472 and 0.591 in the
zero- and few-shot settings, respectively). Despite this, ChatGPT achieved
similar or higher accuracy for certain entities (i.e., rare diseases and signs)
in the one-shot setting (F1 of 0.776 and 0.725). This suggests that with
appropriate prompt engineering, ChatGPT has the potential to match or
outperform fine-tuned language models for certain entity types with just one
labeled sample. While the proliferation of large language models may provide
opportunities for supporting RD diagnosis and treatment, researchers and
clinicians should critically evaluate model outputs and be well-informed of
their limitations.
Related papers
- Leveraging Prompt-Learning for Structured Information Extraction from Crohn's Disease Radiology Reports in a Low-Resource Language [11.688665498310405]
SMP-BERT is a novel prompt learning method for automatically converting free-text radiology reports into structured data.
In our studies, SMP-BERT greatly surpassed traditional fine-tuning methods in performance, notably in detecting infrequent conditions.
arXiv Detail & Related papers (2024-05-02T19:11:54Z) - Use GPT-J Prompt Generation with RoBERTa for NER Models on Diagnosis
Extraction of Periodontal Diagnosis from Electronic Dental Records [6.636721448099117]
Prompt generation with GPT-J models was used to test the gold standard and to generate the seed.
Performance was consistent across all settings after training with the RoBERTa model, with F1 scores of 0.92-0.97.
arXiv Detail & Related papers (2023-11-17T18:14:08Z) - Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) are now in widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z) - An evaluation of GPT models for phenotype concept recognition [0.4715973318447338]
We examine the performance of the latest Generative Pre-trained Transformer (GPT) models for clinical phenotyping and phenotype annotation.
Our results show that, with an appropriate setup, these models can achieve state-of-the-art performance.
arXiv Detail & Related papers (2023-09-29T12:06:55Z) - Large Language Models to Identify Social Determinants of Health in
Electronic Health Records [2.168737004368243]
Social determinants of health (SDoH) have an important impact on patient outcomes but are incompletely captured in electronic health records (EHRs).
This study investigated the ability of large language models to extract SDoH from free text in EHRs, where they are most commonly documented.
800 patient notes were annotated for SDoH categories, and several transformer-based models were evaluated.
arXiv Detail & Related papers (2023-08-11T19:18:35Z) - Preserving Knowledge Invariance: Rethinking Robustness Evaluation of
Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
Under this elaborated robustness metric, a model is judged to be robust if its performance is consistently accurate across the cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z) - Smaller Language Models are Better Black-box Machine-Generated Text
Detectors [56.36291277897995]
Small and partially-trained models are better universal text detectors.
We find that whether the detector and generator were trained on the same data is not critically important to the detection success.
For instance, the OPT-125M model has an AUC of 0.81 in detecting ChatGPT generations, whereas a larger model from the GPT family, GPT-J-6B, has an AUC of 0.45.
arXiv Detail & Related papers (2023-05-17T00:09:08Z) - Exploring the Trade-Offs: Unified Large Language Models vs Local
Fine-Tuned Models for Highly-Specific Radiology NLI Task [49.50140712943701]
We evaluate the performance of ChatGPT/GPT-4 on a radiology NLI task and compare it to other models fine-tuned specifically on task-related data samples.
We also conduct a comprehensive investigation on ChatGPT/GPT-4's reasoning ability by introducing varying levels of inference difficulty.
arXiv Detail & Related papers (2023-04-18T17:21:48Z) - Natural Language Processing Methods to Identify Oncology Patients at
High Risk for Acute Care with Clinical Notes [9.49721872804122]
This paper evaluates how natural language processing can be used to identify the risk of acute care use (ACU) in oncology patients.
Risk prediction using structured health data (SHD) is now standard, but predictions using free-text formats are complex.
arXiv Detail & Related papers (2022-09-28T06:31:19Z) - A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect term, category, and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks as a sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) in average performance by a large margin in both few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z) - Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of
Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)