Improving Large Language Models for Clinical Named Entity Recognition
via Prompt Engineering
- URL: http://arxiv.org/abs/2303.16416v3
- Date: Thu, 25 Jan 2024 04:02:23 GMT
- Title: Improving Large Language Models for Clinical Named Entity Recognition
via Prompt Engineering
- Authors: Yan Hu, Qingyu Chen, Jingcheng Du, Xueqing Peng, Vipina Kuttichi
Keloth, Xu Zuo, Yujia Zhou, Zehan Li, Xiaoqian Jiang, Zhiyong Lu, Kirk
Roberts, Hua Xu
- Abstract summary: This study quantifies the capabilities of GPT-3.5 and GPT-4 for clinical named entity recognition (NER) tasks.
We developed a task-specific prompt framework that includes baseline prompts, annotation guideline-based prompts, error analysis-based instructions, and annotated samples.
We assessed each prompt's effectiveness and compared the models to BioClinicalBERT.
- Score: 20.534197056683695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Objective: This study quantifies the capabilities of GPT-3.5 and GPT-4 for
clinical named entity recognition (NER) tasks and proposes task-specific
prompts to improve their performance. Materials and Methods: We evaluated these
models on two clinical NER tasks: (1) extracting medical problems, treatments,
and tests from clinical notes in the MTSamples corpus, following the 2010 i2b2
concept extraction shared task, and (2) identifying nervous system
disorder-related adverse events from safety reports in the Vaccine Adverse
Event Reporting System (VAERS). To improve the GPT models' performance, we
developed a clinical task-specific prompt framework that includes (1) baseline
prompts with task description and format specification, (2) annotation
guideline-based prompts, (3) error analysis-based instructions, and (4)
annotated samples for few-shot learning. We assessed each prompt's
effectiveness and compared the models to BioClinicalBERT. Results: Using
baseline prompts, GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.634, 0.804
for MTSamples, and 0.301, 0.593 for VAERS. Additional prompt components
consistently improved model performance. When all four components were used,
GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.794, 0.861 for MTSamples and
0.676, 0.736 for VAERS, demonstrating the effectiveness of our prompt
framework. Although these results trail BioClinicalBERT (F1 of 0.901 for
MTSamples and 0.802 for VAERS), they are promising given that only a few
training samples are needed. Conclusion: While direct application of GPT
models to clinical NER tasks falls short of optimal performance, our
task-specific prompt framework, incorporating medical knowledge and training
samples, significantly enhances GPT models' feasibility for potential clinical
applications.
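The four prompt components and the relaxed metric are easy to sketch in code.
Below is a minimal, illustrative Python sketch of how such a prompt could be
assembled for a GPT-based NER call; the component texts, entity types, and
output format are hypothetical stand-ins, not the authors' released prompts.

```python
# Minimal sketch of the four-component prompt framework (illustrative only).
# All component texts below are hypothetical, not the paper's actual prompts.

# (1) Baseline prompt: task description and format specification.
BASELINE = (
    "You are a clinical NER system. Extract all medical problem, treatment, "
    "and test entities from the clinical note below. Return one entity per "
    "line in the form: <entity text> | <entity type>"
)

# (2) Annotation guideline-based prompt (assumed, paraphrased rules).
GUIDELINES = (
    "Annotation guidelines:\n"
    "- Include modifiers that belong to the concept (e.g., 'severe chest pain').\n"
    "- Do not include negation words inside the entity span."
)

# (3) Error analysis-based instructions (hypothetical examples of such rules).
ERROR_NOTES = (
    "Common errors to avoid:\n"
    "- Do not label a body part alone as a problem.\n"
    "- Do not split one multi-word test name into several entities."
)

# (4) Annotated samples for few-shot learning (hypothetical mini-example).
FEW_SHOT = (
    "Example note: She was started on metformin for type 2 diabetes.\n"
    "Entities:\n"
    "metformin | treatment\n"
    "type 2 diabetes | problem"
)

def build_prompt(note: str) -> str:
    """Concatenate all four components with the input clinical note."""
    return "\n\n".join(
        [BASELINE, GUIDELINES, ERROR_NOTES, FEW_SHOT,
         "Clinical note:\n" + note, "Entities:"]
    )
```

The "relaxed" F1 reported above is typically computed with overlap-based span
matching rather than exact boundaries. The sketch below shows one common
reading of that metric, assuming entities are (start, end, type) spans with
half-open character offsets; the paper's exact matching rules may differ.

```python
def relaxed_f1(pred, gold):
    """Overlap-based (relaxed) F1 over (start, end, type) spans."""
    def overlaps(a, b):
        # Same entity type and overlapping half-open character intervals.
        return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]

    tp = sum(any(overlaps(p, g) for g in gold) for p in pred)
    matched_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    precision = tp / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```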
Related papers
- Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
A companion instruction-tuning dataset, MedS-Ins, comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
- CTBench: A Comprehensive Benchmark for Evaluating Language Model Capabilities in Clinical Trial Design [15.2100541345819]
CTBench is introduced as a benchmark to assess language models (LMs) in aiding clinical study design.
It consists of two datasets: "CT-Repo," containing baseline features from 1,690 clinical trials sourced from clinicaltrials.gov, and "CT-Pub," a subset of 100 trials with more comprehensive baseline features gathered from relevant publications.
arXiv Detail & Related papers (2024-06-25T18:52:48Z)
- Towards Efficient Patient Recruitment for Clinical Trials: Application of a Prompt-Based Learning Model [0.7373617024876725]
Clinical trials are essential for advancing pharmaceutical interventions, but they face a bottleneck in selecting eligible participants.
The complex nature of unstructured medical texts presents challenges in efficiently identifying participants.
In this study, we aimed to evaluate the performance of a prompt-based large language model for the cohort selection task.
arXiv Detail & Related papers (2024-04-24T20:42:28Z)
- Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
Inference with LLaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z)
- Low-resource classification of mobility functioning information in clinical sentences using large language models [0.0]
This study evaluates the ability of publicly available large language models (LLMs) to accurately identify the presence of functioning information from clinical notes.
We collect a balanced binary classification dataset of 1000 sentences from the Mobility NER dataset, which was curated from n2c2 clinical notes.
arXiv Detail & Related papers (2023-12-15T20:59:17Z)
- Leveraging deep active learning to identify low-resource mobility functioning information in public clinical notes [0.157286095422595]
We present the first public annotated dataset specifically on the Mobility domain of the International Classification of Functioning, Disability and Health (ICF).
We utilize the National NLP Clinical Challenges (n2c2) research dataset to construct a pool of candidate sentences using keyword expansion.
Our final dataset consists of 4,265 sentences with a total of 11,784 entities, including 5,511 Action entities, 5,328 Mobility entities, 306 Assistance entities, and 639 Quantification entities.
arXiv Detail & Related papers (2023-11-27T15:53:11Z)
- An evaluation of GPT models for phenotype concept recognition [0.4715973318447338]
We examine the performance of the latest Generative Pre-trained Transformer (GPT) models for clinical phenotyping and phenotype annotation.
Our results show that, with an appropriate setup, these models can achieve state-of-the-art performance.
arXiv Detail & Related papers (2023-09-29T12:06:55Z)
- TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic Tree-Based Memory Network [54.332862955411656]
Clinical trials are critical for drug development but often suffer from expensive and inefficient patient recruitment.
In recent years, machine learning models have been proposed for speeding up patient recruitment via automatically matching patients with clinical trials.
We introduce a dynamic tree-based memory network model named TREEMENT to provide accurate and interpretable patient trial matching.
arXiv Detail & Related papers (2023-07-19T12:35:09Z)
- AutoTrial: Prompting Language Models for Clinical Trial Design [53.630479619856516]
We present a method named AutoTrial to aid the design of clinical eligibility criteria using language models.
Experiments on over 70K clinical trials verify that AutoTrial generates high-quality criteria texts.
arXiv Detail & Related papers (2023-05-19T01:04:16Z)
- Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We built a hybrid system that merges the results of a deep learning model with manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z)
- Systematic Clinical Evaluation of A Deep Learning Method for Medical Image Segmentation: Radiosurgery Application [48.89674088331313]
We systematically evaluate a Deep Learning (DL) method in a 3D medical image segmentation task.
Our method is integrated into the radiosurgery treatment process and directly impacts the clinical workflow.
arXiv Detail & Related papers (2021-08-21T16:15:40Z)