Improving Large Language Models for Clinical Named Entity Recognition
via Prompt Engineering
- URL: http://arxiv.org/abs/2303.16416v3
- Date: Thu, 25 Jan 2024 04:02:23 GMT
- Title: Improving Large Language Models for Clinical Named Entity Recognition
via Prompt Engineering
- Authors: Yan Hu, Qingyu Chen, Jingcheng Du, Xueqing Peng, Vipina Kuttichi
Keloth, Xu Zuo, Yujia Zhou, Zehan Li, Xiaoqian Jiang, Zhiyong Lu, Kirk
Roberts, Hua Xu
- Abstract summary: This study quantifies the capabilities of GPT-3.5 and GPT-4 for clinical named entity recognition (NER) tasks.
We developed a task-specific prompt framework that includes baseline prompts, annotation guideline-based prompts, error analysis-based instructions, and annotated samples.
We assessed each prompt's effectiveness and compared the models to BioClinicalBERT.
- Score: 20.534197056683695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Objective: This study quantifies the capabilities of GPT-3.5 and GPT-4 for
clinical named entity recognition (NER) tasks and proposes task-specific
prompts to improve their performance. Materials and Methods: We evaluated these
models on two clinical NER tasks: (1) extracting medical problems, treatments,
and tests from clinical notes in the MTSamples corpus, following the 2010 i2b2
concept extraction shared task, and (2) identifying nervous system
disorder-related adverse events from safety reports in the Vaccine Adverse
Event Reporting System (VAERS). To improve the GPT models' performance, we
developed a clinical task-specific prompt framework that includes (1) baseline
prompts with task description and format specification, (2) annotation
guideline-based prompts, (3) error analysis-based instructions, and (4)
annotated samples for few-shot learning. We assessed each prompt's
effectiveness and compared the models to BioClinicalBERT. Results: Using
baseline prompts, GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.634 and
0.804 for MTSamples, and 0.301 and 0.593 for VAERS. Additional prompt components
consistently improved model performance. When all four components were used,
GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.794 and 0.861 for MTSamples and
0.676 and 0.736 for VAERS, demonstrating the effectiveness of our prompt
framework. Although these results trail BioClinicalBERT (F1 of 0.901 for
MTSamples and 0.802 for VAERS), they are very promising given that
few training samples are needed. Conclusion: While direct application of GPT
models to clinical NER tasks falls short of optimal performance, our
task-specific prompt framework, incorporating medical knowledge and training
samples, significantly enhances GPT models' feasibility for potential clinical
applications.
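For intuition, the prompt framework described in the abstract can be pictured as composing four text components (baseline task description, annotation guidelines, error analysis-based instructions, and few-shot examples) into a single instruction before calling a GPT model. The sketch below is a minimal illustration only: the component texts, the `build_prompt`/`extract_entities` helpers, and the output format are hypothetical stand-ins, not the authors' released prompts or code.

```python
# Minimal sketch of the four-component prompt framework for clinical NER.
# All component texts and parsing conventions are illustrative assumptions,
# not the authors' actual prompts or code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

BASELINE = (
    "You are a clinical NER system. Extract all medical problems, treatments, "
    "and tests from the clinical note. Return one entity per line as "
    "<entity text> | <entity type>."
)

GUIDELINES = (
    "Annotation guidelines (abridged, hypothetical): include the full noun "
    "phrase for each concept; do not include negation words in the span."
)

ERROR_INSTRUCTIONS = (
    "Common errors to avoid (from error analysis): do not label anatomical "
    "sites alone as problems; do not split multi-word test names."
)

FEW_SHOT_EXAMPLES = (
    "Note: Patient denies chest pain; started on lisinopril.\n"
    "Entities:\nchest pain | problem\nlisinopril | treatment"
)

def build_prompt(note: str) -> str:
    """Concatenate the four prompt components with the input note."""
    return "\n\n".join(
        [BASELINE, GUIDELINES, ERROR_INSTRUCTIONS, FEW_SHOT_EXAMPLES,
         f"Note: {note}\nEntities:"]
    )

def extract_entities(note: str, model: str = "gpt-4") -> list[tuple[str, str]]:
    """Call the chat model and parse '<text> | <type>' lines from the reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(note)}],
        temperature=0,
    )
    lines = response.choices[0].message.content.splitlines()
    return [tuple(part.strip() for part in line.split("|", 1))
            for line in lines if "|" in line]
```

Under this framing, the ablation reported in the Results amounts to concatenating progressively more components (baseline only, plus guidelines, plus error instructions, plus few-shot examples) and re-running the same extraction call.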
Related papers
- Process-Supervised Reward Models for Verifying Clinical Note Generation: A Scalable Approach Guided by Domain Expertise [19.71388941192149]
We train a PRM to provide step-level reward signals for clinical notes generated by large language models (LLMs).
Our proposed PRM, trained on the LLaMA-3.1 8B instruct model, outperformed both Gemini-Pro 1.5 and the vanilla outcome-supervised reward model (ORM) in two key evaluations.
arXiv Detail & Related papers (2024-12-17T06:24:34Z) - SemioLLM: Assessing Large Language Models for Semiological Analysis in Epilepsy Research [45.2233252981348]
Large Language Models have shown promising results in their ability to encode general medical knowledge.
We test the ability of state-of-the-art LLMs to leverage their internal knowledge and reasoning for epilepsy diagnosis.
arXiv Detail & Related papers (2024-07-03T11:02:12Z) - Towards a Holistic Framework for Multimodal Large Language Models in Three-dimensional Brain CT Report Generation [42.06416052431378]
2D radiology captioning is inadequate to reflect the real-world diagnostic challenges of volumetric 3D anatomy.
We collected a 3D-BrainCT dataset of 18,885 text-scan pairs and applied clinical visual instruction tuning to train BrainGPT models to generate radiology-adherent 3D brain CT reports.
Our work embodies a holistic framework that showcased the first-hand experience of curating a 3D brain CT dataset, fine-tuning anatomy-sensible language models, and proposing robust radiology evaluation metrics.
arXiv Detail & Related papers (2024-07-02T12:58:35Z) - CTBench: A Comprehensive Benchmark for Evaluating Language Model Capabilities in Clinical Trial Design [15.2100541345819]
CTBench is introduced as a benchmark to assess language models (LMs) in aiding clinical study design.
It consists of two datasets: "CT-Repo," containing baseline features from 1,690 clinical trials sourced from clinicaltrials.gov, and "CT-Pub," a subset of 100 trials with more comprehensive baseline features gathered from relevant publications.
arXiv Detail & Related papers (2024-06-25T18:52:48Z) - Towards Efficient Patient Recruitment for Clinical Trials: Application of a Prompt-Based Learning Model [0.7373617024876725]
Clinical trials are essential for advancing pharmaceutical interventions, but they face a bottleneck in selecting eligible participants.
The complex nature of unstructured medical texts presents challenges in efficiently identifying participants.
In this study, we aimed to evaluate the performance of a prompt-based large language model for the cohort selection task.
arXiv Detail & Related papers (2024-04-24T20:42:28Z) - Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
The inference of LLaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z) - Low-resource classification of mobility functioning information in
clinical sentences using large language models [0.0]
This study evaluates the ability of publicly available large language models (LLMs) to accurately identify the presence of functioning information from clinical notes.
We collect a balanced binary classification dataset of 1000 sentences from the Mobility NER dataset, which was curated from n2c2 clinical notes.
arXiv Detail & Related papers (2023-12-15T20:59:17Z) - An evaluation of GPT models for phenotype concept recognition [0.4715973318447338]
We examine the performance of the latest Generative Pre-trained Transformer (GPT) models for clinical phenotyping and phenotype annotation.
Our results show that, with an appropriate setup, these models can achieve state-of-the-art performance.
arXiv Detail & Related papers (2023-09-29T12:06:55Z) - AutoTrial: Prompting Language Models for Clinical Trial Design [53.630479619856516]
We present a method named AutoTrial to aid the design of clinical eligibility criteria using language models.
Experiments on over 70K clinical trials verify that AutoTrial generates high-quality criteria texts.
arXiv Detail & Related papers (2023-05-19T01:04:16Z) - Systematic Clinical Evaluation of A Deep Learning Method for Medical
Image Segmentation: Radiosurgery Application [48.89674088331313]
We systematically evaluate a Deep Learning (DL) method in a 3D medical image segmentation task.
Our method is integrated into the radiosurgery treatment process and directly impacts the clinical workflow.
arXiv Detail & Related papers (2021-08-21T16:15:40Z) - Predicting Clinical Trial Results by Implicit Evidence Integration [40.80948875051806]
We introduce a novel Clinical Trial Result Prediction (CTRP) task.
In the CTRP framework, a model takes a PICO-formatted clinical trial proposal with its background as input and predicts the result.
We exploit large-scale unstructured sentences from medical literature that implicitly contain PICOs and results as evidence.
arXiv Detail & Related papers (2020-10-12T12:25:41Z)