Does Synthetic Data Generation of LLMs Help Clinical Text Mining?
- URL: http://arxiv.org/abs/2303.04360v2
- Date: Mon, 10 Apr 2023 18:47:51 GMT
- Title: Does Synthetic Data Generation of LLMs Help Clinical Text Mining?
- Authors: Ruixiang Tang, Xiaotian Han, Xiaoqian Jiang, Xia Hu
- Abstract summary: We investigate the potential of OpenAI's ChatGPT to aid in clinical text mining.
We propose a new training paradigm that involves generating a vast quantity of high-quality synthetic data.
Our method has resulted in significant improvements in the performance of downstream tasks.
- Score: 51.205078179427645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in large language models (LLMs) have led to the
development of highly potent models like OpenAI's ChatGPT. These models have
exhibited exceptional performance in a variety of tasks, such as question
answering, essay composition, and code generation. However, their effectiveness
in the healthcare sector remains uncertain. In this study, we seek to
investigate the potential of ChatGPT to aid in clinical text mining by
examining its ability to extract structured information from unstructured
healthcare texts, with a focus on biological named entity recognition and
relation extraction. Our preliminary results, however, indicate that employing
ChatGPT directly for these tasks results in poor performance and raises the
privacy concerns associated with uploading patients' information to the ChatGPT
API. To overcome these limitations, we propose a new training paradigm that
involves generating a vast quantity of high-quality synthetic data with labels
utilizing ChatGPT and fine-tuning a local model for the downstream task. Our
method has resulted in significant improvements in the performance of
downstream tasks, improving the F1-score from 23.37% to 63.99% for the named
entity recognition task and from 75.86% to 83.59% for the relation extraction
task. Furthermore, generating data using ChatGPT can significantly reduce the
time and effort required for data collection and labeling, as well as mitigate
data privacy concerns. In summary, the proposed framework presents a promising
solution for enhancing the applicability of LLMs to clinical text mining.
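Below is a minimal sketch of the generate-then-fine-tune paradigm the abstract describes, assuming the pre-1.0 `openai` Python client; the prompt wording, entity tags, and BIO conversion are illustrative assumptions, not the authors' exact pipeline.

```python
# Stage 1 of the paradigm: ask ChatGPT for synthetic, entity-tagged
# clinical sentences, then convert the tags into token-level BIO labels.
# No real patient text is ever sent to the API.
import re

import openai  # pre-1.0 client: pip install "openai<1"; set openai.api_key

PROMPT = (
    "Write 10 synthetic sentences that read like de-identified clinical "
    "notes. Wrap each disease mention in <disease>...</disease> and each "
    "drug mention in <drug>...</drug>. One sentence per line."
)
TAG = re.compile(r"<(disease|drug)>(.*?)</\1>")

def to_bio(line: str) -> tuple[list[str], list[str]]:
    """Convert one tagged sentence into (tokens, BIO labels)."""
    tokens, labels, pos = [], [], 0
    for m in TAG.finditer(line):
        for tok in line[pos:m.start()].split():
            tokens.append(tok)
            labels.append("O")
        for i, tok in enumerate(m.group(2).split()):
            tokens.append(tok)
            labels.append(("B-" if i == 0 else "I-") + m.group(1))
        pos = m.end()
    for tok in line[pos:].split():
        tokens.append(tok)
        labels.append("O")
    return tokens, labels

def generate_batch() -> list[tuple[list[str], list[str]]]:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,
    )
    lines = resp.choices[0].message.content.strip().splitlines()
    return [to_bio(l) for l in lines if l.strip()]
```

Stage 2 then fine-tunes a local token-classification model (e.g., a BERT-style encoder) on the accumulated (tokens, labels) pairs, so raw patient records never leave the local environment.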
Related papers
- Leveraging ChatGPT in Pharmacovigilance Event Extraction: An Empirical Study [38.555547784219115]
This research investigates the ability of large language models, specifically ChatGPT, to perform pharmacovigilance event extraction.
We conduct extensive experiments to assess ChatGPT's performance on this task.
Including synthesized data in fine-tuning may lead to a decrease in performance, possibly attributable to noise in the ChatGPT-generated labels.
arXiv Detail & Related papers (2024-02-24T00:38:29Z)
- LLM-DA: Data Augmentation via Large Language Models for Few-Shot Named Entity Recognition [67.96794382040547]
LLM-DA is a novel data augmentation technique based on large language models (LLMs) for the few-shot NER task.
The approach employs 14 contextual rewriting strategies, designs entity replacements of the same type, and incorporates noise injection to enhance robustness; a sketch of the replacement and noise operations follows this entry.
arXiv Detail & Related papers (2024-02-22T14:19:56Z)
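As a rough illustration of the same-type entity replacement and noise injection named above; the entity inventory and noise rate are invented for the example, and LLM-DA's 14 contextual rewriting strategies are not shown.

```python
import random

# Hypothetical inventory of entities grouped by type; a mention is
# swapped for another entity of the same type, so BIO labels stay valid.
ENTITIES = {
    "disease": ["asthma", "hypertension", "type 2 diabetes"],
    "drug": ["metformin", "lisinopril", "albuterol"],
}

def replace_entities(tokens, labels):
    """Swap each labeled entity span for a random same-type entity."""
    out_toks, out_labs, i = [], [], 0
    while i < len(tokens):
        if labels[i].startswith("B-"):
            etype = labels[i][2:]
            j = i + 1
            while j < len(tokens) and labels[j] == "I-" + etype:
                j += 1
            new = random.choice(ENTITIES[etype]).split()
            out_toks += new
            out_labs += ["B-" + etype] + ["I-" + etype] * (len(new) - 1)
            i = j
        else:
            out_toks.append(tokens[i])
            out_labs.append(labels[i])
            i += 1
    return out_toks, out_labs

def inject_noise(tokens, p=0.05):
    """Randomly delete one character from some longer tokens."""
    noisy = []
    for tok in tokens:
        if len(tok) > 3 and random.random() < p:
            k = random.randrange(len(tok))
            tok = tok[:k] + tok[k + 1:]
        noisy.append(tok)
    return noisy
```

Applied to a handful of seed sentences, these operations multiply the few-shot training set while keeping the label alignment intact.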
- An Evaluation of Large Language Models in Bioinformatics Research [52.100233156012756]
We study the performance of large language models (LLMs) on a wide spectrum of crucial bioinformatics tasks.
These tasks include the identification of potential coding regions, extraction of named entities for genes and proteins, detection of antimicrobial and anti-cancer peptides, molecular optimization, and resolution of educational bioinformatics problems.
Our findings indicate that, given appropriate prompts, LLMs like GPT variants can successfully handle most of these tasks.
arXiv Detail & Related papers (2024-02-21T11:27:31Z)
- Chatbots Are Not Reliable Text Annotators [0.0]
ChatGPT is a closed-source product with major drawbacks regarding transparency, cost, and data protection.
Recent advances in open-source (OS) large language models (LLMs) offer alternatives that remedy these shortcomings.
arXiv Detail & Related papers (2023-11-09T22:28:14Z)
- CohortGPT: An Enhanced GPT for Participant Recruitment in Clinical Study [17.96401880059829]
Large Language Models (LLMs) such as ChatGPT have achieved tremendous success in various downstream tasks.
We propose to use a knowledge graph as auxiliary information to guide the LLMs in making predictions; a sketch of this prompting pattern follows this entry.
Our few-shot learning method achieves satisfactory performance compared with fine-tuning strategies.
arXiv Detail & Related papers (2023-07-21T04:43:00Z)
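A minimal sketch of knowledge-graph-guided prompting in the spirit of this entry, again assuming the pre-1.0 `openai` client; the toy graph, prompt, and label set are invented for illustration and are far simpler than CohortGPT's.

```python
import openai  # pre-1.0 client; set openai.api_key before use

# Toy knowledge graph encoded as (head, relation, tail) triples.
KG_EDGES = [
    ("pneumonia", "is_a", "lung disease"),
    ("COPD", "is_a", "lung disease"),
    ("lung disease", "qualifies_for", "respiratory trial"),
]

def kg_as_text(edges) -> str:
    return "\n".join(f"{h} {r} {t}" for h, r, t in edges)

def classify_report(report: str) -> str:
    """Ask the LLM for a recruitment label with the graph in-context."""
    prompt = (
        "Use the following clinical knowledge graph when deciding:\n"
        f"{kg_as_text(KG_EDGES)}\n\n"
        f"Patient report: {report}\n"
        "Answer with one label: eligible / ineligible."
    )
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```

Serializing the graph into the prompt lets a frozen model ground its few-shot predictions in domain constraints without any fine-tuning.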
- Pushing the Limits of ChatGPT on NLP Tasks [79.17291002710517]
Despite the success of ChatGPT, its performance on most NLP tasks is still well below the supervised baselines.
In this work, we look into the causes and identify the factors behind this subpar performance.
We propose a collection of general modules to address these issues, in an attempt to push the limits of ChatGPT on NLP tasks.
arXiv Detail & Related papers (2023-06-16T09:40:05Z)
- A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets [19.521390684403293]
We present a thorough evaluation of ChatGPT's performance on diverse academic datasets.
Specifically, we evaluate ChatGPT across 140 tasks and analyze the 255K responses it generates on these datasets.
arXiv Detail & Related papers (2023-05-29T12:37:21Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained, step-by-step instructions to obtain the initial data instances; a sketch of this structure-to-text prompting follows this entry.
Our experiments show that the data generated by STAR significantly improves the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
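A rough sketch of structure-to-text generation in the spirit of STAR: sample the target structure (here a relation triple) first, then ask the model to write text that realizes it. The relation schema and prompt are assumptions, not STAR's actual fine-grained instructions.

```python
import random

import openai  # pre-1.0 client; set openai.api_key before use

# Hypothetical relation schema for clinical relation extraction.
RELATIONS = [("DRUG", "treats", "DISEASE"), ("DRUG", "causes", "ADVERSE_EVENT")]

def structure_to_text() -> str:
    """Fix the structure first, then generate text that expresses it."""
    head_type, relation, tail_type = random.choice(RELATIONS)
    prompt = (
        f"Invent a {head_type} X and a {tail_type} Y such that "
        f"'X {relation} Y' is plausible. Write one clinical-style sentence "
        f"expressing this relation, then end with the line "
        f"'TRIPLE: X | {relation} | Y' using the invented names."
    )
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Because the triple is fixed before the passage is written, every generated instance arrives with a gold relation label attached.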
- ZeroShotDataAug: Generating and Augmenting Training Data with ChatGPT [2.320417845168326]
We investigate prompting a large generative language model, ChatGPT, to produce synthetic training data for augmentation in low-resource scenarios.
We show that, with appropriate task-specific ChatGPT prompts, we outperform the most popular existing approaches for such data augmentation; an example prompt follows this entry.
arXiv Detail & Related papers (2023-04-27T17:07:29Z)
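A minimal example of a task-specific augmentation prompt in the spirit of this entry; the sentiment task, label set, and wording are invented, since the paper tunes its prompts per task.

```python
import openai  # pre-1.0 client; set openai.api_key before use

LABELS = ["positive", "negative"]  # hypothetical classification task

def generate_examples(label: str, n: int = 5) -> list[str]:
    """Ask ChatGPT for n synthetic examples of a single class."""
    prompt = (
        f"Generate {n} short movie reviews with {label} sentiment, "
        "one per line, without numbering."
    )
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    lines = resp.choices[0].message.content.strip().splitlines()
    return [l.strip() for l in lines if l.strip()]

# Each generated text inherits the label it was prompted with.
augmented = [(text, label) for label in LABELS
             for text in generate_examples(label)]
```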
- AugGPT: Leveraging ChatGPT for Text Data Augmentation [59.76140039943385]
We propose a text data augmentation approach based on ChatGPT (named AugGPT).
AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples; a sketch of this rephrasing step follows this entry.
Experiment results on few-shot learning text classification tasks show the superior performance of the proposed AugGPT approach.
arXiv Detail & Related papers (2023-02-25T06:58:16Z)
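A bare-bones sketch of the rephrasing step in the spirit of AugGPT, assuming the pre-1.0 `openai` client; the prompt and the number of paraphrases are illustrative.

```python
import openai  # pre-1.0 client; set openai.api_key before use

def augment(sentence: str, label: str, k: int = 4) -> list[tuple[str, str]]:
    """Rephrase one training sentence into k label-preserving variants."""
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (f"Rephrase the following sentence in {k} different "
                        f"ways, one per line, keeping its meaning:\n"
                        f"{sentence}"),
        }],
    )
    variants = resp.choices[0].message.content.strip().splitlines()
    # Every paraphrase inherits the seed sample's label.
    return [(v.strip(), label) for v in variants if v.strip()]
```

In a few-shot setting, each seed example thus expands into k additional labeled samples before the downstream classifier is fine-tuned.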