Thinking about GPT-3 In-Context Learning for Biomedical IE? Think Again
- URL: http://arxiv.org/abs/2203.08410v1
- Date: Wed, 16 Mar 2022 05:56:08 GMT
- Title: Thinking about GPT-3 In-Context Learning for Biomedical IE? Think Again
- Authors: Bernal Jiménez Gutiérrez, Nikolas McNeal, Clay Washington, You Chen, Lang Li, Huan Sun, Yu Su
- Abstract summary: We present the first systematic and comprehensive study to compare the few-shot performance of GPT-3 in-context learning with fine-tuning smaller (i.e., BERT-sized) PLMs.
Our results show that GPT-3 still significantly underperforms compared with simply fine-tuning a smaller PLM using the same small training set.
- Score: 24.150464908060112
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The strong few-shot in-context learning capability of large pre-trained
language models (PLMs) such as GPT-3 is highly appealing for biomedical
applications where data annotation is particularly costly. In this paper, we
present the first systematic and comprehensive study to compare the few-shot
performance of GPT-3 in-context learning with fine-tuning smaller (i.e.,
BERT-sized) PLMs on two highly representative biomedical information extraction
tasks, named entity recognition and relation extraction. We follow the true
few-shot setting to avoid overestimating models' few-shot performance by model
selection over a large validation set. We also optimize GPT-3's performance
with known techniques such as contextual calibration and dynamic in-context
example retrieval. However, our results show that GPT-3 still significantly
underperforms compared with simply fine-tuning a smaller PLM using the same
small training set. Moreover, what is equally important for practical
applications is that adding more labeled data would reliably yield an
improvement in model performance. While that is the case when fine-tuning small
PLMs, GPT-3's performance barely improves when adding more data. In-depth
analyses further reveal issues of the in-context learning setting that may be
detrimental to information extraction tasks in general. Given the high cost of
experimenting with GPT-3, we hope our study provides guidance for biomedical
researchers and practitioners towards more promising directions such as
fine-tuning GPT-3 or small PLMs.
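The abstract names two prompt-side techniques used to optimize GPT-3: contextual calibration and dynamic in-context example retrieval. A minimal sketch of both ideas follows; the embedding vectors, example pool, and probability values are hypothetical stand-ins (a real setup would use a learned sentence encoder and actual model output probabilities):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_examples(query_vec, pool, k=2):
    """Dynamic retrieval: pick the k labeled examples closest to the query."""
    ranked = sorted(pool, key=lambda ex: cosine_similarity(query_vec, ex["vec"]),
                    reverse=True)
    return ranked[:k]

def build_prompt(query_text, examples):
    """Prepend the retrieved examples to the test input, few-shot style."""
    shots = "\n".join(f"Input: {ex['text']}\nLabel: {ex['label']}" for ex in examples)
    return f"{shots}\nInput: {query_text}\nLabel:"

def calibrate(probs, content_free_probs):
    """Contextual calibration: divide each class probability by the model's
    probability for that class on a content-free input, then renormalize."""
    adjusted = {c: p / max(content_free_probs[c], 1e-9) for c, p in probs.items()}
    total = sum(adjusted.values())
    return {c: v / total for c, v in adjusted.items()}

# Toy relation-extraction pool with hypothetical 3-d embeddings.
pool = [
    {"text": "aspirin treats headache", "label": "treats", "vec": [1.0, 0.1, 0.0]},
    {"text": "ibuprofen causes nausea", "label": "causes", "vec": [0.0, 1.0, 0.2]},
    {"text": "metformin treats diabetes", "label": "treats", "vec": [0.9, 0.2, 0.1]},
]
prompt = build_prompt("statin treats hyperlipidemia",
                      retrieve_examples([1.0, 0.0, 0.0], pool))
print(prompt)
print(calibrate({"treats": 0.8, "causes": 0.2}, {"treats": 0.7, "causes": 0.3}))
```

Retrieval tailors the few-shot examples to each test input rather than fixing them once, while calibration corrects the bias a prompt induces even on an empty input.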
Related papers
- Selecting Between BERT and GPT for Text Classification in Political Science Research [4.487884986288122]
We evaluate the effectiveness of BERT-based versus GPT-based models in low-data scenarios.
We conclude by comparing these approaches in terms of performance, ease of use, and cost.
arXiv Detail & Related papers (2024-11-07T07:29:39Z)
- Text generation for dataset augmentation in security classification tasks [55.70844429868403]
This study evaluates the application of natural language text generators to fill this data gap in multiple security-related text classification tasks.
We find substantial benefits for GPT-3 data augmentation strategies in situations with severe limitations on known positive-class samples.
arXiv Detail & Related papers (2023-10-22T22:25:14Z)
- Systematic Evaluation of GPT-3 for Zero-Shot Personality Estimation [12.777659013330823]
GPT-3 is used to estimate the Big 5 personality traits from users' social media posts.
We find that GPT-3 performance is somewhat close to an existing pre-trained SotA for broad classification.
We analyze where GPT-3 performs better, as well as worse, than a pretrained lexical model, illustrating systematic errors.
arXiv Detail & Related papers (2023-06-01T22:43:37Z)
- Summarizing, Simplifying, and Synthesizing Medical Evidence Using GPT-3 (with Varying Success) [36.646495151276326]
GPT-3 is able to produce high quality summaries of general domain news articles in few- and zero-shot settings.
We enlist domain experts (individuals with medical training) to evaluate summaries of biomedical articles generated by GPT-3, given zero supervision.
arXiv Detail & Related papers (2023-05-10T16:40:37Z)
- News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)
- Improving Short Text Classification With Augmented Data Using GPT-3 [0.0]
GPT-3 is a large-scale natural language model developed by OpenAI.
This study teaches GPT-3 to classify whether a question is related to data science by augmenting a small training set with additional examples.
We find that while the augmented Completion achieves upwards of 80 percent validation accuracy, using the augmented Classification yields more consistent accuracy on unseen examples.
arXiv Detail & Related papers (2022-05-23T01:10:38Z)
- Guiding Generative Language Models for Data Augmentation in Few-Shot Text Classification [59.698811329287174]
We leverage GPT-2 for generating artificial training instances in order to improve classification performance.
Our results show that fine-tuning GPT-2 on a handful of labeled instances leads to consistent classification improvements.
arXiv Detail & Related papers (2021-11-17T12:10:03Z)
- Reframing Instructional Prompts to GPTk's Language [72.69833640335519]
We propose reframing techniques for model designers to create effective prompts for language models.
Our results show that reframing improves few-shot learning performance by 14% while reducing sample complexity.
The performance gains are particularly important for large language models such as GPT-3, where tuning models or prompts on large datasets is not feasible.
arXiv Detail & Related papers (2021-09-16T09:44:43Z)
- GPT-3 Models are Poor Few-Shot Learners in the Biomedical Domain [5.479164650793012]
We investigate the performance of two powerful transformer language models, i.e. GPT-3 and BioBERT, in few-shot settings on various biomedical NLP tasks.
GPT-3 had already achieved near state-of-the-art results in few-shot knowledge transfer on open-domain NLP tasks, but it could not perform as effectively as BioBERT.
arXiv Detail & Related papers (2021-09-06T15:50:37Z)
- What Makes Good In-Context Examples for GPT-3? [101.99751777056314]
GPT-3 has attracted lots of attention due to its superior performance across a wide range of NLP tasks.
Despite its success, we found that the empirical results of GPT-3 depend heavily on the choice of in-context examples.
In this work, we investigate whether there are more effective strategies for judiciously selecting in-context examples.
arXiv Detail & Related papers (2021-01-17T23:38:40Z)
- Language Models are Few-Shot Learners [61.36677350504291]
We show that scaling up language models greatly improves task-agnostic, few-shot performance.
We train GPT-3, an autoregressive language model with 175 billion parameters, and test its performance in the few-shot setting.
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks.
arXiv Detail & Related papers (2020-05-28T17:29:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.