Benchmarking and Analyzing In-context Learning, Fine-tuning and
Supervised Learning for Biomedical Knowledge Curation: a focused study on
chemical entities of biological interest
- URL: http://arxiv.org/abs/2312.12989v1
- Date: Wed, 20 Dec 2023 12:46:44 GMT
- Title: Benchmarking and Analyzing In-context Learning, Fine-tuning and
Supervised Learning for Biomedical Knowledge Curation: a focused study on
chemical entities of biological interest
- Authors: Emily Groves, Minhong Wang, Yusuf Abdulle, Holger Kunz, Jason
Hoelscher-Obermaier, Ronin Wu, Honghan Wu
- Abstract summary: This study compares and analyzes three NLP paradigms for curation: in-context learning (ICL), fine-tuning (FT), and supervised learning (ML).
For ICL, three prompting strategies were employed with GPT-4, GPT-3.5, and BioGPT.
For ML, six embedding models were utilized for training Random Forest and Long Short-Term Memory models.
- Score: 2.8216292452982668
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated knowledge curation for biomedical ontologies is key to ensure that
they remain comprehensive, high-quality and up-to-date. In the era of
foundational language models, this study compares and analyzes three NLP
paradigms for curation tasks: in-context learning (ICL), fine-tuning (FT), and
supervised learning (ML). Using the Chemical Entities of Biological Interest
(ChEBI) database as a model ontology, three curation tasks were devised. For
ICL, three prompting strategies were employed with GPT-4, GPT-3.5, and BioGPT.
PubmedBERT was chosen for the FT paradigm. For ML, six embedding models were
utilized for training Random Forest and Long Short-Term Memory models. Five
setups were designed to assess ML and FT model performance across different
data availability scenarios. Datasets for the curation tasks comprised 620,386
(task 1), 611,430 (task 2), and 617,381 (task 3) triples, each maintaining a
50:50 positive-to-negative ratio. Among the ICL models, GPT-4 achieved the best
accuracy scores of 0.916, 0.766, and 0.874 for tasks 1-3, respectively. In a
direct comparison, ML (trained on ~260,000 triples) outperformed ICL in
accuracy across all tasks (accuracy differences: +0.11, +0.22, and +0.17).
Fine-tuned PubmedBERT performed similarly to the leading ML models in tasks 1
and 2 (F1 differences: -0.014 and +0.002), but worse in task 3 (-0.048).
Simulations revealed performance declines in both ML and FT models as training
data became smaller and more imbalanced, whereas ICL (particularly GPT-4)
excelled in tasks 1 and 3: with fewer than 6,000 training triples, GPT-4
surpassed ML/FT on those tasks, although ICL still underperformed ML/FT in
task 2. With correct prompting, ICL-augmented foundation models can be good
assistants for knowledge curation; however, they do not render the ML and FT
paradigms obsolete. The latter two require task-specific data to beat ICL, and
in such cases ML relies only on small pretrained embeddings, minimizing
computational demands.
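
As a rough illustration of the ML paradigm compared above, the sketch below
verbalizes candidate ChEBI triples, embeds them with a pretrained sentence
encoder, and trains a Random Forest classifier on a small 50:50 labeled set.
The encoder (`all-MiniLM-L6-v2`), the verbalization template, and the example
triples are illustrative stand-ins, not the study's actual embedding models or
data.

```python
# Minimal sketch of the supervised-learning (ML) paradigm for ontology curation:
# verbalize candidate triples, embed them, and classify them as valid/invalid.
# All models and triples below are stand-ins chosen for illustration only.
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labeled candidate triples: (subject, relation, object, label)
triples = [
    ("caffeine", "has_role", "central nervous system stimulant", 1),
    ("caffeine", "has_role", "antibiotic", 0),
    ("glucose", "is_a", "aldohexose", 1),
    ("glucose", "is_a", "amino acid", 0),
]

# Verbalize each triple as a short sentence before embedding.
texts = [f"{s} {r.replace('_', ' ')} {o}" for s, r, o, _ in triples]
labels = [y for *_, y in triples]

# Any small pretrained embedding model can stand in for the six used in the study.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0
)

# Train a Random Forest on the embeddings and report accuracy and F1.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("F1:", f1_score(y_test, pred))
```

Swapping in different encoders or shrinking the labeled set in this sketch
mirrors, in spirit, the study's comparison of embedding models under varying
data availability.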
Related papers
- Information Extraction from Clinical Notes: Are We Ready to Switch to Large Language Models? [16.312594953592665]
Large language models (LLMs) excel on generative tasks, but their performance on extractive tasks remains debated.
This study is among the first to develop and evaluate a comprehensive clinical IE system using open-source LLMs.
arXiv Detail & Related papers (2024-11-15T07:54:19Z) - Training Compute-Optimal Protein Language Models [48.79416103951816]
Most protein language models are trained with extensive compute resources until performance gains plateau.
Our investigation is grounded in a massive dataset consisting of 939 million protein sequences.
We trained over 300 models ranging from 3.5 million to 10.7 billion parameters on 5 to 200 billion unique tokens.
arXiv Detail & Related papers (2024-11-04T14:58:37Z) - DataComp-LM: In search of the next generation of training sets for language models [200.5293181577585]
DataComp for Language Models (DCLM) is a testbed for controlled dataset experiments with the goal of improving language models.
We provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations.
Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters.
arXiv Detail & Related papers (2024-06-17T17:42:57Z) - Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find solutions reachable via our training procedure, including the gradient-based optimizer and regularizers, which limits flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z) - A comparative study of zero-shot inference with large language models
and supervised modeling in breast cancer pathology classification [1.4715634464004446]
Large language models (LLMs) have demonstrated promising transfer learning capability.
LLMs demonstrated the potential to speed up the execution of clinical NLP studies by reducing the need for curating large annotated datasets.
This may result in an increase in the utilization of NLP-based variables and outcomes in observational clinical studies.
arXiv Detail & Related papers (2024-01-25T02:05:31Z) - TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data [73.29220562541204]
We consider harnessing the power of large language models (LLMs) to solve our task.
We develop a TAT-LLM language model by fine-tuning LLaMA 2 with the training data generated automatically from existing expert-annotated datasets.
arXiv Detail & Related papers (2024-01-24T04:28:50Z) - BioInstruct: Instruction Tuning of Large Language Models for Biomedical Natural Language Processing [10.698756010878688]
We created BioInstruct, a dataset comprising 25,005 instructions for instruction-tuning large language models (LLMs).
The instructions were created by prompting the GPT-4 language model with three seed samples randomly drawn from 80 human-curated instructions.
We evaluated these instruction-tuned LLMs on several BioNLP tasks, which can be grouped into three major categories: question answering (QA), information extraction (IE), and text generation (GEN).
arXiv Detail & Related papers (2023-10-30T19:38:50Z) - CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
arXiv Detail & Related papers (2023-05-03T17:55:25Z) - Distilling Step-by-Step! Outperforming Larger Language Models with Less
Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models.
We present three findings across 4 NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z) - A Novel Semi-supervised Meta Learning Method for Subject-transfer
Brain-computer Interface [7.372748737217638]
We propose a semi-supervised meta-learning method for subject-transfer learning in BCIs.
The proposed method first learns a meta model with the existing subjects, then fine-tunes the model in a semi-supervised learning manner.
It is significant for BCI applications where the labeled data are scarce or expensive while unlabeled data are readily available.
arXiv Detail & Related papers (2022-09-07T15:38:57Z)