Exploring the Effectiveness of Instruction Tuning in Biomedical Language
Processing
- URL: http://arxiv.org/abs/2401.00579v1
- Date: Sun, 31 Dec 2023 20:02:10 GMT
- Title: Exploring the Effectiveness of Instruction Tuning in Biomedical Language
Processing
- Authors: Omid Rohanian, Mohammadmahdi Nouriborji, David A. Clifton
- Abstract summary: This study investigates the potential of instruction tuning for biomedical language processing.
We present a comprehensive, instruction-based model trained on a dataset that consists of approximately $200,000$ instruction-focused samples.
- Score: 19.41164870575055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs), particularly those similar to ChatGPT, have
significantly influenced the field of Natural Language Processing (NLP). While
these models excel in general language tasks, their performance in
domain-specific downstream tasks such as biomedical and clinical Named Entity
Recognition (NER), Relation Extraction (RE), and Medical Natural Language
Inference (NLI) is still evolving. In this context, our study investigates the
potential of instruction tuning for biomedical language processing, applying
this technique to two general LLMs of substantial scale. We present a
comprehensive, instruction-based model trained on a dataset that consists of
approximately $200,000$ instruction-focused samples. This dataset represents a
carefully curated compilation of existing data, meticulously adapted and
reformatted to align with the specific requirements of our instruction-based
tasks. This initiative represents an important step in utilising such models to
achieve results on par with specialised encoder-only models like BioBERT and
BioClinicalBERT for various classical biomedical NLP tasks. Our work includes
an analysis of the dataset's composition and its impact on model performance,
providing insights into the intricacies of instruction tuning. By sharing our
codes, models, and the distinctively assembled instruction-based dataset, we
seek to encourage ongoing research and development in this area.
Related papers
- Pre-training data selection for biomedical domain adaptation using journal impact metrics [0.0]
We employ two straightforward journal impact metrics and conduct experiments by continually pre-training BERT on various subsets of the complete PubMed training set.
Our results show that pruning using journal impact metrics is not efficient. But we also show that pre-training using fewer abstracts (but with the same number of training steps) does not necessarily decrease the resulting model's performance.
arXiv Detail & Related papers (2024-09-04T13:59:48Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - An Evaluation of Large Language Models in Bioinformatics Research [52.100233156012756]
We study the performance of large language models (LLMs) on a wide spectrum of crucial bioinformatics tasks.
These tasks include the identification of potential coding regions, extraction of named entities for genes and proteins, detection of antimicrobial and anti-cancer peptides, molecular optimization, and resolution of educational bioinformatics problems.
Our findings indicate that, given appropriate prompts, LLMs like GPT variants can successfully handle most of these tasks.
arXiv Detail & Related papers (2024-02-21T11:27:31Z) - Diversifying Knowledge Enhancement of Biomedical Language Models using
Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z) - UMLS-KGI-BERT: Data-Centric Knowledge Integration in Transformers for
Biomedical Entity Recognition [4.865221751784403]
This work contributes a data-centric paradigm for enriching the language representations of biomedical transformer-encoder LMs by extracting text sequences from the UMLS.
Preliminary results from experiments in the extension of pre-trained LMs as well as training from scratch show that this framework improves downstream performance on multiple biomedical and clinical Named Entity Recognition (NER) tasks.
arXiv Detail & Related papers (2023-07-20T18:08:34Z) - Interpretable Medical Diagnostics with Structured Data Extraction by
Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z) - An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians.
Recent studies have achieved promising results in automatic impression generation using large-scale medical text data.
These models often require substantial amounts of medical text data and have poor generalization performance.
arXiv Detail & Related papers (2023-04-17T17:13:42Z) - BigBIO: A Framework for Data-Centric Biomedical Natural Language
Processing [13.30221348538759]
We introduce BigBIO, a community library of 126+ biomedical NLP datasets.
BigBIO facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata.
We discuss our process for task schema, data auditing, contribution guidelines, and outline two illustrative use cases.
arXiv Detail & Related papers (2022-06-30T07:15:45Z) - PharmKE: Knowledge Extraction Platform for Pharmaceutical Texts using
Transfer Learning [0.0]
PharmKE is a text analysis platform that applies deep learning through several stages for thorough semantic analysis of pharmaceutical articles.
The methodology is used to create accurately labeled training and test datasets, which are then used to train models for custom entity labeling tasks.
The obtained results are compared to the fine-tuned BERT and BioBERT models trained on the same dataset.
arXiv Detail & Related papers (2021-02-25T19:36:35Z) - Domain-Specific Language Model Pretraining for Biomedical Natural
Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.