Pre-trained Language Models in Biomedical Domain: A Systematic Survey
- URL: http://arxiv.org/abs/2110.05006v4
- Date: Mon, 17 Jul 2023 03:58:28 GMT
- Title: Pre-trained Language Models in Biomedical Domain: A Systematic Survey
- Authors: Benyou Wang, Qianqian Xie, Jiahuan Pei, Zhihong Chen, Prayag Tiwari,
Zhao Li, and Jie Fu
- Abstract summary: Pre-trained language models (PLMs) have been the de facto paradigm for most natural language processing (NLP) tasks.
This paper summarizes the recent progress of pre-trained language models in the biomedical domain and their applications in biomedical downstream tasks.
- Score: 33.572502204216256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained language models (PLMs) have been the de facto paradigm for most
natural language processing (NLP) tasks. This also benefits the biomedical domain:
researchers from informatics, medicine, and computer science (CS) communities
propose various PLMs trained on biomedical datasets, e.g., biomedical text,
electronic health records, and protein and DNA sequences, for various biomedical
tasks. However, the cross-disciplinary nature of biomedical PLMs hinders their
spread across communities; some existing works are isolated from each other,
lacking comprehensive comparison and discussion. This calls for a survey that
not only systematically reviews recent advances in biomedical PLMs and their
applications but also standardizes terminology and benchmarks. In this
paper, we summarize the recent progress of pre-trained language models in the
biomedical domain and their applications in biomedical downstream tasks.
Particularly, we discuss the motivations and propose a taxonomy of existing
biomedical PLMs. Their applications in biomedical downstream tasks are
exhaustively discussed. Finally, we discuss various limitations and future
trends, which we hope can inspire future research in the community.
Related papers
- An Evaluation of Large Language Models in Bioinformatics Research [52.100233156012756]
We study the performance of large language models (LLMs) on a wide spectrum of crucial bioinformatics tasks.
These tasks include the identification of potential coding regions, extraction of named entities for genes and proteins, detection of antimicrobial and anti-cancer peptides, molecular optimization, and resolution of educational bioinformatics problems.
Our findings indicate that, given appropriate prompts, LLMs like GPT variants can successfully handle most of these tasks.
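As a rough illustration of the prompt-based setup described above, the sketch below queries a chat LLM for gene/protein named-entity extraction. The model name, prompt wording, and helper function are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical zero-shot prompt for gene/protein entity extraction;
# the model name and prompt are assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def extract_gene_protein_entities(text: str) -> str:
    prompt = (
        "Extract all gene and protein names mentioned in the following "
        "biomedical text. Return them as a comma-separated list.\n\n"
        f"Text: {text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; the paper evaluates several GPT variants
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content


print(extract_gene_protein_entities(
    "BRCA1 mutations impair the interaction between BARD1 and RAD51."
))
```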
arXiv Detail & Related papers (2024-02-21T11:27:31Z)
- Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low.
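A minimal sketch of the general adapter technique (not this paper's exact architecture): a small down-project/up-project block with a residual connection, inserted on top of a frozen PLM layer so that only the adapter parameters are trained.

```python
# Minimal bottleneck adapter sketch (general technique, not the paper's exact design).
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_size: int = 768, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen PLM's representation intact;
        # only the small down/up projections receive gradient updates.
        return hidden_states + self.up(self.act(self.down(hidden_states)))


# Usage: wrap the hidden states of a frozen PLM layer with the adapter.
adapter = BottleneckAdapter()
hidden = torch.randn(2, 16, 768)   # (batch, sequence length, hidden size)
print(adapter(hidden).shape)       # torch.Size([2, 16, 768])
```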
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
- BioLORD-2023: Semantic Textual Representations Fusing LLM and Clinical Knowledge Graph Insights [15.952942443163474]
We propose a new state-of-the-art approach for obtaining high-fidelity representations of biomedical concepts and sentences.
We demonstrate consistent and substantial performance improvements over the previous state of the art.
Besides our new state-of-the-art biomedical model for English, we also distill and release a multilingual model compatible with 50+ languages.
arXiv Detail & Related papers (2023-11-27T18:46:17Z)
- Bio-SIEVE: Exploring Instruction Tuning Large Language Models for Systematic Review Automation [6.452837513222072]
Large Language Models (LLMs) can support and be trained to perform literature screening for medical systematic reviews.
Our best model, Bio-SIEVE, outperforms both ChatGPT and trained traditional approaches.
We see Bio-SIEVE as an important step towards specialising LLMs for the biomedical systematic review process.
arXiv Detail & Related papers (2023-08-12T16:56:55Z)
- Biomedical Language Models are Robust to Sub-optimal Tokenization [30.175714262031253]
Most modern biomedical language models (LMs) are pre-trained using standard domain-specific tokenizers.
We find that pre-training a biomedical LM using a more accurate biomedical tokenizer does not improve the entity representation quality of a language model.
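To see the effect being studied, one can compare how a general-domain tokenizer and a biomedical tokenizer segment a domain term; the checkpoint names below are illustrative assumptions.

```python
# Compare subword segmentation of a biomedical term under a general-domain
# tokenizer vs. a biomedical one. Checkpoint names are illustrative assumptions.
from transformers import AutoTokenizer

general = AutoTokenizer.from_pretrained("bert-base-uncased")
biomed = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
)

term = "thrombocytopenia"
print("general-domain:", general.tokenize(term))  # typically fragmented into many pieces
print("biomedical:    ", biomed.tokenize(term))   # usually fewer, more meaningful pieces
```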
arXiv Detail & Related papers (2023-06-30T13:35:24Z)
- LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [85.19963303642427]
We propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images.
The model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics.
This enables us to train a Large Language and Vision Assistant for BioMedicine in less than 15 hours (with eight A100s).
arXiv Detail & Related papers (2023-06-01T16:50:07Z)
- BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs [48.376109878173956]
We present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets.
PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles.
Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing.
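CLIP-style pretraining of this kind optimizes a symmetric contrastive objective over matched image-text pairs; a minimal sketch of that loss (not BiomedCLIP's exact implementation) follows.

```python
# Minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss over a batch
# of matched image-text embedding pairs; not BiomedCLIP's exact implementation.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Normalize embeddings and compute pairwise cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th caption; cross-entropy in both directions.
    targets = torch.arange(logits.size(0))
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


# Toy usage with random embeddings for a batch of 8 pairs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```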
arXiv Detail & Related papers (2023-03-02T02:20:04Z)
- BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining [140.61707108174247]
We propose BioGPT, a domain-specific generative Transformer language model pre-trained on large scale biomedical literature.
We get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks respectively, and 78.2% accuracy on PubMedQA.
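For a quick sense of the model, the sketch below runs free-text generation through Hugging Face Transformers; the checkpoint name "microsoft/biogpt" is assumed here for illustration.

```python
# Minimal text-generation sketch with BioGPT via Hugging Face Transformers.
# The checkpoint name "microsoft/biogpt" is assumed for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")

inputs = tokenizer("COVID-19 is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, num_beams=5, early_stopping=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```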
arXiv Detail & Related papers (2022-10-19T07:17:39Z)
- BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model [1.1764594853212893]
In this work, we introduce the generative language model BioBART that adapts BART to the biomedical domain.
We collate various biomedical language generation tasks including dialogue, summarization, entity linking, and named entity recognition.
BioBART pretrained on PubMed abstracts has enhanced performance compared to BART and set strong baselines on several tasks.
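Since BioBART keeps BART's encoder-decoder interface, generation tasks such as summarization follow the standard seq2seq API; the checkpoint name below is an assumption, and a checkpoint fine-tuned on the target task would be needed for meaningful output.

```python
# Seq2seq generation sketch with a BART-style biomedical model.
# The checkpoint name "GanjinZero/biobart-base" is an assumption for illustration;
# any BART-compatible biomedical checkpoint exposes the same interface, and
# fine-tuning on a downstream generation task is expected before useful output.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GanjinZero/biobart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("GanjinZero/biobart-base")

text = ("The patient presented with persistent cough and fever. Chest imaging "
        "revealed bilateral infiltrates, and antibiotic therapy was initiated.")
inputs = tokenizer(text, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_new_tokens=30, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```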
arXiv Detail & Related papers (2022-04-08T08:07:42Z)
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
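A quick way to probe such a from-scratch domain model is the masked-language-model fill-mask interface; the checkpoint name below (a PubMedBERT-style model) is an illustrative assumption.

```python
# Fill-mask probe of a domain-specific masked language model.
# The checkpoint name is an illustrative assumption (PubMedBERT-style model).
from transformers import pipeline

fill = pipeline(
    "fill-mask",
    model="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
)
for prediction in fill("Aspirin inhibits [MASK] aggregation."):
    print(prediction["token_str"], round(prediction["score"], 3))
```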
arXiv Detail & Related papers (2020-07-31T00:04:15Z)