Biomedical Language Models are Robust to Sub-optimal Tokenization
- URL: http://arxiv.org/abs/2306.17649v3
- Date: Mon, 10 Jul 2023 16:22:41 GMT
- Title: Biomedical Language Models are Robust to Sub-optimal Tokenization
- Authors: Bernal Jiménez Gutiérrez, Huan Sun, Yu Su
- Abstract summary: Most modern biomedical language models (LMs) are pre-trained using standard domain-specific tokenizers.
We find that pre-training a biomedical LM using a more accurate biomedical tokenizer does not improve the entity representation quality of a language model.
- Score: 30.175714262031253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As opposed to general English, many concepts in biomedical terminology have been designed in recent history by biomedical professionals with the goal of being precise and concise. This is often achieved by concatenating meaningful biomedical morphemes to create new semantic units. Nevertheless, most modern biomedical language models (LMs) are pre-trained using standard domain-specific tokenizers derived from large-scale biomedical corpus statistics without explicitly leveraging the agglutinating nature of biomedical language. In this work, we first find that standard open-domain and biomedical tokenizers are largely unable to segment biomedical terms into meaningful components. We therefore hypothesize that a tokenizer which segments biomedical terminology more accurately would enable biomedical LMs to improve their performance on downstream biomedical NLP tasks, especially those that involve biomedical terms directly, such as named entity recognition (NER) and entity linking. Surprisingly, we find that pre-training a biomedical LM with a more accurate biomedical tokenizer does not improve its entity representation quality, as measured by several intrinsic and extrinsic measures such as masked language modeling (MLM) prediction accuracy as well as NER and entity linking performance. These quantitative findings, along with a case study that explores entity representation quality more directly, suggest that the biomedical pre-training process is quite robust to instances of sub-optimal tokenization.
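To make the segmentation problem described in the abstract concrete, here is a minimal sketch (not code from the paper) that tokenizes a few agglutinative biomedical terms with a standard open-domain WordPiece tokenizer and contrasts the output with a hand-written morpheme split; the checkpoint name and the example terms are illustrative assumptions.

```python
# Illustrative sketch only (not from the paper); requires `pip install transformers`.
from transformers import AutoTokenizer

# A standard open-domain WordPiece tokenizer; a biomedical tokenizer derived
# from PubMed corpus statistics could be swapped in the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Hypothetical agglutinative terms with a plausible morpheme-level split
# that a domain expert might expect.
terms = {
    "nephropathy": ["nephro", "pathy"],                        # kidney + disease
    "hypercholesterolemia": ["hyper", "cholesterol", "emia"],  # high + cholesterol + blood condition
    "hepatosplenomegaly": ["hepato", "spleno", "megaly"],      # liver + spleen + enlargement
}

for term, morphemes in terms.items():
    print(term)
    print("  wordpieces:", tokenizer.tokenize(term))
    print("  morphemes :", morphemes)
```

The point of the sketch is that subword vocabularies are built from corpus frequency statistics, so the pieces they produce need not line up with the morpheme boundaries shown on the right; the paper's finding is that, in practice, this mismatch does not substantially hurt downstream entity representations.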
Related papers
- Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping computing requirements low (a minimal adapter sketch follows below).
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
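As a rough illustration of the adapter-based knowledge injection described in the entry above, the following is a minimal bottleneck-adapter sketch in PyTorch; the hidden and bottleneck sizes, the GELU activation, and the training objective mentioned in the comments are assumptions rather than the paper's exact configuration.

```python
# Minimal bottleneck-adapter sketch (assumption: not the paper's exact design).
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Lightweight adapter: down-project, non-linearity, up-project, residual."""
    def __init__(self, hidden_size: int = 768, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen PLM's representation intact
        # while the small adapter learns the knowledge-graph-derived signal.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Usage sketch: insert one adapter per transformer layer of a frozen PLM
# (e.g. PubMedBERT) and train only the adapter parameters on a KG-derived
# objective such as entity or triple prediction.
adapter = BottleneckAdapter()
x = torch.randn(2, 16, 768)   # (batch, sequence, hidden)
print(adapter(x).shape)       # torch.Size([2, 16, 768])
```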
- BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks [68.39821375903591]
Generalist AI holds the potential to address the limitations of narrow, task-specific models thanks to its versatility in interpreting different data types.
Here, we propose BiomedGPT, the first open-source and lightweight vision-language foundation model.
arXiv Detail & Related papers (2023-05-26T17:14:43Z)
- Detecting Idiomatic Multiword Expressions in Clinical Terminology using Definition-Based Representation Learning [12.30055843580139]
We develop an effective tool for scoring the idiomaticity of biomedical MWEs, based on the degree of similarity between the semantic representation of an MWE and a weighted average of the representations of its constituents (sketched below).
Our results show that the BioLORD model has a strong ability to identify idiomatic MWEs, an ability not replicated in other models.
arXiv Detail & Related papers (2023-05-11T13:42:58Z)
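As referenced in the entry above, here is a minimal sketch of that idiomaticity score: one minus the cosine similarity between the embedding of the full MWE and a weighted average of its constituents' embeddings. The encoder checkpoint, the uniform weighting, and the example phrases are assumptions; the entry itself relies on definition-based biomedical models such as BioLORD.

```python
# Idiomaticity-score sketch (assumption: generic encoder, uniform weights).
# Requires `pip install sentence-transformers`.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def idiomaticity_score(mwe: str, weights=None) -> float:
    """1 - cosine(MWE embedding, weighted average of constituent embeddings).

    A higher score means the whole expression drifts further from the
    meaning of its parts, i.e. it is more idiomatic.
    """
    constituents = mwe.split()
    weights = np.ones(len(constituents)) if weights is None else np.asarray(weights)
    weights = weights / weights.sum()

    mwe_vec = model.encode(mwe)
    part_vecs = model.encode(constituents)                # (n_constituents, dim)
    avg_vec = (weights[:, None] * part_vecs).sum(axis=0)  # weighted average

    cos = float(np.dot(mwe_vec, avg_vec) /
                (np.linalg.norm(mwe_vec) * np.linalg.norm(avg_vec)))
    return 1.0 - cos

# Hypothetical examples: a fairly compositional term vs. a more idiomatic one.
print(idiomaticity_score("kidney stone"))
print(idiomaticity_score("charley horse"))
```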
- BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs [48.376109878173956]
We present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets.
PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles.
Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing.
arXiv Detail & Related papers (2023-03-02T02:20:04Z)
- BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining [140.61707108174247]
We propose BioGPT, a domain-specific generative Transformer language model pre-trained on large scale biomedical literature.
BioGPT achieves F1 scores of 44.98%, 38.42%, and 40.76% on the BC5CDR, KD-DTI, and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA.
arXiv Detail & Related papers (2022-10-19T07:17:39Z)
- Pre-trained Language Models in Biomedical Domain: A Systematic Survey [33.572502204216256]
Pre-trained language models (PLMs) have been the de facto paradigm for most natural language processing (NLP) tasks.
This paper summarizes the recent progress of pre-trained language models in the biomedical domain and their applications in biomedical downstream tasks.
arXiv Detail & Related papers (2021-10-11T05:30:30Z)
- Biomedical Interpretable Entity Representations [40.6095537182194]
Pre-trained language models induce dense entity representations that offer strong performance on entity-centric NLP tasks, but these representations are not readily interpretable.
This lack of interpretability can be a barrier to model uptake in important domains such as biomedicine.
We create a new entity type system and training set from a large corpus of biomedical texts.
arXiv Detail & Related papers (2021-06-17T13:52:10Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the experiments show that state-of-the-art neural models still perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
- Biomedical Entity Linking with Contrastive Context Matching [5.2710726359379265]
We introduce BioCoM, a contrastive learning framework for biomedical entity linking.
We build the training instances from raw PubMed articles by dictionary matching.
We predict the normalized biomedical entity at inference time through a nearest-neighbor search (sketched below).
arXiv Detail & Related papers (2021-06-14T16:43:33Z)
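A minimal sketch of the nearest-neighbor inference step described in the BioCoM entry above: embed a small dictionary of normalized names once, then link a mention to the entry with the highest cosine similarity. The encoder checkpoint, the toy dictionary, and the placeholder concept identifiers are assumptions; BioCoM itself uses a contrastively trained encoder over PubMed-derived contexts.

```python
# Nearest-neighbor entity-linking sketch (toy dictionary; generic encoder
# standing in for a contrastively trained one); requires sentence-transformers.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical normalized dictionary: surface name -> placeholder concept id.
dictionary = {
    "myocardial infarction": "CONCEPT_001",
    "type 2 diabetes mellitus": "CONCEPT_002",
    "hypertensive disease": "CONCEPT_003",
}
names = list(dictionary)
name_vecs = model.encode(names, normalize_embeddings=True)  # (n_names, dim), unit norm

def link(mention: str) -> tuple[str, str]:
    """Return the (name, concept id) pair nearest to the mention embedding."""
    query = model.encode(mention, normalize_embeddings=True)
    scores = name_vecs @ query          # cosine similarity for unit vectors
    best = int(np.argmax(scores))
    return names[best], dictionary[names[best]]

print(link("heart attack"))
```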
- BioALBERT: A Simple and Effective Pre-trained Language Model for Biomedical Named Entity Recognition [9.05154470433578]
Existing BioNER approaches often neglect the challenges specific to biomedical text and directly adopt state-of-the-art (SOTA) models.
We propose biomedical ALBERT, an effective domain-specific language model trained on large-scale biomedical corpora.
arXiv Detail & Related papers (2020-09-19T12:58:47Z)
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)