Self-Alignment Pretraining for Biomedical Entity Representations
- URL: http://arxiv.org/abs/2010.11784v2
- Date: Wed, 7 Apr 2021 11:01:50 GMT
- Title: Self-Alignment Pretraining for Biomedical Entity Representations
- Authors: Fangyu Liu, Ehsan Shareghi, Zaiqiao Meng, Marco Basaldella, Nigel
Collier
- Abstract summary: We propose SapBERT, a pretraining scheme that self-aligns the representation space of biomedical entities.
We design a scalable metric learning framework that can leverage UMLS, a massive collection of biomedical entities.
- Score: 37.09383468126953
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the widespread success of self-supervised learning via masked
language models (MLM), accurately capturing fine-grained semantic relationships
in the biomedical domain remains a challenge. This is of paramount importance
for entity-level tasks such as entity linking where the ability to model entity
relations (especially synonymy) is pivotal. To address this challenge, we
propose SapBERT, a pretraining scheme that self-aligns the representation space
of biomedical entities. We design a scalable metric learning framework that can
leverage UMLS, a massive collection of biomedical ontologies with 4M+ concepts.
In contrast with previous pipeline-based hybrid systems, SapBERT offers an
elegant one-model-for-all solution to the problem of medical entity linking
(MEL), achieving a new state-of-the-art (SOTA) on six MEL benchmarking
datasets. In the scientific domain, we achieve SOTA even without task-specific
supervision. With substantial improvement over various domain-specific
pretrained MLMs such as BioBERT, SciBERT, and PubMedBERT, our pretraining
scheme proves to be both effective and robust.
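As a rough illustration of how a self-aligned biomedical encoder of this kind can be used for medical entity linking, the sketch below embeds entity mentions and candidate concept names with a BERT encoder and links each mention to its nearest candidate by cosine similarity. The checkpoint identifier, the toy candidate dictionary, and the use of the [CLS] vector are assumptions made for the example, not details taken from the paper.

```python
# Minimal sketch: nearest-neighbour medical entity linking with a
# self-aligned biomedical encoder. Checkpoint name and candidate list
# are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

def embed(names):
    """Encode a list of surface forms into L2-normalised [CLS] vectors."""
    batch = tokenizer(names, padding=True, truncation=True,
                      max_length=25, return_tensors="pt")
    with torch.no_grad():
        cls = model(**batch).last_hidden_state[:, 0]  # [CLS] embedding
    return torch.nn.functional.normalize(cls, dim=-1)

# Toy UMLS-style candidate dictionary: concept ID -> canonical name.
candidates = {
    "C0020538": "hypertensive disease",
    "C0011849": "diabetes mellitus",
    "C0004096": "asthma",
}
cand_ids = list(candidates)
cand_vecs = embed([candidates[c] for c in cand_ids])

mentions = ["high blood pressure", "sugar diabetes"]
mention_vecs = embed(mentions)

# Link each mention to the candidate with the highest cosine similarity.
scores = mention_vecs @ cand_vecs.T
for mention, row in zip(mentions, scores):
    best = cand_ids[int(row.argmax())]
    print(f"{mention!r} -> {best} ({candidates[best]})")
```

In practice the full UMLS dictionary (4M+ concept names) would be embedded once and indexed with an approximate nearest-neighbour library; the three-entry dictionary above only keeps the sketch self-contained.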
Related papers
- Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
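As a loose sketch of the adapter idea summarized above (not the authors' implementation), the snippet below shows a standard bottleneck adapter that can be attached to a frozen transformer layer, so that only the small adapter is trained when injecting knowledge; the hidden and bottleneck sizes are illustrative assumptions.

```python
# Minimal sketch of a bottleneck adapter module, assuming PyTorch;
# the hidden sizes are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Down-project, non-linearity, up-project, plus a residual connection."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Usage: wrap a frozen transformer layer's output with a trainable adapter.
adapter = Adapter()
frozen_layer_output = torch.randn(2, 16, 768)   # (batch, seq_len, hidden)
adapted = adapter(frozen_layer_output)
print(adapted.shape)  # torch.Size([2, 16, 768])
```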
- Improving Biomedical Entity Linking with Retrieval-enhanced Learning [53.24726622142558]
$k$NN-BioEL provides a BioEL model with the ability to reference similar instances from the entire training corpus as clues for prediction.
We show that $k$NN-BioEL outperforms state-of-the-art baselines on several datasets.
arXiv Detail & Related papers (2023-12-15T14:04:23Z)
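Very loosely, retrieval-enhanced prediction of this kind can be pictured as looking up the k most similar embedded training instances and blending their labels with the model's own scores; the generic sketch below illustrates that idea under those assumptions and is not the $k$NN-BioEL algorithm itself.

```python
# Generic sketch of kNN-augmented prediction: retrieve the k most similar
# training instances and interpolate their labels with the model's scores.
import numpy as np

def knn_augmented_scores(query_vec, train_vecs, train_labels,
                         model_scores, num_labels, k=4, lam=0.5):
    """Blend model scores with a label distribution from the k nearest neighbours."""
    sims = train_vecs @ query_vec                 # cosine sims (vectors pre-normalised)
    top = np.argsort(-sims)[:k]
    knn_dist = np.zeros(num_labels)
    for idx in top:
        knn_dist[train_labels[idx]] += np.exp(sims[idx])
    knn_dist /= knn_dist.sum()
    return lam * model_scores + (1 - lam) * knn_dist

# Toy example with random, pre-normalised embeddings.
rng = np.random.default_rng(0)
train_vecs = rng.normal(size=(100, 32))
train_vecs /= np.linalg.norm(train_vecs, axis=1, keepdims=True)
train_labels = rng.integers(0, 5, size=100)
query = rng.normal(size=32)
query /= np.linalg.norm(query)
model_scores = np.full(5, 0.2)                    # uniform model distribution
print(knn_augmented_scores(query, train_vecs, train_labels, model_scores, num_labels=5))
```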
- BioBridge: Bridging Biomedical Foundation Models via Knowledge Graphs [27.32543389443672]
We present BioBridge, a novel parameter-efficient learning framework to bridge independently trained unimodal FMs to establish multimodal behavior.
Our empirical results demonstrate that BioBridge can beat the best baseline KG embedding methods.
We also find that BioBridge demonstrates out-of-domain generalization by extrapolating to unseen modalities or relations.
arXiv Detail & Related papers (2023-10-05T05:30:42Z)
- Biomedical Language Models are Robust to Sub-optimal Tokenization [30.175714262031253]
Most modern biomedical language models (LMs) are pre-trained using standard domain-specific tokenizers.
We find that pre-training a biomedical LM using a more accurate biomedical tokenizer does not improve the entity representation quality of a language model.
arXiv Detail & Related papers (2023-06-30T13:35:24Z)
- Interpretability from a new lens: Integrating Stratification and Domain knowledge for Biomedical Applications [0.0]
This paper proposes a novel computational strategy for stratifying biomedical problem datasets into k-fold cross-validation splits (CVs).
This approach can improve model stability, establish trust, and provide explanations for outcomes generated by trained IML models.
arXiv Detail & Related papers (2023-03-15T12:02:02Z)
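For reference, plain stratified k-fold cross-validation, e.g. with scikit-learn as below, is the standard baseline for such splits; the paper's strategy additionally incorporates domain knowledge, which this sketch does not attempt to reproduce.

```python
# Baseline stratified k-fold cross-validation with scikit-learn; the
# paper's domain-knowledge-aware stratification is not reproduced here.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=20,
                           weights=[0.8, 0.2], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="f1")
print("per-fold F1:", scores.round(3), "mean:", scores.mean().round(3))
```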
- Differentiable Agent-based Epidemiology [71.81552021144589]
We introduce GradABM: a scalable, differentiable design for agent-based modeling that is amenable to gradient-based learning with automatic differentiation.
GradABM can simulate million-size populations in a few seconds on commodity hardware, integrate with deep neural networks, and ingest heterogeneous data sources.
arXiv Detail & Related papers (2022-07-20T07:32:02Z)
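To give a flavour of what differentiability buys, the toy example below calibrates the transmission rate of a tiny compartmental (not agent-based) epidemic simulator by backpropagating through the simulation with PyTorch autograd; it illustrates gradient-based calibration through a simulator in general and is not GradABM's design.

```python
# Toy illustration of gradient-based calibration through a differentiable
# epidemic simulator (a tiny SIR model, not an agent-based model).
import torch

def simulate_sir(beta, gamma=torch.tensor(0.1), days=30, n=1000.0, i0=10.0):
    """Run a discrete-time SIR model and return daily new infections."""
    s, i = torch.tensor(n - i0), torch.tensor(i0)
    new_cases = []
    for _ in range(days):
        infections = beta * s * i / n
        recoveries = gamma * i
        s = s - infections
        i = i + infections - recoveries
        new_cases.append(infections)
    return torch.stack(new_cases)

observed = simulate_sir(torch.tensor(0.3)).detach()  # stand-in for real case counts
beta = torch.tensor(0.1, requires_grad=True)          # initial guess
opt = torch.optim.Adam([beta], lr=0.05)
for step in range(200):
    opt.zero_grad()
    loss = torch.mean((simulate_sir(beta) - observed) ** 2)
    loss.backward()                                    # gradients flow through the simulator
    opt.step()
print(f"recovered beta: {beta.item():.3f}")
```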
- Evaluating Biomedical BERT Models for Vocabulary Alignment at Scale in the UMLS Metathesaurus [8.961270657070942]
The current UMLS (Unified Medical Language System) Metathesaurus construction process is expensive and error-prone.
Recent advances in Natural Language Processing have achieved state-of-the-art (SOTA) performance on downstream tasks.
We aim to validate whether approaches using BERT models can actually outperform the existing approaches for predicting synonymy in the UMLS Metathesaurus.
arXiv Detail & Related papers (2021-09-14T16:52:16Z)
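Synonymy prediction of this kind is commonly framed as sentence-pair classification: the two atom strings are fed to a BERT cross-encoder that predicts whether they are synonymous. The sketch below shows that generic formulation with an assumed general-purpose checkpoint and an untrained classification head; it is not the evaluation setup of the paper.

```python
# Generic sketch: framing synonymy prediction as BERT sentence-pair
# classification. The checkpoint and label semantics are assumptions;
# the classification head is untrained here and would need fine-tuning.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # a biomedical checkpoint would normally be used
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2).eval()

pairs = [("myocardial infarction", "heart attack"),
         ("myocardial infarction", "fractured femur")]
batch = tokenizer([a for a, _ in pairs], [b for _, b in pairs],
                  padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**batch).logits, dim=-1)

# Probabilities are meaningless until the head is fine-tuned on labeled pairs.
for (a, b), p in zip(pairs, probs):
    print(f"P(synonym | {a!r}, {b!r}) = {p[1].item():.3f}")
```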
- UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus [73.86656026386038]
We introduce UmlsBERT, a contextual embedding model that integrates domain knowledge during the pre-training process.
With these knowledge-augmentation strategies, UmlsBERT can encode clinical domain knowledge into word embeddings and outperform existing domain-specific models.
arXiv Detail & Related papers (2020-10-20T15:56:31Z)
- BioALBERT: A Simple and Effective Pre-trained Language Model for Biomedical Named Entity Recognition [9.05154470433578]
Existing BioNER approaches often neglect the challenges specific to biomedical text and directly adopt the state-of-the-art (SOTA) models.
We propose biomedical ALBERT, an effective domain-specific language model trained on large-scale biomedical corpora.
arXiv Detail & Related papers (2020-09-19T12:58:47Z)
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.