Self-Alignment Pretraining for Biomedical Entity Representations
- URL: http://arxiv.org/abs/2010.11784v2
- Date: Wed, 7 Apr 2021 11:01:50 GMT
- Title: Self-Alignment Pretraining for Biomedical Entity Representations
- Authors: Fangyu Liu, Ehsan Shareghi, Zaiqiao Meng, Marco Basaldella, Nigel Collier
- Abstract summary: We propose SapBERT, a pretraining scheme that self-aligns the representation space of biomedical entities.
We design a scalable metric learning framework that can leverage UMLS, a massive collection of biomedical entities.
- Score: 37.09383468126953
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the widespread success of self-supervised learning via masked
language models (MLM), accurately capturing fine-grained semantic relationships
in the biomedical domain remains a challenge. This is of paramount importance
for entity-level tasks such as entity linking where the ability to model entity
relations (especially synonymy) is pivotal. To address this challenge, we
propose SapBERT, a pretraining scheme that self-aligns the representation space
of biomedical entities. We design a scalable metric learning framework that can
leverage UMLS, a massive collection of biomedical ontologies with 4M+ concepts.
In contrast with previous pipeline-based hybrid systems, SapBERT offers an
elegant one-model-for-all solution to the problem of medical entity linking
(MEL), achieving a new state-of-the-art (SOTA) on six MEL benchmarking
datasets. In the scientific domain, we achieve SOTA even without task-specific
supervision. With substantial improvement over various domain-specific
pretrained MLMs such as BioBERT, SciBERT and PubMedBERT, our pretraining
scheme proves to be both effective and robust.
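The self-alignment idea can be illustrated with a toy contrastive objective: pull embeddings of synonymous surface forms of the same UMLS concept together while pushing apart names of different concepts. SapBERT's actual objective is a multi-similarity loss with online hard-pair mining over UMLS synonym pairs; the InfoNCE-style sketch below, in plain NumPy, is a simplified stand-in, and all names and values in it are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def synonym_alignment_loss(anchors, positives, temperature=0.07):
    """Toy InfoNCE-style self-alignment loss: each anchor name should be
    most similar to its synonym (same concept) among all candidates in
    the batch. (SapBERT itself uses an online hard-pair-mined
    multi-similarity loss; this is a simplified stand-in.)"""
    # L2-normalize so dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # diagonal entries correspond to the true synonym pairs
    return -np.mean(np.diag(log_probs))

# Toy batch: four concept names with near-duplicate synonym embeddings.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
positives = anchors + 0.1 * rng.normal(size=(4, 8))  # slightly perturbed
loss = synonym_alignment_loss(anchors, positives)
print(loss)
```

When the synonym embeddings are already well aligned, as in this toy batch, the loss is close to zero; a real training loop would backpropagate this signal through the encoder to reshape the entity representation space.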
Related papers
- MedSAM-Agent: Empowering Interactive Medical Image Segmentation with Multi-turn Agentic Reinforcement Learning [53.37068897861388]
MedSAM-Agent is a framework that reformulates interactive segmentation as a multi-step autonomous decision-making process.
We develop a two-stage training pipeline that integrates multi-turn, end-to-end outcome verification.
Experiments across 6 medical modalities and 21 datasets demonstrate that MedSAM-Agent achieves state-of-the-art performance.
arXiv Detail & Related papers (2026-02-03T09:47:49Z)
- Harnessing Large Language Models for Biomedical Named Entity Recognition [4.376764535031509]
BioNER is a foundational task in medical informatics, crucial for downstream applications like drug discovery and clinical trial matching.
We introduce BioSelectTune, a highly efficient, data-centric framework for fine-tuning general-domain Large Language Models.
Our model, trained on only 50% of the curated positive data, surpasses the fully-trained baseline.
arXiv Detail & Related papers (2025-12-28T01:34:23Z)
- BioVERSE: Representation Alignment of Biomedical Modalities to LLMs for Multi-Modal Reasoning [0.36855563110245826]
We present BIOVERSE, a two-stage approach that adapts pretrained BioFMs as modality encoders.
The approach first aligns each modality to a shared LLM space.
It then applies standard instruction tuning with multi-modal data to bring them together for downstream reasoning.
arXiv Detail & Related papers (2025-10-01T20:07:36Z)
- m-KAILIN: Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training [8.238980609871042]
We propose a knowledge-driven, multi-agent framework for scientific corpus distillation tailored for biomedical training.
Our approach is a collaborative multi-agent architecture, where specialized agents, each guided by the Medical Subject Headings (MeSH) hierarchy, work in concert to autonomously extract, synthesize, and self-evaluate high-quality data.
arXiv Detail & Related papers (2025-04-28T08:18:24Z)
- Knowledge Hierarchy Guided Biological-Medical Dataset Distillation for Domain LLM Training [10.701353329227722]
We propose a framework that automates the distillation of high-quality textual training data from the extensive scientific literature.
Our approach self-evaluates and generates questions that are more closely aligned with the biomedical domain.
Our approach substantially improves question-answering tasks compared to pre-trained models from the life sciences domain.
arXiv Detail & Related papers (2025-01-25T07:20:44Z)
- Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models [51.316001071698224]
We introduce Biology-Instructions, the first large-scale multi-omics biological sequences-related instruction-tuning dataset.
This dataset can bridge the gap between large language models (LLMs) and complex biological sequences-related tasks.
We also develop a strong baseline called ChatMultiOmics with a novel three-stage training pipeline.
arXiv Detail & Related papers (2024-12-26T12:12:23Z)
- Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
- Improving Biomedical Entity Linking with Retrieval-enhanced Learning [53.24726622142558]
$k$NN-BioEL provides a BioEL model with the ability to reference similar instances from the entire training corpus as clues for prediction.
We show that $k$NN-BioEL outperforms state-of-the-art baselines on several datasets.
arXiv Detail & Related papers (2023-12-15T14:04:23Z)
- BioBridge: Bridging Biomedical Foundation Models via Knowledge Graphs [27.32543389443672]
We present BioBridge, a novel parameter-efficient learning framework to bridge independently trained unimodal FMs to establish multimodal behavior.
Our empirical results demonstrate that BioBridge can beat the best baseline KG embedding methods.
We also identify BioBridge demonstrates out-of-domain generalization ability by extrapolating to unseen modalities or relations.
arXiv Detail & Related papers (2023-10-05T05:30:42Z)
- Biomedical Language Models are Robust to Sub-optimal Tokenization [30.175714262031253]
Most modern biomedical language models (LMs) are pre-trained using standard domain-specific tokenizers.
We find that pre-training a biomedical LM using a more accurate biomedical tokenizer does not improve the entity representation quality of a language model.
arXiv Detail & Related papers (2023-06-30T13:35:24Z)
- Interpretability from a new lens: Integrating Stratification and Domain knowledge for Biomedical Applications [0.0]
This paper proposes a novel computational strategy for stratifying biomedical problem datasets into k-fold cross-validation (CV) splits.
This approach can improve model stability, establish trust, and provide explanations for outcomes generated by trained IML models.
arXiv Detail & Related papers (2023-03-15T12:02:02Z)
- Differentiable Agent-based Epidemiology [71.81552021144589]
We introduce GradABM: a scalable, differentiable design for agent-based modeling that is amenable to gradient-based learning with automatic differentiation.
GradABM can quickly simulate million-size populations in few seconds on commodity hardware, integrate with deep neural networks and ingest heterogeneous data sources.
arXiv Detail & Related papers (2022-07-20T07:32:02Z)
- Evaluating Biomedical BERT Models for Vocabulary Alignment at Scale in the UMLS Metathesaurus [8.961270657070942]
The current UMLS (Unified Medical Language System) Metathesaurus construction process is expensive and error-prone.
Recent advances in Natural Language Processing have achieved state-of-the-art (SOTA) performance on downstream tasks.
We aim to validate if approaches using the BERT models can actually outperform the existing approaches for predicting synonymy in the UMLS Metathesaurus.
arXiv Detail & Related papers (2021-09-14T16:52:16Z)
- UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus [73.86656026386038]
We introduce UmlsBERT, a contextual embedding model that integrates domain knowledge during the pre-training process.
By applying these two strategies, UmlsBERT can encode clinical domain knowledge into word embeddings and outperform existing domain-specific models.
arXiv Detail & Related papers (2020-10-20T15:56:31Z)
- BioALBERT: A Simple and Effective Pre-trained Language Model for Biomedical Named Entity Recognition [9.05154470433578]
Existing BioNER approaches often neglect these issues and directly adopt the state-of-the-art (SOTA) models.
We propose biomedical ALBERT, an effective domain-specific language model trained on large-scale biomedical corpora.
arXiv Detail & Related papers (2020-09-19T12:58:47Z)
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.