Generative Biomedical Entity Linking via Knowledge Base-Guided
Pre-training and Synonyms-Aware Fine-tuning
- URL: http://arxiv.org/abs/2204.05164v1
- Date: Mon, 11 Apr 2022 14:50:51 GMT
- Title: Generative Biomedical Entity Linking via Knowledge Base-Guided
Pre-training and Synonyms-Aware Fine-tuning
- Authors: Hongyi Yuan, Zheng Yuan, Sheng Yu
- Abstract summary: We propose a generative approach to model biomedical entity linking (EL)
We propose KB-guided pre-training by constructing synthetic samples with synonyms and definitions from KB.
We also propose synonyms-aware fine-tuning to select concept names for training, and propose decoder prompt and multi-synonyms constrained prefix tree for inference.
- Score: 0.8154691566915505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Entities lie at the heart of biomedical natural language understanding, and
the biomedical entity linking (EL) task remains challenging due to
fine-grained and highly variable concept names. Generative methods achieve
remarkable performance in general-domain EL with lower memory usage, but
require expensive pre-training. Previous biomedical EL methods leverage
synonyms from knowledge bases (KBs), which are not trivial to inject into a
generative method. In this work, we model biomedical EL with a generative
approach and propose to inject synonym knowledge into it. We propose
KB-guided pre-training, which constructs synthetic samples from KB synonyms
and definitions and requires the model to recover concept names. We also
propose synonyms-aware fine-tuning to select concept names for training, and
a decoder prompt and a multi-synonyms constrained prefix tree for inference.
Our method achieves state-of-the-art results on several biomedical EL tasks
without candidate selection, which demonstrates the effectiveness of the
proposed pre-training and fine-tuning strategies.
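To make the two ingredients above concrete, here is a minimal sketch of KB-guided pre-training sample construction. The KB layout (the `names` and `definition` fields), the `[START]`/`[END]` mention markers, and the prompt template are illustrative assumptions, not the authors' exact format; the point is only that (input, target) pairs asking the model to recover a concept name can be synthesized directly from KB entries.

```python
import random

# Hypothetical KB layout (illustrative, not the paper's actual schema):
# concept id -> synonyms plus an optional definition.
KB = {
    "D003924": {
        "names": ["type 2 diabetes mellitus", "adult-onset diabetes", "NIDDM"],
        "definition": "A metabolic disorder characterized by insulin resistance.",
    },
}

def synthetic_pretraining_samples(kb, rng=None):
    """Yield (source, target) pairs asking a seq2seq model to recover
    a KB concept name from a synonym placed in a definition context."""
    rng = rng or random.Random(0)
    for cui, entry in kb.items():
        names = entry["names"]
        for mention in names:
            # Target is another name of the same concept (or the mention
            # itself when the concept has only a single name).
            others = [n for n in names if n != mention]
            target = rng.choice(others or names)
            source = f"{entry.get('definition', '')} [START] {mention} [END]"
            yield source, target

for src, tgt in synthetic_pretraining_samples(KB):
    print(src, "->", tgt)
```

And a sketch of the multi-synonyms constrained prefix tree: a trie over the token ids of every KB name and synonym, queried at each decoding step so the decoder can only emit valid concept names. Hooking `allowed_next` into generation via something like HuggingFace's `prefix_allowed_tokens_fn` is one plausible integration; the paper's exact decoding logic may differ.

```python
from typing import Dict, Iterable, List

class PrefixTrie:
    """Trie over token-id sequences of KB concept names and synonyms."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixTrie"] = {}
        self.is_end = False  # True if some full name ends at this node

    def add(self, token_ids: Iterable[int]) -> None:
        node = self
        for tok in token_ids:
            node = node.children.setdefault(tok, PrefixTrie())
        node.is_end = True

    def allowed_next(self, prefix: List[int]) -> List[int]:
        """Token ids that extend `prefix` toward some valid KB name.
        (A full implementation would also allow EOS when node.is_end.)"""
        node = self
        for tok in prefix:
            node = node.children.get(tok)
            if node is None:
                return []  # prefix matches no KB name
        return list(node.children)

trie = PrefixTrie()
for name_ids in [[12, 7, 99], [12, 8], [40, 7]]:  # toy tokenized synonyms
    trie.add(name_ids)
print(trie.allowed_next([12]))  # -> [7, 8]
```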
Related papers
- Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery [52.498055901649025]
Concept Bottleneck Models (CBMs) have been proposed to address the 'black-box' problem of deep neural networks.
We propose a novel CBM approach -- called Discover-then-Name-CBM (DN-CBM) -- that inverts the typical paradigm.
Our concept extraction strategy is efficient, since it is agnostic to the downstream task, and uses concepts already known to the model.
arXiv Detail & Related papers (2024-07-19T17:50:11Z)
- BELHD: Improving Biomedical Entity Linking with Homonym Disambiguation [4.477762005644463]
Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base (KB).
BELHD builds upon the BioSyn (Sung et al., 2020) model, introducing two crucial extensions.
Experiments with 10 corpora and five entity types show that BELHD improves upon state-of-the-art approaches.
arXiv Detail & Related papers (2024-01-10T12:45:18Z)
- Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping computing requirements low (a minimal adapter sketch appears after this list).
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
- Bi-Encoders based Species Normalization -- Pairwise Sentence Learning to Rank [0.0]
We present a novel deep learning approach for named entity normalization, treating it as a pairwise learning-to-rank problem.
We conduct experiments on species entity types and evaluate our method against state-of-the-art techniques.
arXiv Detail & Related papers (2023-10-22T17:30:16Z)
- Biomedical Named Entity Recognition via Dictionary-based Synonym Generalization [51.89486520806639]
We propose a novel Synonym Generalization (SynGen) framework that recognizes the biomedical concepts contained in the input text using span-based predictions.
We extensively evaluate our approach on a wide range of benchmarks and the results verify that SynGen outperforms previous dictionary-based models by notable margins.
arXiv Detail & Related papers (2023-05-22T14:36:32Z)
- Automatic Biomedical Term Clustering by Learning Fine-grained Term Representations [0.8154691566915505]
State-of-the-art term embeddings leverage pretrained language models to encode terms, and use synonym and relation knowledge from knowledge graphs to guide contrastive learning.
However, these embeddings are not sensitive to minor textual differences, which leads to failures in biomedical term clustering.
To alleviate this problem, we adjust the sampling strategy when pretraining term embeddings by providing dynamic hard positive and negative samples (a sketch of such mining appears after this list).
We name our method CODER++; it has been applied to cluster biomedical concepts in the newly released biomedical knowledge graph BIOS.
arXiv Detail & Related papers (2022-04-01T12:30:58Z)
- KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction [111.74812895391672]
We propose a Knowledge-aware Prompt-tuning approach with synergistic optimization (KnowPrompt).
We inject latent knowledge contained in relation labels into prompt construction with learnable virtual type words and answer words.
arXiv Detail & Related papers (2021-04-15T17:57:43Z)
- A Lightweight Neural Model for Biomedical Entity Linking [1.8047694351309205]
We propose a lightweight neural method for biomedical entity linking.
Our method uses a simple alignment layer with attention mechanisms to capture the variations between mention and entity names.
Our model is competitive with previous work on standard evaluation benchmarks.
arXiv Detail & Related papers (2020-12-16T10:34:37Z)
- UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus [73.86656026386038]
We introduce UmlsBERT, a contextual embedding model that integrates domain knowledge during the pre-training process.
By applying these two strategies, UmlsBERT can encode clinical domain knowledge into word embeddings and outperform existing domain-specific models.
arXiv Detail & Related papers (2020-10-20T15:56:31Z)
- PhenoTagger: A Hybrid Method for Phenotype Concept Recognition using Human Phenotype Ontology [6.165755812152143]
PhenoTagger is a hybrid method that combines dictionary-based and machine learning-based methods to recognize concepts in unstructured text.
Our method is validated with two HPO corpora, and the results show that PhenoTagger compares favorably to previous methods.
arXiv Detail & Related papers (2020-09-17T18:00:43Z)
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
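Two of the entries above describe mechanisms concrete enough to sketch. First, the adapter-based knowledge injection: a minimal Houlsby-style bottleneck adapter in PyTorch, assuming the common down-project/nonlinearity/up-project design with a residual connection. The hidden and bottleneck sizes are illustrative, and the modules used in that paper may differ.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, plus a residual connection.

    Inserted inside a frozen pre-trained encoder layer; only these small
    matrices are trained on the knowledge-graph objective, keeping the
    computing requirements low.
    """
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64) -> None:
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = BottleneckAdapter()
x = torch.randn(2, 16, 768)   # (batch, sequence, hidden)
print(adapter(x).shape)       # torch.Size([2, 16, 768])
```

Second, the CODER++ entry's dynamic hard positive and negative sampling. Under the assumption of a batch of current term embeddings and their concept ids, the sketch below picks each term's most similar non-synonym as a hard negative and its least similar synonym as a hard positive; the paper's actual mining (e.g., approximate nearest-neighbour search at UMLS scale) is more involved.

```python
import torch
import torch.nn.functional as F

def mine_hard_pairs(embeddings: torch.Tensor, cuis: list):
    """Dynamic hard-example mining in the current embedding space.

    Hard negative: most similar term with a different concept id.
    Hard positive: least similar term sharing the same concept id.
    (Assumes each term has at least one non-synonym in the batch.)
    """
    z = F.normalize(embeddings, dim=1)
    sims = z @ z.T
    hard_pos, hard_neg = [], []
    for i in range(len(cuis)):
        order = sims[i].argsort(descending=True).tolist()
        hard_neg.append(next(j for j in order if cuis[j] != cuis[i]))
        same = [j for j in range(len(cuis)) if cuis[j] == cuis[i] and j != i]
        hard_pos.append(min(same, key=lambda j: sims[i, j].item()) if same else i)
    return hard_pos, hard_neg

emb = torch.randn(6, 32)  # toy embeddings for six terms
print(mine_hard_pairs(emb, ["C1", "C1", "C2", "C2", "C3", "C3"]))
```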
This list is automatically generated from the titles and abstracts of the papers on this site.