Automatic Biomedical Term Clustering by Learning Fine-grained Term
Representations
- URL: http://arxiv.org/abs/2204.00391v1
- Date: Fri, 1 Apr 2022 12:30:58 GMT
- Title: Automatic Biomedical Term Clustering by Learning Fine-grained Term
Representations
- Authors: Sihang Zeng, Zheng Yuan, Sheng Yu
- Abstract summary: State-of-the-art term embeddings leverage pretrained language models to encode terms and use synonyms and relation knowledge from knowledge graphs to guide contrastive learning.
These embeddings are not sensitive to minor textual differences, which leads to failures in biomedical term clustering.
To alleviate this problem, we adjust the sampling strategy in pretraining term embeddings by providing dynamic hard positive and negative samples.
We name our proposed method CODER++; it has been applied to cluster biomedical concepts in the newly released biomedical knowledge graph BIOS.
- Score: 0.8154691566915505
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Term clustering is important in biomedical knowledge graph construction.
Using similarities between term embeddings is helpful for term clustering.
State-of-the-art term embeddings leverage pretrained language models to encode
terms, and use synonyms and relation knowledge from knowledge graphs to guide
contrastive learning. These embeddings place terms that belong to the same
concept close together. However, our probing experiments show that these
embeddings are not sensitive to minor textual differences, which leads to
failures in biomedical term clustering. To alleviate this problem, we adjust
the sampling strategy in pretraining term embeddings by providing dynamic hard
positive and negative samples during contrastive learning to learn fine-grained
representations, which result in better biomedical term clustering. We name our
proposed method CODER++; it has been applied to cluster biomedical concepts in
the newly released biomedical knowledge graph BIOS.
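The core change CODER++ makes is in the sampling strategy: positives and negatives are mined dynamically by hardness during contrastive pretraining. Below is a minimal PyTorch sketch of that idea, in-batch hard positive/negative mining with an InfoNCE-style loss; the function name, the two-candidate logits, and the temperature are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def hard_mined_infonce(embeddings, concept_ids, temperature=0.07):
    """Contrastive loss with in-batch hard example mining.

    For each anchor term, the positive is the *least* similar term of the
    same concept (hardest to pull together) and the negative is the *most*
    similar term of a different concept (hardest to push apart).
    Assumes every concept contributes at least two terms to the batch.
    """
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t()                                   # (N, N) cosine similarity
    n = sim.size(0)
    same = concept_ids.unsqueeze(0) == concept_ids.unsqueeze(1)
    self_mask = torch.eye(n, dtype=torch.bool, device=sim.device)

    pos_sim = sim.masked_fill(~same | self_mask, float("inf")).min(dim=1).values
    neg_sim = sim.masked_fill(same, float("-inf")).max(dim=1).values

    # Index 0 (the hard positive) is the "correct class" of each anchor.
    logits = torch.stack([pos_sim, neg_sim], dim=1) / temperature
    targets = torch.zeros(n, dtype=torch.long, device=sim.device)
    return F.cross_entropy(logits, targets)

# Toy usage: four terms covering two concepts.
emb = torch.randn(4, 8, requires_grad=True)
loss = hard_mined_infonce(emb, torch.tensor([0, 0, 1, 1]))
loss.backward()
```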
Related papers
- Diversifying Knowledge Enhancement of Biomedical Language Models using
Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping computing-power requirements low.
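For reference, a bottleneck adapter of the kind this line of work inserts into a frozen PLM can be sketched in a few lines of PyTorch; the hidden and bottleneck sizes below are illustrative, and the paper's exact placement and configuration may differ.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual.
    Only these small layers are trained; the PLM weights stay frozen."""
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # Residual connection keeps the frozen model's behavior recoverable.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Toy usage: adapt a (batch, seq_len, hidden) tensor from a transformer layer.
x = torch.randn(2, 16, 768)
print(Adapter()(x).shape)  # torch.Size([2, 16, 768])
```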
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
- CoRTEx: Contrastive Learning for Representing Terms via Explanations with
Applications on Constructing Biomedical Knowledge Graphs [9.328980260014216]
Previous contrastive learning models trained with Unified Medical Language System (UMLS) synonyms struggle at clustering difficult terms.
We leverage the world knowledge of Large Language Models (LLMs) to enhance term representations and significantly improve term clustering.
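The idea of pairing a term with an explanation before encoding can be sketched as follows; bert-base-uncased stands in for whatever encoder CoRTEx actually uses, and the embed helper (and its [CLS] pooling) is a hypothetical choice.

```python
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Stand-in encoder; CoRTEx's actual checkpoint and pooling may differ.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

def embed(term, explanation):
    # Encode the term together with its (e.g. LLM-written) explanation so the
    # model sees disambiguating context, not just the surface string.
    inputs = tok(term, explanation, return_tensors="pt", truncation=True)
    cls = enc(**inputs).last_hidden_state[:, 0]      # [CLS] vector
    return F.normalize(cls, dim=-1)

a = embed("MI", "Myocardial infarction: death of heart muscle from ischemia.")
b = embed("heart attack", "Common name for myocardial infarction.")
print((a @ b.t()).item())  # cosine similarity of the two term representations
```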
arXiv Detail & Related papers (2023-12-13T10:29:34Z)
- Hierarchical Pretraining for Biomedical Term Embeddings [4.69793648771741]
We propose HiPrBERT, a novel biomedical term representation model trained on hierarchical data.
We show that HiPrBERT effectively learns pairwise distances from hierarchical information, resulting in substantially more informative embeddings for further biomedical applications.
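One way to learn pairwise distances from hierarchical information is to regress embedding distances toward tree distances; the sketch below does exactly that on a toy taxonomy, and is only an assumption about the objective's shape, not HiPrBERT's actual loss.

```python
import networkx as nx
import torch
import torch.nn.functional as F

# Toy is-a hierarchy standing in for a real biomedical taxonomy.
tree = nx.Graph([("disease", "heart disease"),
                 ("heart disease", "myocardial infarction"),
                 ("disease", "infection")])

def hierarchy_loss(emb, terms, scale=4.0):
    """Penalize mismatch between embedding distance and (scaled) tree distance."""
    loss = torch.zeros(())
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            d_emb = 1 - F.cosine_similarity(emb[i], emb[j], dim=0)
            d_tree = nx.shortest_path_length(tree, terms[i], terms[j]) / scale
            loss = loss + (d_emb - d_tree) ** 2
    return loss

emb = torch.randn(3, 16, requires_grad=True)
hierarchy_loss(emb, ["heart disease", "myocardial infarction", "infection"]).backward()
```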
arXiv Detail & Related papers (2023-07-01T08:16:00Z)
- Biomedical Named Entity Recognition via Dictionary-based Synonym
Generalization [51.89486520806639]
We propose a novel Synonym Generalization (SynGen) framework that recognizes the biomedical concepts contained in the input text using span-based predictions.
We extensively evaluate our approach on a wide range of benchmarks and the results verify that SynGen outperforms previous dictionary-based models by notable margins.
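A span-based recognizer of this kind enumerates candidate spans and scores each against dictionary-term embeddings; the mean pooling and max-similarity scoring below are a hedged sketch, not SynGen's actual prediction head.

```python
import torch
import torch.nn.functional as F

def enumerate_spans(n_tokens, max_len=4):
    """All candidate (start, end) spans, end inclusive, up to max_len tokens."""
    return [(i, j) for i in range(n_tokens)
                   for j in range(i, min(i + max_len, n_tokens))]

def score_spans(token_emb, dict_emb, max_len=4):
    """Score each span (mean-pooled token vectors) by its best match against
    dictionary entry embeddings (terms plus generated synonyms)."""
    dict_norm = F.normalize(dict_emb, dim=-1)
    scores = {}
    for i, j in enumerate_spans(token_emb.size(0), max_len):
        span_vec = F.normalize(token_emb[i:j + 1].mean(dim=0), dim=-1)
        scores[(i, j)] = (dict_norm @ span_vec).max().item()
    return scores

token_emb = torch.randn(6, 32)    # contextual vectors for one sentence
dict_emb = torch.randn(100, 32)   # embeddings of dictionary terms/synonyms
best_span, best_score = max(score_spans(token_emb, dict_emb).items(),
                            key=lambda kv: kv[1])
```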
arXiv Detail & Related papers (2023-05-22T14:36:32Z)
- BioLORD: Learning Ontological Representations from Definitions (for
Biomedical Concepts and their Textual Descriptions) [17.981285086380147]
BioLORD is a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts.
Because biomedical names are not always self-explanatory, representations learned from names alone are sometimes non-semantic.
BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph.
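Grounding names in definitions naturally suggests a symmetric contrastive objective between name and definition embeddings; below is a minimal sketch of that shape (BioLORD's full objective also draws on descriptions from a multi-relational knowledge graph).

```python
import torch
import torch.nn.functional as F

def name_definition_loss(name_emb, def_emb, temperature=0.05):
    """Symmetric InfoNCE: each concept name should be closest to its own
    definition among all definitions in the batch, and vice versa."""
    a = F.normalize(name_emb, dim=-1)
    b = F.normalize(def_emb, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy usage: a batch of 8 (name, definition) embedding pairs.
loss = name_definition_loss(torch.randn(8, 64), torch.randn(8, 64))
```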
arXiv Detail & Related papers (2022-10-21T11:43:59Z)
- LifeLonger: A Benchmark for Continual Disease Classification [59.13735398630546]
We introduce LifeLonger, a benchmark for continual disease classification on the MedMNIST collection.
Task- and class-incremental learning of diseases addresses the issue of classifying new samples without retraining the models from scratch.
Cross-domain incremental learning addresses the issue of dealing with datasets originating from different institutions while retaining the previously obtained knowledge.
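The task-incremental setting reduces to training on tasks sequentially without revisiting earlier data; a minimal loop follows (the benchmark evaluates real continual-learning strategies on top of this skeleton).

```python
import torch
import torch.nn.functional as F

def train_incrementally(model, optimizer, task_loaders):
    """Train on tasks one after another; earlier data is never revisited, so
    accuracy on old tasks after each stage measures catastrophic forgetting."""
    for task_id, loader in enumerate(task_loaders):
        for x, y in loader:
            loss = F.cross_entropy(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Here one would evaluate on all tasks seen so far.

# Toy usage: three "tasks", each a list of (inputs, labels) batches.
model = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
tasks = [[(torch.randn(8, 16), torch.randint(0, 4, (8,)))] for _ in range(3)]
train_incrementally(model, opt, tasks)
```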
arXiv Detail & Related papers (2022-04-12T12:25:05Z)
- Clinical Named Entity Recognition using Contextualized Token
Representations [49.036805795072645]
This paper introduces contextualized word embeddings to better capture the semantic meaning of each word based on its context.
We pre-train two deep contextualized language models, Clinical Embeddings from Language Model (C-ELMo) and Clinical Contextual String Embeddings (C-Flair).
Experiments show that our models gain dramatic improvements compared to both static word embeddings and domain-generic language models.
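The contrast with static embeddings is easy to demonstrate with any transformer encoder; bert-base-uncased below is a generic stand-in, not the clinical C-ELMo/C-Flair models themselves.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

def contextual_vectors(sentence):
    """One vector per subword token, conditioned on the whole sentence,
    unlike a static lookup table that returns the same vector everywhere."""
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        return enc(**inputs).last_hidden_state.squeeze(0)

v1 = contextual_vectors("The patient presented with a cold.")
v2 = contextual_vectors("Store the sample in cold conditions.")
# The vectors for the token "cold" differ between the two sentences.
```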
arXiv Detail & Related papers (2021-06-23T18:12:58Z)
- Federated Semi-supervised Medical Image Classification via Inter-client
Relation Matching [58.26619456972598]
Federated learning (FL) has become increasingly popular for training deep networks collaboratively across distributed medical institutions.
This paper studies a practical yet challenging FL problem, named Federated Semi-supervised Learning (FSSL).
We present a novel approach for this problem, which improves over the traditional consistency regularization mechanism with a new inter-client relation matching scheme.
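One plausible reading of "inter-client relation matching" is that clients share class-relation statistics (e.g. prototype similarity matrices) rather than raw data, and unlabeled clients are regularized toward the labeled clients' relations; the sketch below encodes that reading and is an assumption, not the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def relation_matrix(features, labels, num_classes):
    """Class-prototype similarity matrix: a compact summary of how classes
    relate inside one client (assumes every class appears in the batch)."""
    protos = torch.stack([features[labels == c].mean(dim=0)
                          for c in range(num_classes)])
    protos = F.normalize(protos, dim=-1)
    return protos @ protos.t()

def relation_matching_loss(client_rel, shared_rel):
    # Unlabeled clients (labels from model predictions) align their relation
    # matrix with the one aggregated from labeled clients.
    return F.mse_loss(client_rel, shared_rel)

# Toy usage: 12 feature vectors evenly covering 3 classes; the identity
# matrix stands in for a relation matrix shared by labeled clients.
feats, labels = torch.randn(12, 16), torch.arange(12) % 3
loss = relation_matching_loss(relation_matrix(feats, labels, 3), torch.eye(3))
```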
arXiv Detail & Related papers (2021-06-16T07:58:00Z)
- End-to-end Biomedical Entity Linking with Span-based Dictionary Matching [5.273138059454523]
Disease name recognition and normalization is a fundamental process in biomedical text mining.
This study introduces a novel end-to-end approach that combines span representations with dictionary-matching features.
Our model handles unseen concepts by referring to a dictionary while maintaining the performance of neural network-based models.
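Combining a learned span representation with a dictionary-matching feature can be as simple as concatenation before a scoring layer; a hedged sketch, with the dimensions and the single binary feature as illustrative choices rather than the paper's exact feature set.

```python
import torch
import torch.nn as nn

class SpanDictScorer(nn.Module):
    """Score candidate spans from (a) a neural span representation and (b) a
    binary feature saying whether the span's surface form hits a dictionary
    entry, which lets unseen concepts still be linked via the dictionary."""
    def __init__(self, span_dim=256):
        super().__init__()
        self.ff = nn.Linear(span_dim + 1, 1)

    def forward(self, span_repr, in_dictionary):
        feats = torch.cat([span_repr, in_dictionary.unsqueeze(-1)], dim=-1)
        return self.ff(feats).squeeze(-1)   # higher = more likely an entity

# Toy usage: four candidate spans, two of which match the dictionary.
scorer = SpanDictScorer()
logits = scorer(torch.randn(4, 256), torch.tensor([1.0, 0.0, 1.0, 0.0]))
```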
arXiv Detail & Related papers (2021-04-21T12:24:12Z)
- Disease Normalization with Graph Embeddings [12.70213916725476]
We propose to represent disease names by leveraging MeSH's graphical structure together with the lexical information available in the taxonomy.
We train and test our methods on the well-known NCBI disease benchmark corpus.
We also show that combining neural named entity recognition models with our graph-based entity linking methods via multitask learning leads to improved disease recognition in the NCBI corpus.
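Using a taxonomy's structure typically means learning node vectors in which connected concepts are close; the LINE-style first-order sketch below shows the idea on a toy graph and is not the paper's specific model.

```python
import torch
import torch.nn.functional as F

# Toy taxonomy edges (parent, child) by node index, standing in for MeSH.
edges = torch.tensor([[0, 1], [0, 2], [1, 3], [2, 4]])
num_nodes, dim = 5, 32
emb = torch.nn.Embedding(num_nodes, dim)
opt = torch.optim.Adam(emb.parameters(), lr=0.01)

# First-order objective: connected nodes score high, random pairs score low.
for _ in range(200):
    u, v = emb(edges[:, 0]), emb(edges[:, 1])
    neg = emb(torch.randint(0, num_nodes, (edges.size(0),)))
    loss = -(F.logsigmoid((u * v).sum(-1))
             + F.logsigmoid(-(u * neg).sum(-1))).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# A disease name's final representation can then concatenate this graph
# vector with a lexical (text) embedding of the name.
```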
arXiv Detail & Related papers (2020-10-24T16:25:05Z)
- Semi-supervised Medical Image Classification with Relation-driven
Self-ensembling Model [71.80319052891817]
We present a relation-driven semi-supervised framework for medical image classification.
It exploits unlabeled data by encouraging prediction consistency for a given input under perturbations.
Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
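The consistency backbone of such self-ensembling methods is a mean-teacher setup: penalize disagreement between a student and an EMA teacher on perturbed copies of the same unlabeled input. A minimal sketch (the paper adds its relation-driven term on top of this):

```python
import torch
import torch.nn.functional as F

def consistency_loss(student_logits, teacher_logits):
    """MSE between softmax outputs on two perturbed views of one input."""
    return F.mse_loss(F.softmax(student_logits, dim=-1),
                      F.softmax(teacher_logits, dim=-1).detach())

@torch.no_grad()
def ema_update(teacher, student, decay=0.99):
    # Self-ensembling: the teacher is an exponential moving average (EMA)
    # of the student's weights over training.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1 - decay)

# Toy usage with a linear "classifier" and Gaussian input noise.
student, teacher = torch.nn.Linear(8, 3), torch.nn.Linear(8, 3)
x = torch.randn(5, 8)
loss = consistency_loss(student(x + 0.1 * torch.randn_like(x)), teacher(x))
loss.backward()
ema_update(teacher, student)
```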
arXiv Detail & Related papers (2020-05-15T06:57:54Z)