CoRTEx: Contrastive Learning for Representing Terms via Explanations with Applications on Constructing Biomedical Knowledge Graphs
- URL: http://arxiv.org/abs/2312.08036v1
- Date: Wed, 13 Dec 2023 10:29:34 GMT
- Title: CoRTEx: Contrastive Learning for Representing Terms via Explanations with Applications on Constructing Biomedical Knowledge Graphs
- Authors: Huaiyuan Ying, Zhengyun Zhao, Yang Zhao, Sihang Zeng, Sheng Yu
- Abstract summary: Previous contrastive learning models trained with Unified Medical Language System (UMLS) synonyms struggle to cluster difficult terms.
We leverage the world knowledge of Large Language Models (LLMs) to enhance term representation and significantly improve term clustering.
- Score: 9.328980260014216
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Objective: Biomedical Knowledge Graphs play a pivotal role in various
biomedical research domains. Concurrently, term clustering emerges as a crucial
step in constructing these knowledge graphs, aiming to identify synonymous
terms. Due to a lack of knowledge, previous contrastive learning models trained
with Unified Medical Language System (UMLS) synonyms struggle to cluster
difficult terms and do not generalize well beyond UMLS terms. In this work, we
leverage the world knowledge from Large Language Models (LLMs) and propose
Contrastive Learning for Representing Terms via Explanations (CoRTEx) to
enhance term representation and significantly improve term clustering.
Materials and Methods: The model training involves generating explanations for
a cleaned subset of UMLS terms using ChatGPT. We employ contrastive learning,
considering term and explanation embeddings simultaneously, and progressively
introduce hard negative samples. Additionally, a ChatGPT-assisted BIRCH
algorithm is designed for efficient clustering of a new ontology. Results: We
established a clustering test set and a hard negative test set, where our model
consistently achieves the highest F1 score. With CoRTEx embeddings and the
modified BIRCH algorithm, we grouped 35,580,932 terms from the Biomedical
Informatics Ontology System (BIOS) into 22,104,559 clusters with O(N) queries
to ChatGPT. Case studies highlight the model's efficacy in handling challenging
samples, aided by information from explanations. Conclusion: By aligning terms
to their explanations, CoRTEx demonstrates superior accuracy over benchmark
models and robustness beyond its training set, and it is suitable for
clustering terms for large-scale biomedical ontologies.
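As a minimal illustration of the training objective described above, the sketch below assumes a bi-encoder that embeds each term and its ChatGPT-generated explanation into a shared space and contrasts them with in-batch negatives; the function name and temperature value are illustrative choices, not taken from the paper.

```python
# Minimal sketch of a term-explanation contrastive (InfoNCE-style) loss.
# Assumes term_emb[i] and expl_emb[i] come from the same UMLS concept.
import torch
import torch.nn.functional as F

def term_explanation_loss(term_emb: torch.Tensor,
                          expl_emb: torch.Tensor,
                          temperature: float = 0.05) -> torch.Tensor:
    """Pull each term toward its own explanation and push it away from the
    other explanations in the batch. The hard negatives the abstract mentions
    could be appended as extra rows of expl_emb."""
    term_emb = F.normalize(term_emb, dim=-1)
    expl_emb = F.normalize(expl_emb, dim=-1)
    logits = term_emb @ expl_emb.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(term_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)  # positives lie on the diagonal
```

For the clustering stage, a rough sketch of how precomputed CoRTEx embeddings might feed a BIRCH pass is shown below; the `verify_with_llm` hook named in the comment is hypothetical and stands in for the paper's O(N) ChatGPT queries.

```python
# Rough sketch: BIRCH clustering over precomputed term embeddings.
import numpy as np
from sklearn.cluster import Birch

def cluster_terms(embeddings: np.ndarray, terms: list[str],
                  threshold: float = 0.5) -> dict[int, list[str]]:
    birch = Birch(threshold=threshold, n_clusters=None)
    labels = birch.fit_predict(embeddings)
    clusters: dict[int, list[str]] = {}
    for term, label in zip(terms, labels):
        # At most one LLM check per term keeps the query budget O(N),
        # e.g. verify_with_llm(term, clusters.get(int(label), [])).
        clusters.setdefault(int(label), []).append(term)
    return clusters
```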
Related papers
- Document-level Clinical Entity and Relation Extraction via Knowledge Base-Guided Generation [0.869967783513041]
We leverage the Unified Medical Language System (UMLS) knowledge base to accurately identify medical concepts.
Our framework selects UMLS concepts relevant to the text and combines them with prompts to guide language models in extracting entities.
arXiv Detail & Related papers (2024-07-13T22:45:46Z)
- Towards Ontology-Enhanced Representation Learning for Large Language Models [0.18416014644193066]
We propose a novel approach to improve an embedding-Large Language Model (embedding-LLM) of interest by infusing knowledge from a reference ontology.
The linguistic information (i.e. concept synonyms and descriptions) and structural information (i.e. is-a relations) are utilized to compile a comprehensive set of concept definitions.
These concept definitions are then employed to fine-tune the target embedding-LLM using a contrastive learning framework.
arXiv Detail & Related papers (2024-05-30T23:01:10Z)
- Contextualization Distillation from Large Language Model for Knowledge Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-and-play approach compatible with both discriminative and generative KGC frameworks.
Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments.
Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
arXiv Detail & Related papers (2024-01-28T08:56:49Z)
- Interpretable Solutions for Breast Cancer Diagnosis with Grammatical Evolution and Data Augmentation [0.15705429611931054]
We show how a new synthetic data generation technique, STEM, can be used to produce data to train models produced by Grammatical Evolution (GE).
We test our technique on the Digital Database for Screening Mammography (DDSM) and the Wisconsin Breast Cancer (WBC) datasets.
We demonstrate that the GE-derived models present the best AUC while still maintaining interpretable solutions.
arXiv Detail & Related papers (2024-01-25T15:45:28Z)
- Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
- HiPrompt: Few-Shot Biomedical Knowledge Fusion via Hierarchy-Oriented Prompting [33.1455954220194]
HiPrompt is a supervision-efficient knowledge fusion framework.
It elicits the few-shot reasoning ability of large language models through hierarchy-oriented prompts.
Empirical results on the collected KG-Hi-BKF benchmark datasets demonstrate the effectiveness of HiPrompt.
arXiv Detail & Related papers (2023-04-12T16:54:26Z)
- RandomSCM: interpretable ensembles of sparse classifiers tailored for omics data [59.4141628321618]
We propose an ensemble learning algorithm based on conjunctions or disjunctions of decision rules.
The interpretability of the models makes them useful for biomarker discovery and pattern discovery in high-dimensional data.
arXiv Detail & Related papers (2022-08-11T13:55:04Z)
- Automatic Biomedical Term Clustering by Learning Fine-grained Term Representations [0.8154691566915505]
State-of-the-art term embeddings leverage pretrained language models to encode terms and use synonyms and relation knowledge from knowledge graphs to guide contrastive learning.
These embeddings are not sensitive to minor textual differences, which leads to failures in biomedical term clustering.
To alleviate this problem, we adjust the sampling strategy in pretraining term embeddings by providing dynamic hard positive and negative samples.
We name our proposed method CODER++, and it has been applied to clustering biomedical concepts in the newly released Biomedical Knowledge Graph named BIOS.
arXiv Detail & Related papers (2022-04-01T12:30:58Z)
- Neighborhood Contrastive Learning for Novel Class Discovery [79.14767688903028]
We build a new framework, named Neighborhood Contrastive Learning, to learn discriminative representations that are important to clustering performance.
We experimentally demonstrate that these two ingredients significantly contribute to clustering performance and lead our model to outperform state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-06-20T17:34:55Z)
- A Meta-embedding-based Ensemble Approach for ICD Coding Prediction [64.42386426730695]
International Classification of Diseases (ICD) codes are the de facto standard used globally for clinical coding.
These codes enable healthcare providers to claim reimbursement and facilitate efficient storage and retrieval of diagnostic information.
Our proposed approach enhances the performance of neural models by effectively training word vectors using routine medical data as well as external knowledge from scientific articles.
arXiv Detail & Related papers (2021-02-26T17:49:58Z)
- A Teacher-Student Framework for Semi-supervised Medical Image Segmentation From Mixed Supervision [62.4773770041279]
We develop a semi-supervised learning framework in a teacher-student fashion for organ and lesion segmentation.
We show our model is robust to the quality of bounding boxes and achieves performance comparable to fully-supervised learning methods.
arXiv Detail & Related papers (2020-10-23T07:58:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.