BioLORD-2023: Semantic Textual Representations Fusing LLM and Clinical
Knowledge Graph Insights
- URL: http://arxiv.org/abs/2311.16075v1
- Date: Mon, 27 Nov 2023 18:46:17 GMT
- Title: BioLORD-2023: Semantic Textual Representations Fusing LLM and Clinical
Knowledge Graph Insights
- Authors: François Remy, Kris Demuynck and Thomas Demeester
- Abstract summary: We propose a new state-of-the-art approach for obtaining high-fidelity representations of biomedical concepts and sentences.
We demonstrate consistent and substantial performance improvements over the previous state of the art.
Besides our new state-of-the-art biomedical model for English, we also distill and release a multilingual model compatible with 50+ languages.
- Score: 15.952942443163474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we investigate the potential of Large Language Models to
complement biomedical knowledge graphs in the training of semantic models for
the biomedical and clinical domains. Drawing on the wealth of the UMLS
knowledge graph and harnessing cutting-edge Large Language Models, we propose a
new state-of-the-art approach for obtaining high-fidelity representations of
biomedical concepts and sentences, consisting of three steps: an improved
contrastive learning phase, a novel self-distillation phase, and a weight
averaging phase. Through rigorous evaluations via the extensive BioLORD testing
suite and diverse downstream tasks, we demonstrate consistent and substantial
performance improvements over the previous state of the art (e.g. +2pts on
MedSTS, +2.5pts on MedNLI-S, +6.1pts on EHR-Rel-B). Besides our new
state-of-the-art biomedical model for English, we also distill and release a
multilingual model compatible with 50+ languages and finetuned on 7 European
languages. Many clinical pipelines can benefit from our latest models. Our new
multilingual model enables a range of languages to benefit from our
advancements in biomedical semantic representation learning, opening a new
avenue for bioinformatics researchers around the world. As a result, we hope to
see BioLORD-2023 becoming a precious tool for future biomedical applications.
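A minimal usage sketch for a clinical pipeline built on the released English model, via the sentence-transformers library. The checkpoint identifier below is an assumption (the abstract does not name one); check the authors' release page for the exact ID.

```python
# Hedged sketch: scoring biomedical concept similarity with a BioLORD-style
# sentence encoder. "FremyCompany/BioLORD-2023" is an assumed checkpoint name,
# not confirmed by the abstract; substitute the identifier actually released.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("FremyCompany/BioLORD-2023")  # assumed checkpoint ID

concepts = [
    "myocardial infarction",
    "heart attack",
    "fracture of the femur",
]

# Encode the concept names into dense vectors, then compare all pairs
# with cosine similarity.
embeddings = model.encode(concepts, convert_to_tensor=True)
scores = util.cos_sim(embeddings, embeddings)
print(scores)  # the synonym pair should score markedly higher than the rest
```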
Related papers
- Leveraging Biomolecule and Natural Language through Multi-Modal
Learning: A Survey [75.47055414002571]
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology.
We provide an analysis of recent advancements achieved through cross modeling of biomolecules and natural language.
arXiv Detail & Related papers (2024-03-03T14:59:47Z)
- Diversifying Knowledge Enhancement of Biomedical Language Models using
Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low.
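As a sketch of the mechanism, here is a bottleneck adapter of the kind injected into a frozen PLM; the hidden and bottleneck sizes are illustrative assumptions, not the paper's configuration.

```python
# Minimal bottleneck adapter sketch (PyTorch). Only these few parameters are
# trained while the backbone PLM stays frozen, which is what keeps the
# computing requirements low. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, apply a non-linearity, up-project, add a residual."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = BottleneckAdapter()
x = torch.randn(2, 16, 768)  # (batch, sequence, hidden)
print(adapter(x).shape)      # torch.Size([2, 16, 768])
```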
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
- Exploring the In-context Learning Ability of Large Language Model for
Biomedical Concept Linking [4.8882241537236455]
This research investigates a method that exploits the in-context learning capabilities of large language models for biomedical concept linking.
The proposed approach adopts a two-stage retrieve-and-rank framework.
It achieved an accuracy of 90.% in BC5CDR disease entity normalization and 94.7% in chemical entity normalization.
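A generic sketch of such a two-stage retrieve-and-rank linker; `embed` and `llm_choose` are hypothetical stand-ins for an embedding model and an in-context-prompted LLM, not the paper's actual components.

```python
# Hypothetical retrieve-and-rank sketch: stage 1 retrieves top-k ontology
# candidates by cosine similarity; stage 2 asks an LLM to pick the best one.
import numpy as np

def link_concept(mention, ontology, embed, llm_choose, k=10):
    """Return the ontology ID whose name best matches `mention`.

    `ontology` maps concept IDs to names; `embed` and `llm_choose` are
    caller-supplied stand-ins (hypothetical, for illustration only).
    """
    ids, names = list(ontology), list(ontology.values())

    # Stage 1: dense retrieval of the k nearest candidate names.
    vecs = np.stack([embed(n) for n in names])
    q = embed(mention)
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:k]

    # Stage 2: in-context reranking by the LLM, which returns the index
    # of its chosen candidate.
    prompt = f"Mention: {mention}\nCandidates:\n" + "\n".join(
        f"{i}. {names[j]}" for i, j in enumerate(top))
    return ids[top[llm_choose(prompt)]]
```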
arXiv Detail & Related papers (2023-07-03T16:19:50Z)
- LLaVA-Med: Training a Large Language-and-Vision Assistant for
Biomedicine in One Day [85.19963303642427]
We propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images.
The model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics.
This enables us to train a Large Language and Vision Assistant for BioMedicine in less than 15 hours (with eight A100s).
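A heavily simplified staging sketch of that two-phase recipe, with toy stand-in modules (nothing here is LLaVA-Med's actual code):

```python
# Illustrative two-stage curriculum: train only a projection layer first
# (vocabulary alignment on figure-caption pairs), then unfreeze the language
# model for conversational tuning. All modules are toy stand-ins.
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

vision_encoder = nn.Linear(1024, 768)   # stand-in for the image encoder
projector      = nn.Linear(768, 4096)   # bridges image features into LM space
language_model = nn.Linear(4096, 4096)  # stand-in for the LLM

# Stage 1: align biomedical vocabulary -- only the projector learns.
set_trainable(vision_encoder, False)
set_trainable(language_model, False)
set_trainable(projector, True)
# ...optimize `projector` on figure-caption pairs...

# Stage 2: open-ended conversational tuning -- unfreeze the LM as well.
set_trainable(language_model, True)
# ...optimize `projector` + `language_model` on instruction data...
```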
arXiv Detail & Related papers (2023-06-01T16:50:07Z)
- BiomedCLIP: a multimodal biomedical foundation model pretrained from
fifteen million scientific image-text pairs [48.376109878173956]
We present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets.
PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles.
Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing.
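The symmetric contrastive objective behind CLIP-style pretraining, as a minimal self-contained sketch (the real model adds biomedical domain adaptations on top):

```python
# CLIP-style symmetric contrastive loss: the i-th image should match the
# i-th text in the batch and nothing else. Encoders are omitted; random
# embeddings stand in for their outputs.
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0))         # diagonal = matched pairs
    # Cross-entropy in both directions: image->text and text->image.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2

print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)))
```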
arXiv Detail & Related papers (2023-03-02T02:20:04Z)
- BioGPT: Generative Pre-trained Transformer for Biomedical Text
Generation and Mining [140.61707108174247]
We propose BioGPT, a domain-specific generative Transformer language model pre-trained on large scale biomedical literature.
We get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks respectively, and 78.2% accuracy on PubMedQA.
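BioGPT checkpoints are distributed through Hugging Face Transformers; a minimal generation example follows (the `microsoft/biogpt` identifier is believed current, but verify it; the tokenizer additionally requires the sacremoses package).

```python
# Minimal BioGPT generation sketch via Hugging Face Transformers.
# Assumes the "microsoft/biogpt" checkpoint; requires `sacremoses`.
from transformers import BioGptForCausalLM, BioGptTokenizer

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

inputs = tokenizer("COVID-19 is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```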
arXiv Detail & Related papers (2022-10-19T07:17:39Z)
- Pre-trained Language Models in Biomedical Domain: A Systematic Survey [33.572502204216256]
Pre-trained language models (PLMs) have been the de facto paradigm for most natural language processing (NLP) tasks.
This paper summarizes the recent progress of pre-trained language models in the biomedical domain and their applications in biomedical downstream tasks.
arXiv Detail & Related papers (2021-10-11T05:30:30Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; state-of-the-art neural models still perform far below the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
- Conceptualized Representation Learning for Chinese Biomedical Text
Mining [14.77516568767045]
We investigate how the recently introduced pre-trained language model BERT can be adapted for Chinese biomedical corpora.
Chinese biomedical text is particularly difficult to process due to its complex structure and the variety of phrase combinations.
arXiv Detail & Related papers (2020-08-25T04:41:35Z)