ClinLinker: Medical Entity Linking of Clinical Concept Mentions in Spanish
- URL: http://arxiv.org/abs/2404.06367v1
- Date: Tue, 9 Apr 2024 15:04:27 GMT
- Title: ClinLinker: Medical Entity Linking of Clinical Concept Mentions in Spanish
- Authors: Fernando Gallego, Guillermo López-García, Luis Gasco-Sánchez, Martin Krallinger, Francisco J. Veredas,
- Abstract summary: This study presents ClinLinker, a novel approach employing a two-phase pipeline for medical entity linking.
It is based on a SapBERT-based bi-encoder and subsequent re-ranking with a cross-encoder, trained by following a contrastive-learning strategy to be tailored to medical concepts in Spanish.
- Score: 39.81302995670643
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advances in natural language processing techniques, such as named entity recognition and normalization to widely used standardized terminologies like UMLS or SNOMED-CT, along with the digitalization of electronic health records, have significantly advanced clinical text analysis. This study presents ClinLinker, a novel approach employing a two-phase pipeline for medical entity linking that leverages the potential of in-domain adapted language models for biomedical text mining: initial candidate retrieval using a SapBERT-based bi-encoder and subsequent re-ranking with a cross-encoder, trained by following a contrastive-learning strategy to be tailored to medical concepts in Spanish. This methodology, focused initially on content in Spanish, substantially outperforming multilingual language models designed for the same purpose. This is true even for complex scenarios involving heterogeneous medical terminologies and being trained on a subset of the original data. Our results, evaluated using top-k accuracy at 25 and other top-k metrics, demonstrate our approach's performance on two distinct clinical entity linking Gold Standard corpora, DisTEMIST (diseases) and MedProcNER (clinical procedures), outperforming previous benchmarks by 40 points in DisTEMIST and 43 points in MedProcNER, both normalized to SNOMED-CT codes. These findings highlight our approach's ability to address language-specific nuances and set a new benchmark in entity linking, offering a potent tool for enhancing the utility of digital medical records. The resulting system is of practical value, both for large scale automatic generation of structured data derived from clinical records, as well as for exhaustive extraction and harmonization of predefined clinical variables of interest.
Related papers
- SNOBERT: A Benchmark for clinical notes entity linking in the SNOMED CT clinical terminology [43.89160296332471]
We propose a method for linking text spans in clinical notes to specific concepts in the SNOMED CT using BERT-based models.
The method consists of two stages: candidate selection and candidate matching. The models were trained on one of the largest publicly available dataset of labeled clinical notes.
arXiv Detail & Related papers (2024-05-25T08:00:44Z) - Efficient Biomedical Entity Linking: Clinical Text Standardization with Low-Resource Techniques [0.0]
Multiple terms can refer to the same core concepts which can be referred as a clinical entity.
Ontologies like the Unified Medical Language System (UMLS) are developed and maintained to store millions of clinical entities.
We propose a suite of context-based and context-less remention techniques for performing the entity disambiguation.
arXiv Detail & Related papers (2024-05-24T01:14:33Z) - sEHR-CE: Language modelling of structured EHR data for efficient and
generalizable patient cohort expansion [0.0]
sEHR-CE is a novel framework based on transformers to enable integrated phenotyping and analyses of heterogeneous clinical datasets.
We validate our approach using primary and secondary care data from the UK Biobank, a large-scale research study.
arXiv Detail & Related papers (2022-11-30T16:00:43Z) - Developing a general-purpose clinical language inference model from a
large corpus of clinical notes [0.30586855806896046]
We trained a Bidomain Decoder from Transformers (BERT) model using a diverse, deidentified corpus of 75 million deidentified clinical notes authored at the University of California, San Francisco (UCSF)
Our model performs at par with the best publicly available biomedical language models of comparable sizes on the public benchmark tasks, and is significantly better than these models in a within-system evaluation on the two tasks using UCSF data.
arXiv Detail & Related papers (2022-10-12T20:08:45Z) - Cross-Lingual Knowledge Transfer for Clinical Phenotyping [55.92262310716537]
We investigate cross-lingual knowledge transfer strategies to execute this task for clinics that do not use the English language.
We evaluate these strategies for a Greek and a Spanish clinic leveraging clinical notes from different clinical domains.
Our results show that using multilingual data overall improves clinical phenotyping models and can compensate for data sparseness.
arXiv Detail & Related papers (2022-08-03T08:33:21Z) - Biomedical and Clinical Language Models for Spanish: On the Benefits of
Domain-Specific Pretraining in a Mid-Resource Scenario [0.05277024349608833]
This work presents biomedical and clinical language models for Spanish by experimenting with different pretraining choices.
In the absence of enough clinical data to train a model from scratch, we applied mixed-domain pretraining and cross-domain transfer approaches to generate a performant bio-clinical model.
arXiv Detail & Related papers (2021-09-08T12:12:07Z) - Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z) - A Meta-embedding-based Ensemble Approach for ICD Coding Prediction [64.42386426730695]
International Classification of Diseases (ICD) are the de facto codes used globally for clinical coding.
These codes enable healthcare providers to claim reimbursement and facilitate efficient storage and retrieval of diagnostic information.
Our proposed approach enhances the performance of neural models by effectively training word vectors using routine medical data as well as external knowledge from scientific articles.
arXiv Detail & Related papers (2021-02-26T17:49:58Z) - Benchmarking Automated Clinical Language Simplification: Dataset,
Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z) - UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual
Embeddings Using the Unified Medical Language System Metathesaurus [73.86656026386038]
We introduce UmlsBERT, a contextual embedding model that integrates domain knowledge during the pre-training process.
By applying these two strategies, UmlsBERT can encode clinical domain knowledge into word embeddings and outperform existing domain-specific models.
arXiv Detail & Related papers (2020-10-20T15:56:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.