Multi-domain Clinical Natural Language Processing with MedCAT: the
Medical Concept Annotation Toolkit
- URL: http://arxiv.org/abs/2010.01165v2
- Date: Thu, 25 Mar 2021 13:21:50 GMT
- Title: Multi-domain Clinical Natural Language Processing with MedCAT: the
Medical Concept Annotation Toolkit
- Authors: Zeljko Kraljevic, Thomas Searle, Anthony Shek, Lukasz Roguski, Kawsar
Noor, Daniel Bean, Aurelie Mascio, Leilei Zhu, Amos A Folarin, Angus Roberts,
Rebecca Bendayan, Mark P Richardson, Robert Stewart, Anoop D Shah, Wai Keong
Wong, Zina Ibrahim, James T Teo, Richard JB Dobson
- Abstract summary: We present the open-source Medical Concept Annotation Toolkit (MedCAT).
It provides a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT.
We show improved performance in extracting UMLS concepts from open datasets.
Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over 8.8B words from 17M clinical records.
- Score: 5.49956798378633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Electronic health records (EHR) contain large volumes of unstructured text,
requiring the application of Information Extraction (IE) technologies to enable
clinical analysis. We present the open-source Medical Concept Annotation
Toolkit (MedCAT) that provides: a) a novel self-supervised machine learning
algorithm for extracting concepts using any concept vocabulary including
UMLS/SNOMED-CT; b) a feature-rich annotation interface for customising and
training IE models; and c) integrations to the broader CogStack ecosystem for
vendor-agnostic health system deployment. We show improved performance in
extracting UMLS concepts from open datasets (F1:0.448-0.738 vs 0.429-0.650).
Further real-world validation demonstrates SNOMED-CT extraction at 3 large
London hospitals with self-supervised training over ~8.8B words from ~17M
clinical records and further fine-tuning with ~6K clinician annotated examples.
We show strong transferability (F1 > 0.94) between hospitals, datasets, and
concept types indicating cross-domain EHR-agnostic utility for accelerated
clinical and research use cases.
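To make the annotation workflow concrete, here is a minimal usage sketch assuming the open-source medcat Python package and a pre-built model pack; the file path and example sentence are placeholders, and output field names may differ between MedCAT versions.

```python
from medcat.cat import CAT

# Load a pre-built model pack (vocabulary, concept database and any
# fine-tuned NER+linking weights bundled together).
cat = CAT.load_model_pack("models/medcat_model_pack.zip")  # placeholder path

text = "Patient presents with shortness of breath and type 2 diabetes."

# Annotate free text: each detected entity carries its concept identifier
# (CUI) and character span in the source document.
result = cat.get_entities(text)
for ent in result["entities"].values():
    print(ent["pretty_name"], ent["cui"], ent["start"], ent["end"])
```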
Related papers
- Document-level Clinical Entity and Relation Extraction via Knowledge Base-Guided Generation [0.869967783513041]
We leverage the Unified Medical Language System (UMLS) knowledge base to accurately identify medical concepts.
Our framework selects UMLS concepts relevant to the text and combines them with prompts to guide language models in extracting entities.
arXiv Detail & Related papers (2024-07-13T22:45:46Z) - GAMedX: Generative AI-based Medical Entity Data Extractor Using Large Language Models [1.123722364748134]
This paper introduces GAMedX, a Named Entity Recognition (NER) approach utilizing Large Language Models (LLMs).
The methodology integrates open-source LLMs for NER, using chained prompts and Pydantic schemas for structured output to handle the complexities of specialized medical jargon (a minimal schema sketch is given after this list).
The findings reveal a significant ROUGE F1 score on one of the evaluation datasets, with an accuracy of 98%.
arXiv Detail & Related papers (2024-05-31T02:53:22Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - ClinLinker: Medical Entity Linking of Clinical Concept Mentions in Spanish [39.81302995670643]
This study presents ClinLinker, a novel approach employing a two-phase pipeline for medical entity linking.
It is based on a SapBERT-based bi-encoder and subsequent re-ranking with a cross-encoder, trained by following a contrastive-learning strategy to be tailored to medical concepts in Spanish.
arXiv Detail & Related papers (2024-04-09T15:04:27Z) - Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
Inference with LLaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z) - Learnable Weight Initialization for Volumetric Medical Image Segmentation [66.3030435676252]
We propose a learnable weight-based hybrid medical image segmentation approach.
Our approach is easy to integrate into any hybrid model and requires no external training data.
Experiments on multi-organ and lung cancer segmentation tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2023-06-15T17:55:05Z) - Development and validation of a natural language processing algorithm to
pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system that merges the results of a deep learning model with manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z) - A Multimodal Transformer: Fusing Clinical Notes with Structured EHR Data
for Interpretable In-Hospital Mortality Prediction [8.625186194860696]
We provide a novel multimodal transformer to fuse clinical notes and structured EHR data for better prediction of in-hospital mortality.
To improve interpretability, we propose an integrated gradients (IG) method to select important words in clinical notes (an illustrative IG sketch is given after this list).
We also investigate the significance of domain adaptive pretraining and task adaptive fine-tuning on the Clinical BERT.
arXiv Detail & Related papers (2022-08-09T03:49:52Z) - Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation [116.87918100031153]
We propose a Cross-modal clinical Graph Transformer (CGT) for ophthalmic report generation (ORG).
CGT injects clinical relation triples into the visual features as prior knowledge to drive the decoding procedure.
Experiments on the large-scale FFA-IR benchmark demonstrate that the proposed CGT is able to outperform previous benchmark methods.
arXiv Detail & Related papers (2022-06-04T13:16:30Z) - Unifying Heterogenous Electronic Health Records Systems via Text-Based
Code Embedding [7.3394352452936085]
We introduce Description-based Embedding (DescEmb), a code-agnostic representation learning framework for EHR.
DescEmb takes advantage of the flexibility of neural language understanding models to embed clinical events using their textual descriptions rather than directly mapping each event to a dedicated embedding.
arXiv Detail & Related papers (2021-11-12T20:27:55Z) - A Meta-embedding-based Ensemble Approach for ICD Coding Prediction [64.42386426730695]
International Classification of Diseases (ICD) codes are the de facto standard used globally for clinical coding.
These codes enable healthcare providers to claim reimbursement and facilitate efficient storage and retrieval of diagnostic information.
Our proposed approach enhances the performance of neural models by effectively training word vectors using routine medical data as well as external knowledge from scientific articles.
arXiv Detail & Related papers (2021-02-26T17:49:58Z)
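As a concrete illustration of the schema-constrained extraction described in the GAMedX entry above, the sketch below pairs a Pydantic output schema with a single prompt step; the prompt wording, entity labels and the llm_complete() callable are illustrative assumptions, not the paper's actual code.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class MedicalEntity(BaseModel):
    text: str   # surface form found in the note
    label: str  # e.g. "medication", "diagnosis", "procedure"


class ExtractionResult(BaseModel):
    entities: List[MedicalEntity]


PROMPT_TEMPLATE = (
    "Extract all medications, diagnoses and procedures from the clinical note "
    'below. Answer ONLY with JSON of the form {{"entities": [{{"text": "...", '
    '"label": "..."}}]}}.\n\nNote:\n{note}'
)


def extract_entities(note: str, llm_complete) -> ExtractionResult:
    # One prompt step: ask the LLM for JSON and validate it against the schema.
    raw = llm_complete(PROMPT_TEMPLATE.format(note=note))
    try:
        return ExtractionResult.model_validate_json(raw)
    except ValidationError:
        # In a chained-prompt setup, a repair or re-ask step would go here.
        return ExtractionResult(entities=[])
```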
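Similarly, the integrated gradients idea from the multimodal transformer entry can be illustrated with the Captum library; the toy prediction head and random pooled-note embedding below are stand-ins for the paper's actual mortality model, not its implementation.

```python
import torch
from captum.attr import IntegratedGradients

torch.manual_seed(0)

# Toy stand-in for a mortality-prediction head over a pooled note embedding.
model = torch.nn.Sequential(torch.nn.Linear(16, 1), torch.nn.Sigmoid())

inputs = torch.randn(1, 16)            # pooled clinical-note embedding
baseline = torch.zeros_like(inputs)    # "no information" reference point

ig = IntegratedGradients(model)
attributions = ig.attribute(inputs, baselines=baseline)

# Larger absolute attributions mark the embedding dimensions (and, mapped back
# through the tokenizer, the words) that most influenced the prediction.
print(attributions.abs().squeeze())
```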