Multi-domain Clinical Natural Language Processing with MedCAT: the
Medical Concept Annotation Toolkit
- URL: http://arxiv.org/abs/2010.01165v2
- Date: Thu, 25 Mar 2021 13:21:50 GMT
- Title: Multi-domain Clinical Natural Language Processing with MedCAT: the
Medical Concept Annotation Toolkit
- Authors: Zeljko Kraljevic, Thomas Searle, Anthony Shek, Lukasz Roguski, Kawsar
Noor, Daniel Bean, Aurelie Mascio, Leilei Zhu, Amos A Folarin, Angus Roberts,
Rebecca Bendayan, Mark P Richardson, Robert Stewart, Anoop D Shah, Wai Keong
Wong, Zina Ibrahim, James T Teo, Richard JB Dobson
- Abstract summary: We present the open-source Medical Concept Annotation Toolkit (MedCAT).
It provides a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT.
We show improved performance in extracting UMLS concepts from open datasets.
Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over 8.8B words from 17M clinical records.
- Score: 5.49956798378633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Electronic health records (EHR) contain large volumes of unstructured text,
requiring the application of Information Extraction (IE) technologies to enable
clinical analysis. We present the open-source Medical Concept Annotation
Toolkit (MedCAT) that provides: a) a novel self-supervised machine learning
algorithm for extracting concepts using any concept vocabulary including
UMLS/SNOMED-CT; b) a feature-rich annotation interface for customising and
training IE models; and c) integrations to the broader CogStack ecosystem for
vendor-agnostic health system deployment. We show improved performance in
extracting UMLS concepts from open datasets (F1:0.448-0.738 vs 0.429-0.650).
Further real-world validation demonstrates SNOMED-CT extraction at 3 large
London hospitals with self-supervised training over ~8.8B words from ~17M
clinical records and further fine-tuning with ~6K clinician annotated examples.
We show strong transferability (F1 > 0.94) between hospitals, datasets, and
concept types indicating cross-domain EHR-agnostic utility for accelerated
clinical and research use cases.
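To make the annotation workflow concrete, here is a minimal usage sketch assuming the open-source medcat Python package and a pre-built model pack; the file path and example sentence are placeholders, and output field names may differ between MedCAT versions.

```python
from medcat.cat import CAT

# Load a pre-built model pack (vocabulary, concept database and any
# fine-tuned NER+linking weights bundled together).
cat = CAT.load_model_pack("models/medcat_model_pack.zip")  # placeholder path

text = "Patient presents with shortness of breath and type 2 diabetes."

# Annotate free text: each detected entity carries its concept identifier
# (CUI) and character span in the source document.
result = cat.get_entities(text)
for ent in result["entities"].values():
    print(ent["pretty_name"], ent["cui"], ent["start"], ent["end"])
```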
Related papers
- Document-level Clinical Entity and Relation Extraction via Knowledge Base-Guided Generation [0.869967783513041]
We leverage the Unified Medical Language System (UMLS) knowledge base to accurately identify medical concepts.
Our framework selects UMLS concepts relevant to the text and combines them with prompts to guide language models in extracting entities.
arXiv Detail & Related papers (2024-07-13T22:45:46Z) - GAMedX: Generative AI-based Medical Entity Data Extractor Using Large Language Models [1.123722364748134]
This paper introduces GAMedX, a Named Entity Recognition (NER) approach utilizing Large Language Models (LLMs).
The methodology integrates open-source LLMs for NER, using chained prompts and Pydantic schemas for structured output to handle the complexities of specialized medical jargon (a minimal schema sketch is given after this list).
The findings reveal a significant ROUGE F1 score on one of the evaluation datasets, with an accuracy of 98%.
arXiv Detail & Related papers (2024-05-31T02:53:22Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - ClinLinker: Medical Entity Linking of Clinical Concept Mentions in Spanish [39.81302995670643]
This study presents ClinLinker, a novel approach employing a two-phase pipeline for medical entity linking.
It is based on a SapBERT-based bi-encoder and subsequent re-ranking with a cross-encoder, trained by following a contrastive-learning strategy to be tailored to medical concepts in Spanish.
arXiv Detail & Related papers (2024-04-09T15:04:27Z) - Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
Inference with LLaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z) - Learnable Weight Initialization for Volumetric Medical Image Segmentation [66.3030435676252]
We propose a learnable weight-based hybrid medical image segmentation approach.
Our approach is easy to integrate into any hybrid model and requires no external training data.
Experiments on multi-organ and lung cancer segmentation tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2023-06-15T17:55:05Z) - Development and validation of a natural language processing algorithm to
pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system that merges the results of a deep learning model with manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z) - A Multimodal Transformer: Fusing Clinical Notes with Structured EHR Data
for Interpretable In-Hospital Mortality Prediction [8.625186194860696]
We provide a novel multimodal transformer to fuse clinical notes and structured EHR data for better prediction of in-hospital mortality.
To improve interpretability, we propose an integrated gradients (IG) method to select important words in clinical notes (an illustrative IG sketch is given after this list).
We also investigate the significance of domain adaptive pretraining and task adaptive fine-tuning on the Clinical BERT.
arXiv Detail & Related papers (2022-08-09T03:49:52Z) - Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation [116.87918100031153]
We propose a Cross-modal clinical Graph Transformer (CGT) for ophthalmic report generation (ORG).
CGT injects clinical relation triples into the visual features as prior knowledge to drive the decoding procedure.
Experiments on the large-scale FFA-IR benchmark demonstrate that the proposed CGT is able to outperform previous benchmark methods.
arXiv Detail & Related papers (2022-06-04T13:16:30Z) - Unifying Heterogenous Electronic Health Records Systems via Text-Based
Code Embedding [7.3394352452936085]
We introduce Description-based Embedding (DescEmb), a code-agnostic representation learning framework for EHR.
DescEmb takes advantage of the flexibility of neural language understanding models to embed clinical events using their textual descriptions rather than directly mapping each event to a dedicated embedding.
arXiv Detail & Related papers (2021-11-12T20:27:55Z) - A Meta-embedding-based Ensemble Approach for ICD Coding Prediction [64.42386426730695]
International Classification of Diseases (ICD) codes are the de facto standard used globally for clinical coding.
These codes enable healthcare providers to claim reimbursement and facilitate efficient storage and retrieval of diagnostic information.
Our proposed approach enhances the performance of neural models by effectively training word vectors using routine medical data as well as external knowledge from scientific articles.
arXiv Detail & Related papers (2021-02-26T17:49:58Z)
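As a concrete illustration of the schema-constrained extraction described in the GAMedX entry above, the sketch below pairs a Pydantic output schema with a single prompt step; the prompt wording, entity labels and the llm_complete() callable are illustrative assumptions, not the paper's actual code.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class MedicalEntity(BaseModel):
    text: str   # surface form found in the note
    label: str  # e.g. "medication", "diagnosis", "procedure"


class ExtractionResult(BaseModel):
    entities: List[MedicalEntity]


PROMPT_TEMPLATE = (
    "Extract all medications, diagnoses and procedures from the clinical note "
    'below. Answer ONLY with JSON of the form {{"entities": [{{"text": "...", '
    '"label": "..."}}]}}.\n\nNote:\n{note}'
)


def extract_entities(note: str, llm_complete) -> ExtractionResult:
    # One prompt step: ask the LLM for JSON and validate it against the schema.
    raw = llm_complete(PROMPT_TEMPLATE.format(note=note))
    try:
        return ExtractionResult.model_validate_json(raw)
    except ValidationError:
        # In a chained-prompt setup, a repair or re-ask step would go here.
        return ExtractionResult(entities=[])
```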
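Similarly, the integrated gradients idea from the multimodal transformer entry can be illustrated with the Captum library; the toy prediction head and random pooled-note embedding below are stand-ins for the paper's actual mortality model, not its implementation.

```python
import torch
from captum.attr import IntegratedGradients

torch.manual_seed(0)

# Toy stand-in for a mortality-prediction head over a pooled note embedding.
model = torch.nn.Sequential(torch.nn.Linear(16, 1), torch.nn.Sigmoid())

inputs = torch.randn(1, 16)            # pooled clinical-note embedding
baseline = torch.zeros_like(inputs)    # "no information" reference point

ig = IntegratedGradients(model)
attributions = ig.attribute(inputs, baselines=baseline)

# Larger absolute attributions mark the embedding dimensions (and, mapped back
# through the tokenizer, the words) that most influenced the prediction.
print(attributions.abs().squeeze())
```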