MedJEx: A Medical Jargon Extraction Model with Wiki's Hyperlink Span and
Contextualized Masked Language Model Score
- URL: http://arxiv.org/abs/2210.05875v1
- Date: Wed, 12 Oct 2022 02:27:32 GMT
- Authors: Sunjae Kwon, Zonghai Yao, Harmon S. Jordan, David A. Levy, Brian
Corner, Hong Yu
- Abstract summary: We present a novel and publicly available dataset with expert-annotated medical jargon terms from 18K+ EHR note sentences.
We then introduce a novel medical jargon extraction ($MedJEx$) model which has been shown to outperform existing state-of-the-art NLP models.
- Score: 6.208127495081593
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes a new natural language processing (NLP) application for
identifying medical jargon terms potentially difficult for patients to
comprehend from electronic health record (EHR) notes. We first present a novel
and publicly available dataset with expert-annotated medical jargon terms from
18K+ EHR note sentences ($MedJ$). Then, we introduce a novel medical jargon
extraction ($MedJEx$) model which has been shown to outperform existing
state-of-the-art NLP models. First, MedJEx improved the overall performance
when it was trained on an auxiliary Wikipedia hyperlink span dataset, where
hyperlink spans provide additional Wikipedia articles to explain the spans (or
terms), and then fine-tuned on the annotated MedJ data. Second, we found that
a contextualized masked language model score was beneficial for detecting
domain-specific unfamiliar jargon terms. Moreover, our results show that
training on the auxiliary Wikipedia hyperlink span datasets improved six out of
eight biomedical named entity recognition benchmark datasets. Both MedJ and
MedJEx are publicly available.
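The contextualized masked language model (MLM) score described above can be illustrated with a small, framework-agnostic sketch: mask each token of a candidate term within its sentence, ask a masked LM for the probability of the original token given the masked context, and average the log-probabilities (a pseudo-log-likelihood). Tokens the model finds surprising in context score low, which serves as a proxy for unfamiliar jargon. Note this is an assumed illustration of the general technique, not the paper's implementation; `toy_token_prob` is a hypothetical stand-in for a real masked LM such as BERT.

```python
import math
from typing import Callable, List

def mlm_score(tokens: List[str], span: range,
              token_prob: Callable[[List[str], int, str], float]) -> float:
    """Average log-probability of the tokens in `span`, each scored with
    that position replaced by [MASK] (pseudo-log-likelihood). Lower
    scores suggest the span is unexpected in context, i.e. jargon-like."""
    logps = []
    for i in span:
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        p = token_prob(masked, i, tokens[i])  # P(original token | masked context)
        logps.append(math.log(max(p, 1e-12)))  # guard against log(0)
    return sum(logps) / len(logps)

# Hypothetical stand-in for a real masked LM: frequent everyday words
# get high probability, rare (jargon-like) words get low probability.
COMMON = {"the": 0.9, "patient": 0.6, "has": 0.8, "a": 0.9,
          "history": 0.5, "of": 0.9}

def toy_token_prob(masked_tokens: List[str], i: int, original: str) -> float:
    return COMMON.get(original, 0.001)

sentence = "the patient has a history of dyspnea".split()
common_score = mlm_score(sentence, range(0, 1), toy_token_prob)  # "the"
jargon_score = mlm_score(sentence, range(6, 7), toy_token_prob)  # "dyspnea"
```

In a real system the toy scorer would be replaced by a pretrained masked LM, and a low `mlm_score` relative to surrounding text would flag a span as a candidate jargon term.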
Related papers
- MediTOD: An English Dialogue Dataset for Medical History Taking with Comprehensive Annotations [23.437292621092823]
We introduce MediTOD, a dataset of doctor-patient dialogues in English for the medical history-taking task.
We devise a questionnaire-based labeling scheme tailored to the medical domain.
Then, medical professionals create the dataset with high-quality comprehensive annotations.
arXiv Detail & Related papers (2024-10-18T06:38:22Z)
- GAMedX: Generative AI-based Medical Entity Data Extractor Using Large Language Models [1.123722364748134]
This paper introduces GAMedX, a Named Entity Recognition (NER) approach utilizing Large Language Models (LLMs).
The methodology integrates open-source LLMs for NER, utilizing chained prompts and Pydantic schemas for structured output to navigate the complexities of specialized medical jargon.
The findings reveal a significant ROUGE F1 score on one of the evaluation datasets, with an accuracy of 98%.
arXiv Detail & Related papers (2024-05-31T02:53:22Z)
- MedReadMe: A Systematic Study for Fine-grained Sentence Readability in Medical Domain [9.91205505704257]
We present a systematic study on readability measurements in the medical domain at both sentence-level and span-level.
We introduce a new dataset MedReadMe, which consists of manually annotated readability ratings and fine-grained complex span annotation for 4,520 sentences.
We find that adding a single feature, capturing the number of jargon spans, into existing readability formulas can significantly improve their correlation with human judgments.
arXiv Detail & Related papers (2024-05-03T14:48:20Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping computing-power requirements low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
- XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model.
It can analyze and answer open-ended questions about chest radiographs.
We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z)
- ViMQ: A Vietnamese Medical Question Dataset for Healthcare Dialogue System Development [1.4315915057750197]
We publish a Vietnamese dataset of medical questions from patients with sentence-level and entity-level annotations.
We propose a simple self-supervised training strategy with span-noise modelling that improves performance.
arXiv Detail & Related papers (2023-04-27T17:59:53Z)
- Towards more patient friendly clinical notes through language models and ontologies [57.51898902864543]
We present a novel approach to automated medical text simplification based on word-level simplification and language modelling.
We use a new dataset of pairs of publicly available medical sentences and versions of them simplified by clinicians.
Our method, based on a language model trained on medical forum data, generates simpler sentences while preserving both grammar and the original meaning.
arXiv Detail & Related papers (2021-12-23T16:11:19Z)
- MedDG: An Entity-Centric Medical Consultation Dataset for Entity-Aware Medical Dialogue Generation [86.38736781043109]
We build and release MedDG, a large-scale, high-quality medical dialogue dataset covering 12 types of common gastrointestinal diseases.
We propose two medical dialogue tasks based on the MedDG dataset: next-entity prediction and doctor-response generation.
Experimental results show that pre-trained language models and other baselines struggle on both tasks, performing poorly on our dataset.
arXiv Detail & Related papers (2020-10-15T03:34:33Z)
- COMETA: A Corpus for Medical Entity Linking in the Social Media [27.13349965075764]
We introduce a new corpus called COMETA, consisting of 20k English biomedical entity mentions from Reddit, expert-annotated with links to SNOMED CT.
Our corpus satisfies a combination of desirable properties, from scale and coverage to diversity and quality.
We shed light on the ability of entity linking systems to perform complex inference on entities and concepts under two challenging evaluation scenarios.
arXiv Detail & Related papers (2020-10-07T09:16:45Z)
- Learning Contextualized Document Representations for Healthcare Answer Retrieval [68.02029435111193]
Contextual Discourse Vectors (CDV) is a distributed document representation for efficient answer retrieval from long documents.
Our model leverages a dual encoder architecture with hierarchical LSTM layers and multi-task training to encode the position of clinical entities and aspects alongside the document discourse.
We show that our generalized model significantly outperforms several state-of-the-art baselines for healthcare passage ranking.
arXiv Detail & Related papers (2020-02-03T15:47:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.