Can Embeddings Adequately Represent Medical Terminology? New Large-Scale
Medical Term Similarity Datasets Have the Answer!
- URL: http://arxiv.org/abs/2003.11082v1
- Date: Tue, 24 Mar 2020 19:18:34 GMT
- Title: Can Embeddings Adequately Represent Medical Terminology? New Large-Scale
Medical Term Similarity Datasets Have the Answer!
- Authors: Claudia Schulz, Damir Juric
- Abstract summary: A large number of embeddings trained on medical data have emerged, but it remains unclear how well they represent medical terminology.
We present multiple automatically created large-scale medical term similarity datasets.
We evaluate state-of-the-art word and contextual embeddings on our new datasets, comparing multiple vector similarity metrics and word vector aggregation techniques.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A large number of embeddings trained on medical data have emerged, but it
remains unclear how well they represent medical terminology, in particular
whether the close relationship of semantically similar medical terms is encoded
in these embeddings. To date, only small datasets for testing medical term
similarity have been available, which does not allow conclusions to be drawn
about how well embeddings generalise to the enormous number of medical terms used by
doctors. We present multiple automatically created large-scale medical term
similarity datasets and confirm their high quality in an annotation study with
doctors. We evaluate state-of-the-art word and contextual embeddings on our new
datasets, comparing multiple vector similarity metrics and word vector
aggregation techniques. Our results show that current embeddings are limited in
their ability to adequately encode medical terms. The novel datasets thus form
a challenging new benchmark for the development of medical embeddings able to
accurately represent the whole medical terminology.
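The evaluation described in the abstract hinges on two ingredients: aggregating per-word vectors of a multi-word medical term into a single term vector, and scoring term pairs with a vector similarity metric. A minimal sketch of both, assuming toy NumPy vectors (the function names and the aggregation choices shown are illustrative, not the authors' code):

```python
import numpy as np

def aggregate(word_vectors, method="mean"):
    """Combine per-word vectors of a multi-word term into one term vector."""
    vecs = np.asarray(word_vectors, dtype=float)
    if method == "mean":
        return vecs.mean(axis=0)
    if method == "sum":
        return vecs.sum(axis=0)
    raise ValueError(f"unknown aggregation method: {method}")

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Mean pooling plus cosine similarity is the usual baseline combination; the paper compares multiple alternatives for both steps.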
Related papers
- Semantic Textual Similarity Assessment in Chest X-ray Reports Using a
Domain-Specific Cosine-Based Metric [1.7802147489386628]
We introduce a novel approach designed specifically for assessing the semantic similarity between generated medical reports and the ground truth.
We validate our approach, demonstrating its effectiveness in assessing domain-specific semantic similarity within medical contexts.
arXiv Detail & Related papers (2024-02-19T07:48:25Z)
- Leveraging knowledge graphs to update scientific word embeddings using
latent semantic imputation [0.0]
We show how LSI can impute embeddings for domain-specific words from up-to-date knowledge graphs.
We show that LSI can produce reliable embedding vectors for rare and OOV terms in the biomedical domain.
arXiv Detail & Related papers (2022-10-27T12:15:26Z)
- Towards more patient friendly clinical notes through language models and
ontologies [57.51898902864543]
We present a novel approach to automated medical text simplification based on word simplification and language modelling.
We use a new dataset of pairs of publicly available medical sentences and versions of them simplified by clinicians.
Our method based on a language model trained on medical forum data generates simpler sentences while preserving both grammar and the original meaning.
arXiv Detail & Related papers (2021-12-23T16:11:19Z)
- Clinical Named Entity Recognition using Contextualized Token
Representations [49.036805795072645]
This paper introduces the technique of contextualized word embedding to better capture the semantic meaning of each word based on its context.
We pre-train two deep contextualized language models, Clinical Embeddings from Language Model (C-ELMo) and Clinical Contextual String Embeddings (C-Flair).
Experiments show that our models achieve dramatic improvements over both static word embeddings and domain-generic language models.
arXiv Detail & Related papers (2021-06-23T18:12:58Z)
- Unifying Relational Sentence Generation and Retrieval for Medical Image
Report Composition [142.42920413017163]
Current methods often generate only the most common sentences for an individual case, due to dataset bias.
We propose a novel framework that unifies template retrieval and sentence generation to handle both common and rare abnormalities.
arXiv Detail & Related papers (2021-01-09T04:33:27Z)
- CODER: Knowledge infused cross-lingual medical term embedding for term
normalization [7.516391006265378]
CODER is designed for medical term normalization, providing close vector representations for different terms that denote the same concept.
We train CODER via contrastive learning on a medical knowledge graph (KG) named the Unified Medical Language System.
We evaluate CODER in zero-shot term normalization, semantic similarity, and relation classification benchmarks.
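Contrastive learning on a knowledge graph, as described above, can be sketched with an InfoNCE-style loss: a term embedding is pulled toward a synonym drawn from the KG (the positive) and pushed away from unrelated terms (the negatives). The function below and its toy vectors are illustrative assumptions, not the CODER implementation:

```python
import numpy as np

def contrastive_loss(anchor, candidates, positive_index=0, temperature=0.07):
    """InfoNCE-style loss over one anchor term and a batch of candidate terms.

    The candidate at `positive_index` is a synonym of the anchor; all others
    are treated as negatives. Lower loss means the anchor sits closer to its
    synonym than to the negatives.
    """
    a = np.asarray(anchor, dtype=float)
    c = np.asarray(candidates, dtype=float)
    a = a / np.linalg.norm(a)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    logits = c @ a / temperature                         # scaled cosine similarities
    log_probs = logits - np.log(np.exp(logits).sum())    # log-softmax over candidates
    return float(-log_probs[positive_index])
```

When the anchor already points toward its synonym the loss is near zero; when it points toward a negative the loss grows, which is what drives synonymous terms together during training.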
arXiv Detail & Related papers (2020-11-05T16:16:49Z)
- Cross-Modal Information Maximization for Medical Imaging: CMIM [62.28852442561818]
In hospitals, data are siloed to specific information systems that make the same information available under different modalities.
This offers unique opportunities to obtain and use at train-time those multiple views of the same information that might not always be available at test-time.
We propose an innovative framework that makes the most of available data by learning good representations of a multi-modal input that are resilient to modality dropping at test-time.
arXiv Detail & Related papers (2020-10-20T20:05:35Z)
- Domain Generalization for Medical Imaging Classification with
Linear-Dependency Regularization [59.5104563755095]
We introduce a simple but effective approach to improve the generalization capability of deep neural networks in the field of medical imaging classification.
Motivated by the observation that the domain variability of the medical images is to some extent compact, we propose to learn a representative feature space through variational encoding.
arXiv Detail & Related papers (2020-09-27T12:30:30Z)
- Semi-supervised Medical Image Classification with Relation-driven
Self-ensembling Model [71.80319052891817]
We present a relation-driven semi-supervised framework for medical image classification.
It exploits the unlabeled data by encouraging the prediction consistency of given input under perturbations.
Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
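The prediction-consistency idea above can be sketched as a simple unsupervised regulariser: two perturbed views of the same unlabeled image should receive similar predicted class probabilities. This is a minimal sketch under that assumption, not the paper's relation-driven model:

```python
import numpy as np

def consistency_loss(probs_a, probs_b):
    """Mean squared difference between two predictions (class probabilities)
    for the same unlabeled input under two different perturbations."""
    pa = np.asarray(probs_a, dtype=float)
    pb = np.asarray(probs_b, dtype=float)
    return float(np.mean((pa - pb) ** 2))
```

In a semi-supervised loop this term is added to the supervised loss, so unlabeled images contribute a training signal whenever the model's predictions disagree across perturbations.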
arXiv Detail & Related papers (2020-05-15T06:57:54Z)
- Evaluating Sparse Interpretable Word Embeddings for Biomedical Domain [1.3526604206343171]
Interpretability is a key means of justification, which is integral to biomedical applications.
We present an inclusive study on interpretability of word embeddings in the medical domain, focusing on the role of sparse methods.
Our experiments show that sparse word vectors are far more interpretable while preserving the downstream-task performance of their original dense vectors.
arXiv Detail & Related papers (2020-05-11T13:56:58Z)
- Seeing The Whole Patient: Using Multi-Label Medical Text Classification
Techniques to Enhance Predictions of Medical Codes [2.158285012874102]
We present results for multi-label medical text classification problems with 18, 50, and 155 labels.
For imbalanced data, we show that labels which occur infrequently benefit the most from additional features incorporated in the embeddings.
High dimensional embeddings from this research are made available for public use.
arXiv Detail & Related papers (2020-03-29T02:19:30Z)
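Multi-label medical code prediction, as in the paper above, typically gives each label an independent sigmoid output and assigns every label whose probability clears a threshold. A minimal sketch of that decision rule (the threshold and the toy logits are illustrative assumptions):

```python
import numpy as np

def predict_labels(logits, threshold=0.5):
    """Multi-label prediction: apply an independent sigmoid per label and
    keep every label whose probability is at or above the threshold."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return (probs >= threshold).astype(int)
```

Unlike single-label softmax classification, this allows any number of medical codes (from zero to all) to be assigned to one document.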