COMETA: A Corpus for Medical Entity Linking in the Social Media
- URL: http://arxiv.org/abs/2010.03295v2
- Date: Thu, 8 Oct 2020 12:01:55 GMT
- Title: COMETA: A Corpus for Medical Entity Linking in the Social Media
- Authors: Marco Basaldella, Fangyu Liu, Ehsan Shareghi and Nigel Collier
- Abstract summary: We introduce a new corpus called COMETA, consisting of 20k English biomedical entity mentions from Reddit, expert-annotated with links to SNOMED CT.
Our corpus satisfies a combination of desirable properties, from scale and coverage to diversity and quality.
Through benchmark experiments on 20 EL baselines, we shed light on the ability of these systems to perform complex inference on entities and concepts under two challenging evaluation scenarios.
- Score: 27.13349965075764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Whilst there has been growing progress in Entity Linking (EL) for general
language, existing datasets fail to address the complex nature of health
terminology in layman's language. Meanwhile, there is a growing need for
applications that can understand the public's voice in the health domain. To
address this, we introduce a new corpus called COMETA, consisting of 20k English
biomedical entity mentions from Reddit, expert-annotated with links to SNOMED
CT, a widely-used medical knowledge graph. Our corpus satisfies a combination
of desirable properties, from scale and coverage to diversity and quality, that
to the best of our knowledge has not been met by any of the existing resources
in the field. Through benchmark experiments on 20 EL baselines from string- to
neural-based models we shed light on the ability of these systems to perform
complex inference on entities and concepts under two challenging evaluation
scenarios. Our experimental results on COMETA illustrate that no golden bullet
exists and even the best mainstream techniques still have a significant
performance gap to fill, while the best solution relies on combining different
views of data.
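The gap between string- and neural-based baselines can be illustrated with a minimal dictionary-based linker of the kind such benchmarks start from. The surface forms and concept IDs below are invented stand-ins, not real SNOMED CT content, and the similarity threshold is an arbitrary choice.

```python
from difflib import SequenceMatcher

# Toy dictionary mapping lay surface forms to SNOMED CT-style concept IDs.
# Entries and IDs are illustrative only.
SNOMED_TOY = {
    "myocardial infarction": "22298006",
    "heart attack": "22298006",
    "hypertension": "38341003",
    "high blood pressure": "38341003",
}

def link_mention(mention: str, kb: dict, threshold: float = 0.6):
    """Link a mention to the best-matching concept by string similarity."""
    best_id, best_score = None, 0.0
    for surface, cid in kb.items():
        score = SequenceMatcher(None, mention.lower(), surface).ratio()
        if score > best_score:
            best_id, best_score = cid, score
    # Abstain when no surface form is similar enough.
    return best_id if best_score >= threshold else None

print(link_mention("heart attacks", SNOMED_TOY))  # → 22298006
print(link_mention("xyz", SNOMED_TOY))            # → None
```

A linker like this handles near-exact matches but fails on lay paraphrases ("ticker trouble"), which is exactly the inference gap the neural baselines in the benchmark try to close.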
Related papers
- Efficient Biomedical Entity Linking: Clinical Text Standardization with Low-Resource Techniques [0.0]
Multiple terms can refer to the same core concept, which is referred to as a clinical entity.
Ontologies like the Unified Medical Language System (UMLS) are developed and maintained to store millions of clinical entities.
We propose a suite of context-based and context-less remention techniques for performing entity disambiguation.
arXiv Detail & Related papers (2024-05-24T01:14:33Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- ClinLinker: Medical Entity Linking of Clinical Concept Mentions in Spanish [39.81302995670643]
This study presents ClinLinker, a novel approach employing a two-phase pipeline for medical entity linking.
It is based on a SapBERT-based bi-encoder and subsequent re-ranking with a cross-encoder, trained by following a contrastive-learning strategy to be tailored to medical concepts in Spanish.
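The two-phase shape of such a pipeline can be sketched with toy stand-ins: a "bi-encoder" that embeds mention and candidates independently for fast retrieval, then a "cross-encoder" that scores the shortlisted pairs jointly. The scoring functions below are illustrative substitutes for the learned SapBERT and cross-encoder models, not the paper's actual method.

```python
import math

def encode(text: str) -> dict:
    """Toy bi-encoder: a bag-of-characters vector standing in for an embedding."""
    vec = {}
    for ch in text.lower():
        vec[ch] = vec.get(ch, 0) + 1
    return vec

def cosine(u: dict, v: dict) -> float:
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cross_score(mention: str, candidate: str) -> float:
    """Toy cross-encoder: Jaccard token overlap of the joint pair."""
    m, c = set(mention.lower().split()), set(candidate.lower().split())
    return len(m & c) / len(m | c) if m | c else 0.0

def link(mention: str, concepts: list, k: int = 3) -> str:
    # Phase 1: cheap bi-encoder retrieval of a top-k candidate shortlist.
    q = encode(mention)
    shortlist = sorted(concepts, key=lambda c: cosine(q, encode(c)), reverse=True)[:k]
    # Phase 2: expensive cross-encoder re-ranking over the shortlist only.
    return max(shortlist, key=lambda c: cross_score(mention, c))

concepts = ["diabetes mellitus", "arterial hypertension", "asthma"]
print(link("hypertension arterial", concepts))  # → arterial hypertension
```

The design point is the split itself: retrieval must be independent per concept so embeddings can be precomputed, while the re-ranker may attend to the pair jointly because it only sees k candidates.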
arXiv Detail & Related papers (2024-04-09T15:04:27Z)
- Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
- Exploring the In-context Learning Ability of Large Language Model for Biomedical Concept Linking [4.8882241537236455]
This research investigates a method that exploits the in-context learning capabilities of large models for biomedical concept linking.
The proposed approach adopts a two-stage retrieve-and-rank framework.
It achieved an accuracy of 90.% in BC5CDR disease entity normalization and 94.7% in chemical entity normalization.
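The retrieve-and-rank framing for in-context linking can be sketched as: retrieve candidate concepts with a cheap heuristic, then hand an LLM a ranking prompt over the shortlist. The retrieval heuristic, prompt template, and knowledge-base entries below are illustrative assumptions, not the paper's actual setup.

```python
def retrieve_candidates(mention: str, kb: list, k: int = 3) -> list:
    """Stage 1 (toy): rank concepts by shared-character overlap with the mention."""
    def overlap(concept: str) -> int:
        return len(set(mention.lower()) & set(concept.lower()))
    return sorted(kb, key=overlap, reverse=True)[:k]

def build_ranking_prompt(mention: str, candidates: list) -> str:
    """Stage 2 (toy): format a multiple-choice prompt for an LLM to rank."""
    lines = [f"Mention: {mention}", "Candidate concepts:"]
    lines += [f"{i + 1}. {c}" for i, c in enumerate(candidates)]
    lines.append("Answer with the number of the best-matching concept.")
    return "\n".join(lines)

kb = ["Neoplasm of lung", "Asthma", "Pulmonary edema"]
prompt = build_ranking_prompt("lung tumor", retrieve_candidates("lung tumor", kb))
print(prompt)
```

Casting the ranking stage as a multiple-choice question keeps the LLM's output space small and machine-parseable, which is what makes the in-context approach practical over large ontologies.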
arXiv Detail & Related papers (2023-07-03T16:19:50Z)
- Health Status Prediction with Local-Global Heterogeneous Behavior Graph [69.99431339130105]
Estimation of health status can be achieved with various kinds of data streams continuously collected from wearable sensors.
We propose to model the behavior-related multi-source data streams with a local-global graph.
We conduct experiments on the StudentLife dataset, and extensive results demonstrate the effectiveness of our proposed model.
arXiv Detail & Related papers (2021-03-23T11:10:04Z)
- Named Entity Recognition for Social Media Texts with Semantic Augmentation [70.44281443975554]
Existing approaches for named entity recognition suffer from data sparsity problems when conducted on short and informal texts.
We propose a neural-based approach to NER for social media texts where both local (from running text) and augmented semantics are taken into account.
arXiv Detail & Related papers (2020-10-29T10:06:46Z)
- MedDG: An Entity-Centric Medical Consultation Dataset for Entity-Aware Medical Dialogue Generation [86.38736781043109]
We build and release a large-scale high-quality Medical Dialogue dataset related to 12 types of common Gastrointestinal diseases named MedDG.
We propose two medical dialogue tasks based on the MedDG dataset: one is next entity prediction and the other is doctor response generation.
Experimental results show that pre-trained language models and other baselines struggle on both tasks, with poor performance on our dataset.
arXiv Detail & Related papers (2020-10-15T03:34:33Z)
- On the Combined Use of Extrinsic Semantic Resources for Medical Information Search [0.0]
We develop a framework to highlight and expand head medical concepts in verbose medical queries.
We also build semantically enhanced inverted index documents.
To demonstrate the effectiveness of the proposed approach, we conducted several experiments over the CLEF 2014 dataset.
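The idea of a semantically enhanced inverted index can be sketched as indexing each document under both its own terms and synonyms drawn from an external resource. The synonym map below is a small invented stand-in for such a resource, not the paper's actual expansion method.

```python
from collections import defaultdict

# Invented synonym map standing in for an extrinsic semantic resource.
SYNONYMS = {
    "hypertension": ["high blood pressure"],
    "mi": ["myocardial infarction"],
}

def build_index(docs: dict) -> dict:
    """Inverted index expanded with synonym tokens at indexing time."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
            # Also index the document under each synonym's tokens.
            for syn in SYNONYMS.get(term, []):
                for tok in syn.split():
                    index[tok].add(doc_id)
    return index

docs = {1: "patient with hypertension", 2: "asthma follow-up"}
index = build_index(docs)
print(sorted(index["pressure"]))  # → [1], found via the expanded synonym
```

Expanding at indexing time (rather than at query time) trades index size for cheaper queries; either side of that trade fits the general approach the abstract describes.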
arXiv Detail & Related papers (2020-05-17T14:18:04Z) - Learning Contextualized Document Representations for Healthcare Answer
Retrieval [68.02029435111193]
Contextual Discourse Vectors (CDV) is a distributed document representation for efficient answer retrieval from long documents.
Our model leverages a dual encoder architecture with hierarchical LSTM layers and multi-task training to encode the position of clinical entities and aspects alongside the document discourse.
We show that our generalized model significantly outperforms several state-of-the-art baselines for healthcare passage ranking.
arXiv Detail & Related papers (2020-02-03T15:47:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.