Self-Supervised Detection of Contextual Synonyms in a Multi-Class
Setting: Phenotype Annotation Use Case
- URL: http://arxiv.org/abs/2109.01935v1
- Date: Sat, 4 Sep 2021 21:35:01 GMT
- Title: Self-Supervised Detection of Contextual Synonyms in a Multi-Class
Setting: Phenotype Annotation Use Case
- Authors: Jingqing Zhang, Luis Bolanos, Tong Li, Ashwani Tanwar, Guilherme
Freire, Xian Yang, Julia Ive, Vibhor Gupta, Yike Guo
- Abstract summary: Contextualised word embeddings are a powerful tool for detecting contextual synonyms.
We propose a self-supervised pre-training approach that detects contextual synonyms of concepts by training on data created by shallow matching.
- Score: 11.912581294872767
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contextualised word embeddings are a powerful tool for detecting
contextual synonyms. However, most current state-of-the-art (SOTA) deep
learning concept extraction methods remain supervised and underexploit the
potential of the context. In this paper, we propose a self-supervised
pre-training approach that detects contextual synonyms of concepts by training
on data created by shallow matching. We apply our methodology in the sparse
multi-class setting (over 15,000 concepts) to extract phenotype information
from electronic health records. We further investigate data augmentation
techniques to address the problem of class sparsity. Our approach achieves a
new SOTA for unsupervised phenotype concept annotation on clinical text,
outperforming the previous SOTA in F1 and Recall by up to 4.5 and 4.0 absolute
points, respectively. After fine-tuning with as little as 20% of the labelled
data, we also outperform BioBERT and ClinicalBERT. The extrinsic evaluation on
three ICU benchmarks also shows the benefit of using the phenotypes annotated
by our model as features.
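To make the mechanism concrete, here is a minimal sketch of contextual synonym detection by embedding similarity: a contextualised encoder embeds a mention together with its surrounding context, and the mention is linked to the nearest concept embedding. The encoder (bert-base-uncased), mean pooling, cosine matching, and the two toy HPO concepts are illustrative assumptions; the paper's self-supervised pre-training on shallow-matched data is not reproduced here.

```python
# A minimal sketch, assuming bert-base-uncased, mean pooling, and a toy
# concept list; the paper's actual architecture is not reproduced here.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    """Mean-pool the encoder's last hidden layer into one context vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)            # shape (dim,)

# Toy stand-ins for the >15,000 phenotype concepts the paper targets.
concepts = {
    "HP:0001945 Fever": embed("the patient has an elevated body temperature"),
    "HP:0002098 Respiratory distress": embed("the patient struggles to breathe"),
}

mention = embed("the patient was febrile overnight")
scores = {cid: torch.cosine_similarity(mention, vec, dim=0).item()
          for cid, vec in concepts.items()}
print(max(scores, key=scores.get))  # concept whose context best matches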
Related papers
- Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation.
Our approach can be applied to existing datasets by automatically generating hard negative test captions.
Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z)
- An Energy-based Model for Word-level AutoCompletion in Computer-aided Translation [97.3797716862478]
Word-level AutoCompletion (WLAC) is a rewarding yet challenging task in Computer-aided Translation.
Existing work addresses this task through a classification model based on a neural network that maps the hidden vector of the input context into its corresponding label.
This work proposes an energy-based model for WLAC, which enables the context hidden vector to capture crucial information from the source sentence.
arXiv Detail & Related papers (2024-07-29T15:07:19Z)
- Large-scale investigation of weakly-supervised deep learning for the
fine-grained semantic indexing of biomedical literature [7.171698704686836]
This study proposes a new method for the automated refinement of subject annotations at the level of MeSH concepts.
The new method is evaluated on a large-scale retrospective scenario, based on concepts promoted to descriptors.
arXiv Detail & Related papers (2023-01-23T10:33:22Z)
- PromptCAL: Contrastive Affinity Learning via Auxiliary Prompts for
Generalized Novel Category Discovery [39.03732147384566]
The Generalized Novel Category Discovery (GNCD) setting aims to categorize unlabeled training data coming from both known and novel classes.
We propose a Contrastive Affinity Learning method with auxiliary visual Prompts, dubbed PromptCAL, to address this challenging problem.
Our approach discovers reliable pairwise sample affinities to learn better semantic clustering of both known and novel classes for the class token and visual prompts.
arXiv Detail & Related papers (2022-12-11T20:06:14Z)
- DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for
Open-world Detection [118.36746273425354]
This paper presents a paralleled visual-concept pre-training method for open-world detection by resorting to knowledge enrichment from a designed concept dictionary.
By enriching the concepts with their descriptions, we explicitly build the relationships among various concepts to facilitate the open-domain learning.
The proposed framework demonstrates strong zero-shot detection performance: on the LVIS dataset, for example, DetCLIP-T outperforms GLIP-T by 9.9% mAP and obtains a 13.5% improvement on rare categories.
arXiv Detail & Related papers (2022-09-20T02:01:01Z)
- Better Language Model with Hypernym Class Prediction [101.8517004687825]
Class-based language models (LMs) have long been used to address context sparsity in n-gram LMs.
In this study, we revisit this approach in the context of neural LMs (the classic class-based factorisation is sketched after this list).
arXiv Detail & Related papers (2022-03-21T01:16:44Z)
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
- Inserting Information Bottlenecks for Attribution in Transformers [46.77580577396633]
We apply information bottlenecks to analyze the attribution of each feature for prediction on a black-box model.
We show the effectiveness of our method in terms of attribution and the ability to provide insight into how information flows through layers.
arXiv Detail & Related papers (2020-12-27T00:35:43Z)
- Weakly-Supervised Aspect-Based Sentiment Analysis via Joint
Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn <sentiment, aspect> joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z)
- Multi-domain Clinical Natural Language Processing with MedCAT: the
Medical Concept Annotation Toolkit [5.49956798378633]
We present the open-source Medical Concept Annotation Toolkit (MedCAT).
It provides a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT.
We show improved performance in extracting UMLS concepts from open datasets.
Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over 8.8B words from 17M clinical records.
arXiv Detail & Related papers (2020-10-02T19:01:02Z)
- PhenoTagger: A Hybrid Method for Phenotype Concept Recognition using
Human Phenotype Ontology [6.165755812152143]
PhenoTagger is a hybrid method that combines dictionary-based and machine learning-based methods to recognize concepts in unstructured text.
Our method is validated on two HPO corpora, and the results show that PhenoTagger compares favorably to previous methods (a toy sketch of the hybrid dictionary-plus-model pattern also follows this list).
arXiv Detail & Related papers (2020-09-17T18:00:43Z)
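For the hypernym class prediction entry above, the underlying idea is the classic class-based factorisation: the history is modelled over a small class vocabulary rather than over words, which combats sparsity. In the standard bigram form, with w_t a word and c_t its assigned class (hypernyms play the role of classes in that paper):

```latex
% Classic class-based bigram factorisation (Brown-style clustering):
% each word w_t is assigned a class c_t; the transition is modelled
% over the much smaller class vocabulary, which combats sparsity.
P(w_t \mid w_{t-1}) = P(c_t \mid c_{t-1}) \, P(w_t \mid c_t)
```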
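The MedCAT and PhenoTagger entries both pair a concept vocabulary with a learned model. Below is a minimal sketch of that hybrid pattern, assuming a tiny exact-match dictionary, a hypothetical 0.5 acceptance threshold, and the embed() helper from the first sketch; neither toolkit's actual pipeline is reproduced here.

```python
import torch  # embed() is the helper defined in the first sketch above

# Hypothetical dictionary stage: surface forms mapped to HPO concept IDs.
DICTIONARY = {
    "fever": "HP:0001945",
    "shortness of breath": "HP:0002098",
}
THRESHOLD = 0.5  # assumed cut-off for the contextual stage

def tag(note: str) -> list[tuple[str, str]]:
    """Dictionary proposes candidate spans; a contextual model filters them."""
    note_vec = embed(note)
    hits = []
    for surface, concept_id in DICTIONARY.items():
        if surface in note.lower():                   # cheap shallow match
            score = torch.cosine_similarity(          # contextual check
                note_vec, embed(surface), dim=0).item()
            if score > THRESHOLD:
                hits.append((surface, concept_id))
    return hits

print(tag("Patient reports fever and mild cough."))  # likely [("fever", "HP:0001945")]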