Detecting Idiomatic Multiword Expressions in Clinical Terminology using
Definition-Based Representation Learning
- URL: http://arxiv.org/abs/2305.06801v1
- Date: Thu, 11 May 2023 13:42:58 GMT
- Title: Detecting Idiomatic Multiword Expressions in Clinical Terminology using
Definition-Based Representation Learning
- Authors: François Remy, Alfiya Khabibullina, Thomas Demeester
- Abstract summary: We develop an effective tool for scoring the idiomaticity of biomedical MWEs based on the degree of similarity between the semantic representations of those MWEs and a weighted average of the representation of their constituents.
Our results show that the BioLORD model has a strong ability to identify idiomatic MWEs, not replicated in other models.
- Score: 12.30055843580139
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: This paper shines a light on the potential of definition-based semantic
models for detecting idiomatic and semi-idiomatic multiword expressions (MWEs)
in clinical terminology. Our study focuses on biomedical entities defined in
the UMLS ontology and aims to help prioritize the translation efforts of these
entities. In particular, we develop an effective tool for scoring the
idiomaticity of biomedical MWEs based on the degree of similarity between the
semantic representations of those MWEs and a weighted average of the
representation of their constituents. We achieve this using a biomedical
language model trained to produce similar representations for entity names and
their definitions, called BioLORD. The importance of this definition-based
approach is highlighted by comparing the BioLORD model to two other
state-of-the-art biomedical language models based on Transformer: SapBERT and
CODER. Our results show that the BioLORD model has a strong ability to identify
idiomatic MWEs, not replicated in other models. Our corpus-free idiomaticity
estimation helps ontology translators to focus on more challenging MWEs.
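As a rough illustration of this scoring scheme, the sketch below computes idiomaticity as one minus the cosine similarity between a MWE's embedding and a weighted average of its constituents' embeddings. The checkpoint name and the uniform fallback weighting are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of the corpus-free idiomaticity score: compare a MWE's
# embedding with a weighted average of its constituents' embeddings.
# NOTE: the checkpoint name and uniform weights are assumptions.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("FremyCompany/BioLORD-2023")  # assumed BioLORD checkpoint

def idiomaticity(mwe: str, weights=None) -> float:
    """Return a score in [0, 2]; higher means less compositional."""
    constituents = mwe.split()
    embs = model.encode([mwe] + constituents)        # rows: MWE, w1, ..., wn
    w = np.asarray(weights if weights else [1.0] * len(constituents))
    avg = (w[:, None] * embs[1:]).sum(axis=0) / w.sum()
    cos = embs[0] @ avg / (np.linalg.norm(embs[0]) * np.linalg.norm(avg))
    return float(1.0 - cos)

# Idiomatic clinical MWEs should score higher than compositional ones:
print(idiomaticity("bundle branch block"), idiomaticity("kidney infection"))
```

Ranking UMLS entries by such a score surfaces the semantically opaque MWEs first, which is the prioritization signal the paper proposes for ontology translators.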
Related papers
- Efficient Biomedical Entity Linking: Clinical Text Standardization with Low-Resource Techniques [0.0]
Multiple terms can refer to the same core concept, which is known as a clinical entity.
Ontologies like the Unified Medical Language System (UMLS) are developed and maintained to store millions of clinical entities.
We propose a suite of context-based and context-less remention techniques for entity disambiguation.
arXiv Detail & Related papers (2024-05-24T01:14:33Z)
- Multi-level biomedical NER through multi-granularity embeddings and enhanced labeling [3.8599767910528917]
This paper proposes a hybrid approach that integrates the strengths of multiple models.
BERT provides contextualized word embeddings, a pre-trained multi-channel CNN captures character-level information, and a BiLSTM + CRF then performs sequence labelling, modelling dependencies between the words in the text (a minimal sketch of this kind of hybrid tagger appears after this list).
We evaluate our model on the benchmark i2b2/2010 dataset, achieving an F1-score of 90.11.
arXiv Detail & Related papers (2023-12-24T21:45:36Z)
- UMLS-KGI-BERT: Data-Centric Knowledge Integration in Transformers for Biomedical Entity Recognition [4.865221751784403]
This work contributes a data-centric paradigm for enriching the language representations of biomedical transformer-encoder LMs by extracting text sequences from the UMLS.
Preliminary results from experiments on both extending pre-trained LMs and training from scratch show that this framework improves downstream performance on multiple biomedical and clinical Named Entity Recognition (NER) tasks.
arXiv Detail & Related papers (2023-07-20T18:08:34Z)
- Agentività e telicità in GilBERTo: implicazioni cognitive (Agentivity and Telicity in GilBERTo: Cognitive Implications) [77.71680953280436]
The goal of this study is to investigate whether a Transformer-based neural language model infers lexical semantics.
The semantic properties considered are telicity (also combined with definiteness) and agentivity.
arXiv Detail & Related papers (2023-07-06T10:52:22Z)
- Biomedical Language Models are Robust to Sub-optimal Tokenization [30.175714262031253]
Most modern biomedical language models (LMs) are pre-trained using standard domain-specific tokenizers.
We find that pre-training a biomedical LM using a more accurate biomedical tokenizer does not improve the entity representation quality of a language model.
arXiv Detail & Related papers (2023-06-30T13:35:24Z)
- BioLORD: Learning Ontological Representations from Definitions (for Biomedical Concepts and their Textual Descriptions) [17.981285086380147]
BioLORD is a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts.
Because biomedical names are not always self-explanatory, name-only pre-training sometimes yields non-semantic representations.
BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph.
arXiv Detail & Related papers (2022-10-21T11:43:59Z)
- Clinical Named Entity Recognition using Contextualized Token Representations [49.036805795072645]
This paper introduces the technique of contextualized word embedding to better capture the semantic meaning of each word based on its context.
We pre-train two deep contextualized language models, Clinical Embeddings from Language Model (C-ELMo) and Clinical Contextual String Embeddings (C-Flair).
Experiments show that our models achieve dramatic improvements compared to both static word embeddings and domain-generic language models.
arXiv Detail & Related papers (2021-06-23T18:12:58Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the experiments show that state-of-the-art neural models still perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
- UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus [73.86656026386038]
We introduce UmlsBERT, a contextual embedding model that integrates domain knowledge during the pre-training process.
By applying two knowledge-augmentation strategies during pre-training, UmlsBERT can encode clinical domain knowledge into word embeddings and outperform existing domain-specific models.
arXiv Detail & Related papers (2020-10-20T15:56:31Z)
- A Comparative Study of Lexical Substitution Approaches based on Neural Language Models [117.96628873753123]
We present a large-scale comparative study of popular neural language and masked language models.
We show that the already competitive results achieved by SOTA LMs/MLMs can be further improved if information about the target word is injected properly.
arXiv Detail & Related papers (2020-05-29T18:43:22Z)
- Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
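As referenced in the multi-level biomedical NER entry above, the following is a minimal, illustrative sketch of that style of hybrid tagger: BERT token embeddings fused with multi-channel character-CNN features and fed to a BiLSTM. It is not the authors' exact model; in particular, a CRF decoding layer (e.g. from the pytorch-crf package) would normally replace the plain linear head used here for brevity.

```python
# Sketch only: fuses BERT word embeddings with a multi-channel char-CNN,
# then a BiLSTM; a linear head stands in for the CRF layer for brevity.
import torch
import torch.nn as nn

class HybridNERTagger(nn.Module):
    def __init__(self, bert_dim=768, char_vocab_size=100, char_dim=30,
                 char_channels=50, hidden_dim=256, num_tags=9):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab_size, char_dim, padding_idx=0)
        # Multi-channel CNN: parallel convolutions with different kernel widths.
        self.char_convs = nn.ModuleList(
            [nn.Conv1d(char_dim, char_channels, k, padding=k // 2)
             for k in (2, 3, 4)]
        )
        fused_dim = bert_dim + 3 * char_channels
        self.bilstm = nn.LSTM(fused_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.tag_head = nn.Linear(2 * hidden_dim, num_tags)  # CRF stand-in

    def forward(self, bert_embs, char_ids):
        # bert_embs: (batch, seq, bert_dim), e.g. BERT last hidden states
        # char_ids:  (batch, seq, max_word_len) character indices per token
        b, s, w = char_ids.shape
        chars = self.char_emb(char_ids)                # (b, s, w, char_dim)
        chars = chars.view(b * s, w, -1).transpose(1, 2)
        # Max-pool each channel over character positions, then concatenate.
        pooled = [conv(chars).max(dim=2).values for conv in self.char_convs]
        char_feats = torch.cat(pooled, dim=1).view(b, s, -1)
        fused = torch.cat([bert_embs, char_feats], dim=-1)
        out, _ = self.bilstm(fused)
        return self.tag_head(out)                      # (b, s, num_tags)

# Shape check with dummy inputs:
tagger = HybridNERTagger()
scores = tagger(torch.randn(2, 16, 768), torch.randint(0, 100, (2, 16, 12)))
print(scores.shape)  # torch.Size([2, 16, 9])
```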
This list is automatically generated from the titles and abstracts of the papers in this site.