Atypical lexical abbreviations identification in Russian medical texts
- URL: http://arxiv.org/abs/2206.01987v1
- Date: Sat, 4 Jun 2022 13:16:08 GMT
- Title: Atypical lexical abbreviations identification in Russian medical texts
- Authors: Anna Berdichevskaia (NUST "MISiS")
- Abstract summary: We propose an efficient ML-based algorithm for identifying abbreviations in Russian texts.
The method achieves a ROC AUC score of 0.926 and an F1 score of 0.706, which are competitive with the baselines.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Abbreviation is a method of word formation that constructs a shortened term from the first letters of an initial phrase. Implicit abbreviations frequently cause comprehension difficulties for unprepared readers. In this paper, we propose an efficient ML-based algorithm for identifying abbreviations in Russian texts. The method achieves a ROC AUC score of 0.926 and an F1 score of 0.706, which are competitive in comparison with the baselines. Along with the pipeline, we also establish the first, to our knowledge, Russian dataset relevant to this task.
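The abstract does not describe the model or the feature set, so the following is only a minimal sketch of how abbreviation identification can be framed as binary token classification and scored with the same metrics (ROC AUC, F1). The surface features, toy tokens, and logistic-regression classifier are illustrative assumptions, not the paper's pipeline.

```python
# Hypothetical sketch: abbreviation identification as binary token classification,
# evaluated with ROC AUC and F1 as in the paper. Features and data are toy examples.
import re
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score

RU_VOWELS = set("аеёиоуыэюяАЕЁИОУЫЭЮЯ")

def token_features(token: str) -> list[float]:
    """Simple surface cues; the paper's actual feature set is not shown here."""
    letters = [c for c in token if c.isalpha()]
    vowel_ratio = sum(c in RU_VOWELS for c in letters) / max(len(letters), 1)
    return [
        len(token),                            # abbreviations tend to be short
        vowel_ratio,                           # and vowel-poor
        float(token.isupper()),                # fully capitalized, e.g. "ЭКГ"
        float(token.endswith(".")),            # trailing period, e.g. "отд."
        float(bool(re.search(r"\d", token))),  # contains digits
    ]

# toy labeled tokens: 1 = abbreviation, 0 = regular word (illustrative only)
tokens = ["ОРВИ", "пациент", "в/в", "анализ", "гемоглобин", "ЭКГ", "отд.", "жалобы"]
labels = [1, 0, 1, 0, 0, 1, 1, 0]

X = np.array([token_features(t) for t in tokens])
y = np.array(labels)

clf = LogisticRegression().fit(X, y)
scores = clf.predict_proba(X)[:, 1]

# scored on the same toy data purely to show the metric calls
print("ROC AUC:", roc_auc_score(y, scores))
print("F1:", f1_score(y, (scores > 0.5).astype(int)))
```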
Related papers
- Evaluating and Improving ChatGPT-Based Expansion of Abbreviations [6.900119856872516]
We present the first empirical study on large language models (LLMs)-based abbreviation expansion.
Our evaluation results suggest that ChatGPT is substantially less accurate than the state-of-the-art approach.
In response to the first cause, we investigated the effect of various contexts and found that surrounding source code is the most effective choice.
arXiv Detail & Related papers (2024-10-31T12:20:24Z)
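A hypothetical illustration of the setup described above: expanding an abbreviated identifier with an LLM while supplying the surrounding source code as context. The `build_expansion_prompt` and `ask_llm` names are placeholders, not the paper's tooling.

```python
# Sketch of context-based abbreviation expansion with an LLM.
# `ask_llm` is a placeholder for whatever chat-completion client is used.

def build_expansion_prompt(abbreviation: str, surrounding_code: str) -> str:
    """Pair the abbreviation with its surrounding source code as context."""
    return (
        "Expand the abbreviated identifier below into its full form.\n"
        f"Abbreviation: {abbreviation}\n"
        "Surrounding source code (context):\n"
        f"{surrounding_code}\n"
        "Answer with the expanded identifier only."
    )

def ask_llm(prompt: str) -> str:  # placeholder, not a real API
    raise NotImplementedError("plug in your chat-completion client here")

context = "def calc_avg_tmp(readings):\n    return sum(readings) / len(readings)"
prompt = build_expansion_prompt("calc_avg_tmp", context)
# expansion = ask_llm(prompt)  # e.g. expected: "calculate_average_temperature"
```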
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z) - Homonym Sense Disambiguation in the Georgian Language [49.1574468325115]
This research proposes a novel approach to the Word Sense Disambiguation (WSD) task in the Georgian language.
It is based on supervised fine-tuning of a pre-trained Large Language Model (LLM) on a dataset formed by filtering the Georgian Common Crawls corpus.
arXiv Detail & Related papers (2024-04-24T21:48:43Z) - Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
arXiv Detail & Related papers (2024-03-25T14:46:51Z) - SeqXGPT: Sentence-Level AI-Generated Text Detection [62.3792779440284]
- SeqXGPT: Sentence-Level AI-Generated Text Detection [62.3792779440284]
We introduce a sentence-level detection challenge by synthesizing documents polished with large language models (LLMs).
We then propose Sequence X (Check) GPT (SeqXGPT), a novel method that utilizes log probability lists from white-box LLMs as features for sentence-level AIGT detection.
arXiv Detail & Related papers (2023-10-13T07:18:53Z) - Dealing with Abbreviations in the Slovenian Biographical Lexicon [2.0810096547938164]
- Dealing with Abbreviations in the Slovenian Biographical Lexicon [2.0810096547938164]
Abbreviations present a significant challenge for NLP systems because they cause tokenization and out-of-vocabulary errors.
We propose a new method for addressing the problems caused by a high density of domain-specific abbreviations in a text.
arXiv Detail & Related papers (2022-11-04T13:09:02Z) - Token Classification for Disambiguating Medical Abbreviations [0.0]
Abbreviations are unavoidable yet critical parts of medical text.
Lack of a standardized mapping system makes disambiguating abbreviations a difficult and time-consuming task.
arXiv Detail & Related papers (2022-10-05T18:06:49Z)
- Structured abbreviation expansion in context [12.000998471674649]
We consider the task of reversing ad hoc abbreviations in context to recover normalized, expanded versions of abbreviated messages.
The problem is related to, but distinct from, spelling correction, in that ad hoc abbreviations are intentional and may involve substantial differences from the original words.
arXiv Detail & Related papers (2021-10-04T01:22:43Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
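A small sketch of the merged-token idea: gensim's Phrases fuses frequent collocations into single tokens before fitting LDA. The toy documents and library choice are illustrative; the paper's own merging procedure and clustering-quality metric are not reproduced.

```python
# Merge collocations into single tokens, then train a small LDA model on them.
from gensim.models import Phrases, LdaModel
from gensim.corpora import Dictionary

docs = [
    ["latent", "dirichlet", "allocation", "topic", "model"],
    ["latent", "dirichlet", "allocation", "text", "corpus"],
    ["word", "embedding", "topic", "model", "evaluation"],
    ["word", "embedding", "text", "corpus", "evaluation"],
]

# learn frequent collocations and rewrite documents with merged tokens,
# e.g. ["latent", "dirichlet"] -> "latent_dirichlet"
bigrams = Phrases(docs, min_count=1, threshold=1)
merged_docs = [bigrams[doc] for doc in docs]

dictionary = Dictionary(merged_docs)
corpus = [dictionary.doc2bow(doc) for doc in merged_docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

for topic_id, keys in lda.print_topics(num_words=4):
    print(topic_id, keys)
```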
- Word Sense Disambiguation for 158 Languages using Word Embeddings Only [80.79437083582643]
Disambiguation of word senses in context is easy for humans, but a major challenge for automatic approaches.
We present a method that takes as input a standard pre-trained word embedding model and induces a fully-fledged word sense inventory.
We use this method to induce a collection of sense inventories for 158 languages on the basis of the original pre-trained fastText word embeddings.
arXiv Detail & Related papers (2020-03-14T14:50:04Z)
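As a rough sketch of the general idea above, the snippet below clusters a word's nearest neighbours in a pre-trained fastText embedding to approximate sense groups; the embedding path, neighbour count, and clustering method are assumptions, and the paper's actual graph-based induction algorithm is more involved.

```python
# Approximate sense induction by clustering a word's nearest embedding neighbours.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import AgglomerativeClustering

# placeholder path: a fastText .vec file in word2vec text format
vectors = KeyedVectors.load_word2vec_format("cc.ru.300.vec", limit=200000)

def induce_senses(word: str, topn: int = 30, n_senses: int = 2) -> dict:
    """Group the word's nearest neighbours; each cluster approximates one sense."""
    neighbours = [w for w, _ in vectors.most_similar(word, topn=topn)]
    X = np.stack([vectors[w] for w in neighbours])
    labels = AgglomerativeClustering(n_clusters=n_senses).fit_predict(X)
    senses = {}
    for w, lab in zip(neighbours, labels):
        senses.setdefault(lab, []).append(w)
    return senses

# "лук" is polysemous in Russian (onion vs. bow), so its neighbours should split
for sense_id, words in induce_senses("лук").items():
    print(sense_id, words[:8])
```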
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.