Dealing with Abbreviations in the Slovenian Biographical Lexicon
- URL: http://arxiv.org/abs/2211.02429v1
- Date: Fri, 4 Nov 2022 13:09:02 GMT
- Title: Dealing with Abbreviations in the Slovenian Biographical Lexicon
- Authors: Angel Daza, Antske Fokkens, Tomaž Erjavec
- Abstract summary: Abbreviations present a significant challenge for NLP systems because they cause tokenization and out-of-vocabulary errors.
We propose a new method for addressing the problems caused by a high density of domain-specific abbreviations in a text.
- Score: 2.0810096547938164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Abbreviations present a significant challenge for NLP systems because they
cause tokenization and out-of-vocabulary errors. They can also make the text
less readable, especially in printed reference books, where they are
extensively used. Abbreviations are especially problematic in low-resource
settings, where systems are less robust to begin with. In this paper, we
propose a new method for addressing the problems caused by a high density of
domain-specific abbreviations in a text. We apply this method to the case of a
Slovenian biographical lexicon and evaluate it on a newly developed
gold-standard dataset of 51 Slovenian biographies. Our abbreviation
identification method performs significantly better than commonly used ad-hoc
solutions, especially at identifying unseen abbreviations. We also propose and
present the results of a method for expanding the identified abbreviations in
context.
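
To make concrete what an ad-hoc solution might look like, the following is a minimal sketch of a lexicon-plus-pattern heuristic for spotting abbreviations. It is not the method proposed in the paper; the function name `find_abbreviations`, the tiny lexicon, the regular expression, and the sample sentence are all assumptions chosen for illustration.

```python
import re

# Hypothetical lexicon of known abbreviations (illustrative, not from the paper).
KNOWN_ABBREVIATIONS = {"dr.", "prof.", "roj."}

# Heuristic pattern: a short lowercase token of 1-4 letters ending in a period.
CANDIDATE = re.compile(r"^[a-zčšž]{1,4}\.$")


def find_abbreviations(text, lexicon=KNOWN_ABBREVIATIONS):
    """Return (token, source) pairs for tokens that look like abbreviations."""
    found = []
    tokens = text.split()
    for i, token in enumerate(tokens):
        lowered = token.lower()
        if lowered in lexicon:
            # Seen abbreviation: a plain lexicon lookup is enough.
            found.append((token, "lexicon"))
        elif CANDIDATE.match(lowered) and i < len(tokens) - 1:
            # Unseen candidate: matched by shape only. The final token is
            # skipped because its period is most likely sentence-final.
            found.append((token, "pattern"))
    return found


# Made-up sample sentence, used only to exercise the heuristic.
sample = "Dr. Novak, roj. 1850 v Ljubljani, je bil prof. in ur. časopisa."
for token, source in find_abbreviations(sample):
    print(token, source)
```

A heuristic of this kind recognises unseen abbreviations only by their surface shape, so it would also flag any short ordinary word that happens to precede a period inside the text. That limitation is one plausible reason such shortcuts struggle with unseen abbreviations.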
Related papers
- Evaluating and Improving ChatGPT-Based Expansion of Abbreviations [6.900119856872516]
We present the first empirical study on abbreviation expansion based on large language models (LLMs).
Our evaluation results suggest that ChatGPT is substantially less accurate than the state-of-the-art approach.
In response to the first cause, we investigated the effect of various contexts and found that surrounding source code is the best choice.
arXiv Detail & Related papers (2024-10-31T12:20:24Z)
- DICTDIS: Dictionary Constrained Disambiguation for Improved NMT [50.888881348723295]
We present DictDis, a lexically constrained NMT system that disambiguates between multiple candidate translations derived from dictionaries.
We demonstrate the utility of DictDis via extensive experiments on English-Hindi and English-German sentences in a variety of domains, including regulatory, finance, and engineering.
arXiv Detail & Related papers (2022-10-13T13:04:16Z)
- Token Classification for Disambiguating Medical Abbreviations [0.0]
Abbreviations are unavoidable yet critical parts of the medical text.
Lack of a standardized mapping system makes disambiguating abbreviations a difficult and time-consuming task.
arXiv Detail & Related papers (2022-10-05T18:06:49Z)
- Atypical lexical abbreviations identification in Russian medical texts [0.0]
We propose an efficient ML-based algorithm that identifies abbreviations in Russian texts.
The method achieves a ROC AUC score of 0.926 and an F1 score of 0.706, which are confirmed as competitive.
arXiv Detail & Related papers (2022-06-04T13:16:08Z)
- Short-Term Word-Learning in a Dynamically Changing Environment [63.025297637716534]
We show how to supplement an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.
We demonstrate significant improvements in the detection rate of new words with only a minor increase in false alarms.
arXiv Detail & Related papers (2022-03-29T10:05:39Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show NDD to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
- Structured abbreviation expansion in context [12.000998471674649]
We consider the task of reversing ad hoc abbreviations in context to recover normalized, expanded versions of abbreviated messages.
The problem is related to, but distinct from, spelling correction, in that ad hoc abbreviations are intentional and may involve substantial differences from the original words.
arXiv Detail & Related papers (2021-10-04T01:22:43Z)
- Handling Heavily Abbreviated Manuscripts: HTR engines vs text normalisation approaches [0.0]
Abbreviations present particular challenges to computational approaches such as handwritten text recognition and natural language processing tasks.
We explore different setups to obtain such a normalised text, either directly, by training HTR engines on normalised (i.e., expanded, disabbreviated) text.
Case studies are drawn from the medieval Latin tradition.
arXiv Detail & Related papers (2021-07-07T19:23:22Z)
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging [63.86606855524567]
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
arXiv Detail & Related papers (2021-05-28T19:44:24Z)
- What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation [74.42107665213909]
Acronyms are the short forms of phrases that facilitate conveying lengthy sentences in documents and serve as one of the mainstays of writing.
Due to their importance, identifying acronyms and corresponding phrases (AI) and finding the correct meaning of each acronym (i.e., acronym disambiguation (AD)) are crucial for text understanding.
Despite the recent progress on this task, there are some limitations in the existing datasets which hinder further improvement.
arXiv Detail & Related papers (2020-10-28T00:12:36Z)
- Fine-Grained Image Captioning with Global-Local Discriminative Objective [80.73827423555655]
We propose a novel global-local discriminative objective to facilitate generating fine-grained descriptive captions.
We evaluate the proposed method on the widely used MS-COCO dataset.
arXiv Detail & Related papers (2020-07-21T08:46:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.